Open lbergelson opened 9 years ago
I'll work on this.
@nenewell go for it. it's yours. I don't seem to be able to 'assign' it to you.
@nenewell I took a brief look yesterday and it looks like there are many bams that fail to validate due to errors with mate pairs. I'm not sure how concerned we are about those errors. It's the weirder ones like
htsjdk.samtools.SAMException: SAMFormatException on record 01
at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:308)
at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:199)
at htsjdk.samtools.SamFileValidator.validateSamFileVerbose(SamFileValidator.java:159)
at org.broadinstitute.hellbender.tools.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:129)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:97)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgram.instanceMain(PicardCommandLineProgram.java:51)
at org.broadinstitute.hellbender.Main.instanceMain(Main.java:71)
at org.broadinstitute.hellbender.Main.main(Main.java:86)
Caused by: htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File src/test/resources/org/broadinstitute/hellbender/tools/picard/analysis/CompareMetrics/colqualyldmtcs.bam; Line 1
That seem particularly concerning.
I did this yesterday but didn't look into all the output.
find src/test/resources -name "*.bam" > bams
for x in $(cat bams):
do build/install/hellbender/bin/hellbender ValidateSamFile -I $x >> validateBamOutput 2>&1
done
output is here: https://gist.github.com/lbergelson/002289ce94b0eace183c
@nenewell If that's helpful feel free to make use of it, if it isn't feel free to discard it.
Thanks very much, that is helpful.
From: Louis Bergelson Sent: Wednesday, June 17, 2015 10:42 AM To: broadinstitute/hellbender Cc: nenewell Subject: Re: [hellbender] Validate all existing test BAMs (#569)
@nenewell If that's helpful feel free to make use of it, if it isn't feel free to discard it.
— Reply to this email directly or view it on GitHub.
I'm new at this, so I may be missing something, but it looks to me like there is an issue with the conversion code in GenomicsConverter.makeSAMRecord() in com.google.cloud.genomics.gatk.common. I've been working on this bam file issue, correcting errors in the files used for tests. Many of the errors involve reads with FLAGs that indicate that they are in pairs, but the mate is not extant in the file, causing the error. A way to fix this without deleting the offending reads is to set the FLAG to zero and also modify the RNEXT, PNEXT, and TLEN fields, if necessary, so that the read becomes single (provided that the values of all of these fields are not important for the tests). However, when I do this, I find that tests that write and then read bam files fail, because when the just-written file is read back, SAM validation complains that the mate unmapped FLAG is set for an unpaired read. It turns out that the copy of the file written by the test substitutes the value '8' for '0' as the FLAG for the modified reads.
The relevant code in GenomicsConvertermakeSamRecord() (line 170) is:
flags += ((read.getNextMatePosition() == null || read.getNextMatePosition.getPosition() == null)) ? 8 : 0;
The effect of this line is that all reads which have null mate positions, even those which the FLAG specifies as unpaired, get the mate unmapped FLAG set, causing the validation errors that i'm seeing. The reason the tests have not failed before is apparently that the existing test files do not contain any reads with FLAGs that specify them as unpaired.
A simple fix for this would be to convert the line above to:
flags += ( paired && (read.getNextMatePosition() == null || read.getNextMatePosition.getPosition() == null)) ? 8 : 0;
The redundant parens in the original code suggest that something like this may have been intended,but the google genomics documentation at http://google-genomics.readthedocs.org/en/latest/migrating_tips.html gives the following pseudocode:
flags += read.nextMatePosition.position == null ? 8 : 0 #mate_unmapped
so it looks like the doc supports the existing code.
Should I submit this as an issue?
Thank you.
@nenewell If I understand what is going on correctly I think this may be a bug in the validator. Sam spec says
• If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.
Bit | Description |
---|---|
1 0x1 | template having multiple segments in sequencing |
2 0x2 | each segment properly aligned according to the aligner |
4 0x4 | segment unmapped |
8 0x8 | next segment in the template unmapped |
16 0x10 | SEQ being reverse complemented |
32 0x20 | SEQ of the next segment in the template being reverse complemented |
64 0x40 | the first segment in the template |
128 0x80 | the last segment in the template |
256 0x100 | secondary alignment |
512 0x200 | not passing quality controls |
1024 0x400 | PCR or optical duplicate |
2048 0x800 | supplementary alignment |
So it shouldn't matter what google is setting the mate flags to if the read is a singleton. I know there are some other issues with things like this, I think SAMRecord.equals()
isn't properly agnostic to certain flag combinations and will incorrectly evaluate as not equal even in some cases where the read is logically identical.
Does that make sense or have I missed something?
Your interpretation sounds right, although I wish the language in the SAM spec were clearer - something like "if 0x1 is unset, fields 0x2, 0x8, 0x20, 0x40, and 0x80 have no meaning and are ignored by the tools". SAMRecord.IsValid() returns errors not only for 0x8(mate unmapped)/unpaired read, but also for the other four fields, so all of these errors would need to be removed. IsValid() also triggers an error when an unpaired read has RNEXT set, but the spec. doesn't appear to exclude this error. No error is triggered for the unpaired/PNEXT case.
So I agree that it looks like IsValid() should be changed to align with the spec. But I can also imagine potential pitfalls of leaving the GenomicsConverter code the way it is. The code adds spurious information to the bam that might cause problems with legacy versions of the tools. I don't know what the plans are for the state of the existing tool distribution once gatk4 is released. Is it possible that bam files produced in the cloud could make their way into gatk3 workflows, maybe via the sharing of bams between groups? If this happens, then unpaired reads processed in the cloud, with their mate unpaired flags set by the converter, could trigger validation errors when they are fed to the legacy tools. Is there a downside to altering the converter code as well as modifying the validator (I don't know what altering google packages entails...)
Thanks for looking into this. It's fiddly stuff. You're definitely right about it causing unnecessary complications. It's certain that we'll see comparison between bams processed locally and in the cloud, at the least just in testing our own code. Minor differences like this already have been causing pain for testing. I think altering the google code would be a good thing.
The best thing to do is is to submit a pull request against their repository, with an explanation that the changes are equally correct, but more in line with existing bams. I suspect they will accept that. Would you be up for taking that on? The repo is gatk-tools-java.
I've opened ticket #592 here to fix SamFileValidator
as well.
Yes - I'd be happy to do both of those.Thank you.
Cleaning the bam (and the sam) test files is almost done - it's taken some time because some have been like mini debugging exercises in themselves, followed by the capturing and substitution of the new expected output files for the tests. There are a few issues that still remain, however, and unfortunately I am out of time - I'm headed to an overseas meeting on Monday and will be out for two weeks. I had hoped to finish before the trip, but my BMC Bioinformatics paper came through so I had to spend time on proofs etc.. I will resume ASAP after I get back.
We've discovered a number of bam files being used in tests which are not valid BAM files. We should go through all the checked in BAMs, validate them, and replace broken ones. (Except ones that are intentionally broken for testing.)
(added later by @akiezun) In particular, copied from https://github.com/broadinstitute/hellbender/issues/568, NA12878.chr17_69k_70k.dictFix.bam has a problem: whoever fixes this ticket needs to take care of this input.
htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 129, Read name 809R9ABXX101220:5:6:17918:145992, Mate Alignment start should be 0 because reference name = *.
Here's the corresponding read:
809R9ABXX101220:5:6:17918:145992 97 17 69400 37 67M9S * 71202348 0 ACTCCCCACCTTACCTGACTCCTTCCAGGGTTTGTCGCCTTTCCGGTCCCTGACCCCAGTGGATGGGAGTCTGTCC ?ABDDEEABEECBDBDAB=DEDCDEEBFADABCEAD?EEEDCFE?ABEEE@FCDEEEBF@F?CA4@@########## UQ:i:0