broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

Validate all existing test BAMs #569

Open lbergelson opened 9 years ago

lbergelson commented 9 years ago

We've discovered a number of bam files being used in tests which are not valid BAM files. We should go through all the checked in BAMs, validate them, and replace broken ones. (Except ones that are intentionally broken for testing.)

(added later by @akiezun) In particular, copied from https://github.com/broadinstitute/hellbender/issues/568, NA12878.chr17_69k_70k.dictFix.bam has a problem: whoever fixes this ticket needs to take care of this input.

htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 129, Read name 809R9ABXX101220:5:6:17918:145992, Mate Alignment start should be 0 because reference name = *.

Here's the corresponding read: 809R9ABXX101220:5:6:17918:145992 97 17 69400 37 67M9S * 71202348 0 ACTCCCCACCTTACCTGACTCCTTCCAGGGTTTGTCGCCTTTCCGGTCCCTGACCCCAGTGGATGGGAGTCTGTCC ?ABDDEEABEECBDBDAB=DEDCDEEBFADABCEAD?EEEDCFE?ABEEE@FCDEEEBF@F?CA4@@########## UQ:i:0

nenewell commented 9 years ago

I'll work on this.

akiezun commented 9 years ago

@nenewell go for it. it's yours. I don't seem to be able to 'assign' it to you.

lbergelson commented 9 years ago

@nenewell I took a brief look yesterday and it looks like there are many bams that fail to validate due to errors with mate pairs. I'm not sure how concerned we are about those errors. It's the weirder ones like

htsjdk.samtools.SAMException: SAMFormatException on record 01
    at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:308)
    at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:199)
    at htsjdk.samtools.SamFileValidator.validateSamFileVerbose(SamFileValidator.java:159)
    at org.broadinstitute.hellbender.tools.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:129)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:97)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgram.instanceMain(PicardCommandLineProgram.java:51)
    at org.broadinstitute.hellbender.Main.instanceMain(Main.java:71)
    at org.broadinstitute.hellbender.Main.main(Main.java:86)
Caused by: htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File src/test/resources/org/broadinstitute/hellbender/tools/picard/analysis/CompareMetrics/colqualyldmtcs.bam; Line 1

That seem particularly concerning.

lbergelson commented 9 years ago

I did this yesterday but didn't look into all the output.

find src/test/resources -name "*.bam" > bams

for x in $(cat bams):
do build/install/hellbender/bin/hellbender ValidateSamFile -I $x  >> validateBamOutput 2>&1
done

output is here: https://gist.github.com/lbergelson/002289ce94b0eace183c

lbergelson commented 9 years ago

@nenewell If that's helpful feel free to make use of it, if it isn't feel free to discard it.

nenewell commented 9 years ago

Thanks very much, that is helpful.

From: Louis Bergelson Sent: Wednesday, June 17, 2015 10:42 AM To: broadinstitute/hellbender Cc: nenewell Subject: Re: [hellbender] Validate all existing test BAMs (#569)

@nenewell If that's helpful feel free to make use of it, if it isn't feel free to discard it.

— Reply to this email directly or view it on GitHub.

nenewell commented 9 years ago

I'm new at this, so I may be missing something, but it looks to me like there is an issue with the conversion code in GenomicsConverter.makeSAMRecord() in com.google.cloud.genomics.gatk.common. I've been working on this bam file issue, correcting errors in the files used for tests. Many of the errors involve reads with FLAGs that indicate that they are in pairs, but the mate is not extant in the file, causing the error. A way to fix this without deleting the offending reads is to set the FLAG to zero and also modify the RNEXT, PNEXT, and TLEN fields, if necessary, so that the read becomes single (provided that the values of all of these fields are not important for the tests). However, when I do this, I find that tests that write and then read bam files fail, because when the just-written file is read back, SAM validation complains that the mate unmapped FLAG is set for an unpaired read. It turns out that the copy of the file written by the test substitutes the value '8' for '0' as the FLAG for the modified reads.

The relevant code in GenomicsConvertermakeSamRecord() (line 170) is:

flags += ((read.getNextMatePosition() == null || read.getNextMatePosition.getPosition() == null)) ? 8 : 0;

The effect of this line is that all reads which have null mate positions, even those which the FLAG specifies as unpaired, get the mate unmapped FLAG set, causing the validation errors that i'm seeing. The reason the tests have not failed before is apparently that the existing test files do not contain any reads with FLAGs that specify them as unpaired.

A simple fix for this would be to convert the line above to:

flags += ( paired && (read.getNextMatePosition() == null || read.getNextMatePosition.getPosition() == null)) ? 8 : 0;

The redundant parens in the original code suggest that something like this may have been intended,but the google genomics documentation at http://google-genomics.readthedocs.org/en/latest/migrating_tips.html gives the following pseudocode:

flags += read.nextMatePosition.position == null ? 8 : 0 #mate_unmapped

so it looks like the doc supports the existing code.

Should I submit this as an issue?

Thank you.

lbergelson commented 9 years ago

@nenewell If I understand what is going on correctly I think this may be a bug in the validator. Sam spec says

• If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.

Bit Description
1 0x1 template having multiple segments in sequencing
2 0x2 each segment properly aligned according to the aligner
4 0x4 segment unmapped
8 0x8 next segment in the template unmapped
16 0x10 SEQ being reverse complemented
32 0x20 SEQ of the next segment in the template being reverse complemented
64 0x40 the first segment in the template
128 0x80 the last segment in the template
256 0x100 secondary alignment
512 0x200 not passing quality controls
1024 0x400 PCR or optical duplicate
2048 0x800 supplementary alignment

So it shouldn't matter what google is setting the mate flags to if the read is a singleton. I know there are some other issues with things like this, I think SAMRecord.equals() isn't properly agnostic to certain flag combinations and will incorrectly evaluate as not equal even in some cases where the read is logically identical.

Does that make sense or have I missed something?

nenewell commented 9 years ago

Your interpretation sounds right, although I wish the language in the SAM spec were clearer - something like "if 0x1 is unset, fields 0x2, 0x8, 0x20, 0x40, and 0x80 have no meaning and are ignored by the tools". SAMRecord.IsValid() returns errors not only for 0x8(mate unmapped)/unpaired read, but also for the other four fields, so all of these errors would need to be removed. IsValid() also triggers an error when an unpaired read has RNEXT set, but the spec. doesn't appear to exclude this error. No error is triggered for the unpaired/PNEXT case.

So I agree that it looks like IsValid() should be changed to align with the spec. But I can also imagine potential pitfalls of leaving the GenomicsConverter code the way it is. The code adds spurious information to the bam that might cause problems with legacy versions of the tools. I don't know what the plans are for the state of the existing tool distribution once gatk4 is released. Is it possible that bam files produced in the cloud could make their way into gatk3 workflows, maybe via the sharing of bams between groups? If this happens, then unpaired reads processed in the cloud, with their mate unpaired flags set by the converter, could trigger validation errors when they are fed to the legacy tools. Is there a downside to altering the converter code as well as modifying the validator (I don't know what altering google packages entails...)

lbergelson commented 9 years ago

Thanks for looking into this. It's fiddly stuff. You're definitely right about it causing unnecessary complications. It's certain that we'll see comparison between bams processed locally and in the cloud, at the least just in testing our own code. Minor differences like this already have been causing pain for testing. I think altering the google code would be a good thing.

The best thing to do is is to submit a pull request against their repository, with an explanation that the changes are equally correct, but more in line with existing bams. I suspect they will accept that. Would you be up for taking that on? The repo is gatk-tools-java.

I've opened ticket #592 here to fix SamFileValidator as well.

nenewell commented 9 years ago

Yes - I'd be happy to do both of those.Thank you.

nenewell commented 9 years ago

Cleaning the bam (and the sam) test files is almost done - it's taken some time because some have been like mini debugging exercises in themselves, followed by the capturing and substitution of the new expected output files for the tests. There are a few issues that still remain, however, and unfortunately I am out of time - I'm headed to an overseas meeting on Monday and will be out for two weeks. I had hoped to finish before the trip, but my BMC Bioinformatics paper came through so I had to spend time on proofs etc.. I will resume ASAP after I get back.