Closed sooheelee closed 7 years ago
Hey @yfarjoun, was told to bring this to your attention.
The given sam file is invalid (in new and exciting ways that validate sam doesn't detect!)
The problems were:
After fixing these problems by hand I ran SetNmUqAndMD on it and then ValidateSam (VERBOSE mode).
I got no errors.
FWIW I maintain a pipeline that aligns to the full hs38DH with the HLA contigs in it and runs ValidateSamFile on every BAM file coming out the far end. I've had no issues with ValidateSamFile and odd reference names.
@yfarjoun Please see the forum thread as it shows how the SAM file does validate by ValidateSamFiles under other situations.
Here is something interesting. If I modify the mini-reference to contain a different starting contig, then ValidateSamFiles gives a slightly different error message.
Before, for the NM tag (nucleotide differences) in file [0] does not match reality [1]
verbose message, the numbers for the reads were [1], [2], [5], [5] and [5]. These match with what the user sees and with the ordering of the HLA contigs.
If I remove the first three contigs of the twelve contig reference, then run ValidateSamFiles, I get these numbers for the error message: [2], [3], [6], [6] and [6]:
WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38.sam R=hla_3.fa MODE=VERBOSE
[Fri Jun 09 15:35:10 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38.sam MODE=VERBOSE REFERENCE_SEQUENCE=hla_3.fa MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 15:35:10 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [2]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [3]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [6]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [6]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [6]
[Fri Jun 09 15:35:10 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
So ValidateSamFile does something differently for the NM tag accounting depending on the reference contigs (with similar names).
I should stress the user's alignments come from Novoalign and their workflow differs from our production workflow. Nevertheless, these reads, no matter the tags absent, appear valid and ValidateSamFile should validate them. And it does, given the mini-matching contig reference but not the GRCh38 reference.
Yossi points out my header is improper and this is indeed the case. My test SAM's header only contains one @SQ
line for one contig.
If I fix the header to contain all twelve contigs, then ValidateSamFile validates the SAM!
WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla12_test_hg38.sam R=hla_2.fa MODE=VERBOSE
[Fri Jun 09 15:51:24 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla12_test_hg38.sam MODE=VERBOSE REFERENCE_SEQUENCE=hla_2.fa MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 15:51:24 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
No errors found
[Fri Jun 09 15:51:25 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
@SQ
line.[A] + [2]
WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38_fix.bam MODE=VERBOSE R=hla_2.fa
[Fri Jun 09 17:00:11 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38_fix.bam MODE=VERBOSE REFERENCE_SEQUENCE=hla_2.fa MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 17:00:11 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [5]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [5]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [5]
[Fri Jun 09 17:00:11 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
[A] + [3]
WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38_fix.bam MODE=VERBOSE R=hla_3.fa
[Fri Jun 09 17:01:05 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38_fix.bam MODE=VERBOSE REFERENCE_SEQUENCE=hla_3.fa MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 17:01:05 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [2]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [3]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [6]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [6]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [6]
[Fri Jun 09 17:01:05 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
User from forum thread is not likely using this type of artificial setup that I've concocted for testing purposes. So the error they see could well be from a full header situation.
It appears that for this HLA contig case, that when the SAM header contains a subset of the contigs in the reference, a situation that we allow for in our tools, that the validation logic is different. Because we allow this non-exact subset header contig list situation (I assume this because ValidateSamFile did not complain), either we should (i) not allow such differences, (ii) we should fix the validation bug for this type of scenario for ValidateSamFile, or (iii) we should carefully document this behavior in the tool doc for users to beware.
I assume we allow for non-exact subset header contig lists because in the past we have wanted to be able to drop EBV contigs and mito contigs.
If you want a tech writer's opinion, I think it prudent to allow for the subset-header-contig-list-situation because of just how long GRCh38's contig list is and how nice it would be to take around files indicating the full reference hash (e.g. as in CRAM validation) but only containing the @SQ
lines pertinent for the alignments within the file.
Ok, I feel that I'm beating a dead horse here but in case you are not convinced. Here are two more results. These references aren't in the zipped bundle I uploaded but you can recreate them easily.
WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38_fix.bam MODE=VERBOSE R=hla_5.fa
[Fri Jun 09 17:23:47 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38_fix.bam MODE=VERBOSE REFERENCE_SEQUENCE=hla_5.fa MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 17:23:47 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [72]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [73]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [74]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [73]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [73]
Finally, we can confirm these NM tag (nucleotide differences) in file [0] does not match reality [72]
numbers correspond to the actual mismatches between the read alignment information and the mistaken identity contig using BLAST.
HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239 0 HLA-A*11:50Q 1091 30 101M * 0 0 CGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGG @?>?=>?@>@@@@==@@>>>>@>>@@@@@>@??@@>@@>@@>@@@@>@@>@@>@@@@@?@@@=@>=AA@>@@>@>@>>?>@?>>?@@@>>@>@@@>@?>?? ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:101 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:2 NM:i:0 SM:i:2 PQ:i:1 UQ:i:0 AS:i:0 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756 0 HLA-A*11:50Q 1092 30 101M * 0 0 GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAACGCAAGTGGGAGGCGGC ?@@>=@@8@@@?==?@<==<@9<@5@@@>@>=@@9??>@@?@@@@>@>=@@<9?@<?><??<?<>>:>.??==:>>>@=9>=>.>>?==?=@??9>>?9=@ ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:83G17 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:2 NM:i:1 SM:i:2 PQ:i:26 UQ:i:13 AS:i:18 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908 0 HLA-A*11:50Q 1101 30 101M * 0 0 GGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCCCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGC ?>@;<@@>===?==@@@@@=@=>@?=@@>@@>@>@@=@@=@A=@@@@@?@@@=9==@>@=@?+<=@<:?=@?/2<>9:/;;;@@@>=:?@@@==>;>?@@> ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:62T38 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:1 NM:i:1 SM:i:3 PQ:i:13 UQ:i:10 AS:i:
[A] + [2]'s first contig is HLA-A*11:01:01
The first three reads are said to mismatch by 1, 2 and 5 in reality and we BLAST each against HLA-A*11:01:01
and see:
[A] + [3]'s first contig is HLA-A*11:110
The first three reads are said to mismatch by 2, 3 and 6 in reality and we BLAST each against HLA-A*11:110
to see:
These correlations strongly suggest ValidateSamFile uses the first contig it sees in similarly named contigs to validate NM tags against.
Without having run SetNmUqAndMd I'm not sure what this is proving...except that there are values in the bam that are wrong. sound like a user error to me, not a bug.
The values in the BAM are correct. The NM values match the alignments.
The cases where the contigs in the dict do not match the contigs in the header are irrelevant. The program should blowup on them and I'm sorry it doesn't (it will soon). the references are looked up by their index and so if the index in the bam doesn't match the index in the reference you'll get garbage.
The only cases that this is honored are A + [1] and B + [2] and they indeed validate.
I'm not sure what else I can say. I opened a ticket in HtsJdk that will make ValidateSam be more strict (force dictionaries to match for example) https://github.com/samtools/htsjdk/issues/899
I still do not see a bug here. And most significantly, this has nothing to do with HLA. you can reproduce this by removing any contig from the header..the contigs that follow will be messed up. I suspect that your user removed one (or more) of the contigs between the autosomals and HLA....
Why can't the tool match on exact entire contig name? GATK had this bug where it interpreted colons as the end of a contig name and what followed the colons as genomic positions. The bug was unanticipated because using colons in contig names is new to GRCh38. I would like us to fix the source of this mismatch behavior.
this has nothing to do with HLA. you can reproduce this by removing any contig from the header..the contigs that follow will be messed up.
Ok. If you are certain of the above, then I will agree with you.
Come to think of it, for my test data, the indexes and dictionaries were generated exactly on each reference FASTA and are correct. My BAM doesn't have an index. So FASTA content should match exactly the corresponding index and dictionary content.
Matching by index is faster than string comparison.
this has nothing to do with the colons despite the fact the gatk had a bug regarding colons.
I'm not taking about a .bai file. I mean the index number or position of the contig in the dictionary. Which is why the dictionaries must match.
On Jun 11, 2017 4:42 PM, "sooheelee" notifications@github.com wrote:
Come to think of it, for my test data, the indexes were generated exactly on each reference FASTA and are correct. My BAM doesn't have an index.
— You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub https://github.com/broadinstitute/picard/issues/833#issuecomment-307655488, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0utImOubOjlB63sKE2R6MrDpWkxUks5sDFFQgaJpZM4N1pPM .
There is no BAI. But YF is referring to the implicit index in the BAM header @SQ lines.
Bug Report
Affected tool(s)
ValidateSamFile and possibly other Picard tools with the same dependencies
Affected version(s)
Description
A user and I describe the problem in http://gatkforums.broadinstitute.org/gatk/discussion/comment/39347#Comment_39347. This error sounds eerily similar to that in GATK which was fixed for GATK v3.6.
Here is a description of the behavior for GATK:
Steps to reproduce
Here's the data I made for testing: shlee_hla_test.tar.gz
Here are the commands that produce the error:
[E] With reference hla_2.fa, SUMMARY mode, gives ERROR
[F] With reference hla_2.fa, VERBOSE mode, gives ERROR and all five reads
Expected behavior
Tool should identify correct GRCh38 alternate contig to match NM tag on and pass these reads as valid
Actual behavior
Tool errors saying NM tag values do not match