broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
975 stars 369 forks source link

Bug: ValidateSamFile gives false errors because it cannot parse GRCh38 alternate contigs correctly #833

Closed sooheelee closed 7 years ago

sooheelee commented 7 years ago

Bug Report

Affected tool(s)

ValidateSamFile and possibly other Picard tools with the same dependencies

Affected version(s)

Description

A user and I describe the problem in http://gatkforums.broadinstitute.org/gatk/discussion/comment/39347#Comment_39347. This error sounds eerily similar to that in GATK which was fixed for GATK v3.6.

Here is a description of the behavior for GATK:

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

  • For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
  • When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
  • However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

    -L HLA-A*01:01:01:01:1+

Steps to reproduce

Here's the data I made for testing: shlee_hla_test.tar.gz

Here are the commands that produce the error:

[E] With reference hla_2.fa, SUMMARY mode, gives ERROR

WMCF9-CB5:hg38 shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38.sam MODE=SUMMARY R=hla_2.fa
[Fri Jun 09 12:35:27 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38.sam MODE=SUMMARY REFERENCE_SEQUENCE=hla_2.fa    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 12:35:27 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:INVALID_TAG_NM    5

[Fri Jun 09 12:35:27 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

[F] With reference hla_2.fa, VERBOSE mode, gives ERROR and all five reads

WMCF9-CB5:hg38 shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38.sam MODE=VERBOSE R=hla_2.fa
[Fri Jun 09 12:35:41 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38.sam MODE=VERBOSE REFERENCE_SEQUENCE=hla_2.fa    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 12:35:41 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [5]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [5]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [5]
[Fri Jun 09 12:35:41 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Expected behavior

Tool should identify correct GRCh38 alternate contig to match NM tag on and pass these reads as valid

Actual behavior

Tool errors saying NM tag values do not match

sooheelee commented 7 years ago

Hey @yfarjoun, was told to bring this to your attention.

yfarjoun commented 7 years ago

The given sam file is invalid (in new and exciting ways that validate sam doesn't detect!)

The problems were:

  1. mate information existed on unpaired record
  2. Dictionary in sam-file didn't equal .dict file of reference
  3. QUAL length did equal SEQ length
  4. CIGAR "length" was 101 while (some) reads had length 96

After fixing these problems by hand I ran SetNmUqAndMD on it and then ValidateSam (VERBOSE mode).

I got no errors.

issue833.zip

tfenne commented 7 years ago

FWIW I maintain a pipeline that aligns to the full hs38DH with the HLA contigs in it and runs ValidateSamFile on every BAM file coming out the far end. I've had no issues with ValidateSamFile and odd reference names.

sooheelee commented 7 years ago

@yfarjoun Please see the forum thread as it shows how the SAM file does validate by ValidateSamFiles under other situations.

sooheelee commented 7 years ago

Here is something interesting. If I modify the mini-reference to contain a different starting contig, then ValidateSamFiles gives a slightly different error message.

Before, for the NM tag (nucleotide differences) in file [0] does not match reality [1] verbose message, the numbers for the reads were [1], [2], [5], [5] and [5]. These match with what the user sees and with the ordering of the HLA contigs.

If I remove the first three contigs of the twelve contig reference, then run ValidateSamFiles, I get these numbers for the error message: [2], [3], [6], [6] and [6]:

WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38.sam R=hla_3.fa MODE=VERBOSE
[Fri Jun 09 15:35:10 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38.sam MODE=VERBOSE REFERENCE_SEQUENCE=hla_3.fa    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 15:35:10 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [2]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [3]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [6]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [6]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [6]
[Fri Jun 09 15:35:10 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

So ValidateSamFile does something differently for the NM tag accounting depending on the reference contigs (with similar names).

sooheelee commented 7 years ago

I should stress the user's alignments come from Novoalign and their workflow differs from our production workflow. Nevertheless, these reads, no matter the tags absent, appear valid and ValidateSamFile should validate them. And it does, given the mini-matching contig reference but not the GRCh38 reference.

sooheelee commented 7 years ago

Yossi points out my header is improper and this is indeed the case. My test SAM's header only contains one @SQ line for one contig.

sooheelee commented 7 years ago

If I fix the header to contain all twelve contigs, then ValidateSamFile validates the SAM!

WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla12_test_hg38.sam R=hla_2.fa MODE=VERBOSE
[Fri Jun 09 15:51:24 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla12_test_hg38.sam MODE=VERBOSE REFERENCE_SEQUENCE=hla_2.fa    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 15:51:24 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
No errors found
[Fri Jun 09 15:51:25 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
sooheelee commented 7 years ago

I fixed the test data: shlee_hla_test2.zip.

Here is the synopsis:

[A] + [2]

WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38_fix.bam MODE=VERBOSE R=hla_2.fa
[Fri Jun 09 17:00:11 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38_fix.bam MODE=VERBOSE REFERENCE_SEQUENCE=hla_2.fa    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 17:00:11 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [1]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [2]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [5]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [5]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [5]
[Fri Jun 09 17:00:11 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

[A] + [3]

WMCF9-CB5:20170609_myourshaw_hla_validatesamfile_error shlee$ java -jar $PICARD ValidateSamFile I=se_hla_test_hg38_fix.bam MODE=VERBOSE R=hla_3.fa
[Fri Jun 09 17:01:05 EDT 2017] picard.sam.ValidateSamFile INPUT=se_hla_test_hg38_fix.bam MODE=VERBOSE REFERENCE_SEQUENCE=hla_3.fa    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Jun 09 17:01:05 EDT 2017] Executing as shlee@WMCF9-CB5 on Mac OS X 10.11.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.9.2-SNAPSHOT
ERROR: Record 1, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [2]
ERROR: Record 2, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [3]
ERROR: Record 3, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [6]
ERROR: Record 4, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [6]
ERROR: Record 5, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [6]
[Fri Jun 09 17:01:05 EDT 2017] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=257425408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

User from forum thread is not likely using this type of artificial setup that I've concocted for testing purposes. So the error they see could well be from a full header situation.

It appears that for this HLA contig case, that when the SAM header contains a subset of the contigs in the reference, a situation that we allow for in our tools, that the validation logic is different. Because we allow this non-exact subset header contig list situation (I assume this because ValidateSamFile did not complain), either we should (i) not allow such differences, (ii) we should fix the validation bug for this type of scenario for ValidateSamFile, or (iii) we should carefully document this behavior in the tool doc for users to beware.

I assume we allow for non-exact subset header contig lists because in the past we have wanted to be able to drop EBV contigs and mito contigs.

If you want a tech writer's opinion, I think it prudent to allow for the subset-header-contig-list-situation because of just how long GRCh38's contig list is and how nice it would be to take around files indicating the full reference hash (e.g. as in CRAM validation) but only containing the @SQ lines pertinent for the alignments within the file.

sooheelee commented 7 years ago

Ok, I feel that I'm beating a dead horse here but in case you are not convinced. Here are two more results. These references aren't in the zipped bundle I uploaded but you can recreate them easily.

sooheelee commented 7 years ago

Finally, we can confirm these NM tag (nucleotide differences) in file [0] does not match reality [72] numbers correspond to the actual mismatches between the read alignment information and the mistaken identity contig using BLAST.

HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239  0   HLA-A*11:50Q    1091    30  101M    *   0   0   CGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGG   @?>?=>?@>@@@@==@@>>>>@>>@@@@@>@??@@>@@>@@>@@@@>@@>@@>@@@@@?@@@=@>=AA@>@@>@>@>>?>@?>>?@@@>>@>@@@>@?>??   ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:2  NM:i:0  SM:i:2  PQ:i:1  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756   0   HLA-A*11:50Q    1092    30  101M    *   0   0   GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAACGCAAGTGGGAGGCGGC   ?@@>=@@8@@@?==?@<==<@9<@5@@@>@>=@@9??>@@?@@@@>@>=@@<9?@<?><??<?<>>:>.??==:>>>@=9>=>.>>?==?=@??9>>?9=@   ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:83G17  PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:2  NM:i:1  SM:i:2  PQ:i:26 UQ:i:13 AS:i:18 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   
HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908   0   HLA-A*11:50Q    1101    30  101M    *   0   0   GGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCCCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGC   ?>@;<@@>===?==@@@@@=@=>@?=@@>@@>@>@@=@@=@A=@@@@@?@@@=9==@>@=@?+<=@<:?=@?/2<>9:/;;;@@@>=:?@@@==>;>?@@>   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:62T38  PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:1  SM:i:3  PQ:i:13 UQ:i:10 AS:i:

These correlations strongly suggest ValidateSamFile uses the first contig it sees in similarly named contigs to validate NM tags against.

yfarjoun commented 7 years ago

Without having run SetNmUqAndMd I'm not sure what this is proving...except that there are values in the bam that are wrong. sound like a user error to me, not a bug.

sooheelee commented 7 years ago

The values in the BAM are correct. The NM values match the alignments.

yfarjoun commented 7 years ago

The cases where the contigs in the dict do not match the contigs in the header are irrelevant. The program should blowup on them and I'm sorry it doesn't (it will soon). the references are looked up by their index and so if the index in the bam doesn't match the index in the reference you'll get garbage.

The only cases that this is honored are A + [1] and B + [2] and they indeed validate.

I'm not sure what else I can say. I opened a ticket in HtsJdk that will make ValidateSam be more strict (force dictionaries to match for example) https://github.com/samtools/htsjdk/issues/899

I still do not see a bug here. And most significantly, this has nothing to do with HLA. you can reproduce this by removing any contig from the header..the contigs that follow will be messed up. I suspect that your user removed one (or more) of the contigs between the autosomals and HLA....

sooheelee commented 7 years ago

Why can't the tool match on exact entire contig name? GATK had this bug where it interpreted colons as the end of a contig name and what followed the colons as genomic positions. The bug was unanticipated because using colons in contig names is new to GRCh38. I would like us to fix the source of this mismatch behavior.

this has nothing to do with HLA. you can reproduce this by removing any contig from the header..the contigs that follow will be messed up.

Ok. If you are certain of the above, then I will agree with you.

sooheelee commented 7 years ago

Come to think of it, for my test data, the indexes and dictionaries were generated exactly on each reference FASTA and are correct. My BAM doesn't have an index. So FASTA content should match exactly the corresponding index and dictionary content.

yfarjoun commented 7 years ago

Matching by index is faster than string comparison.

this has nothing to do with the colons despite the fact the gatk had a bug regarding colons.

I'm not taking about a .bai file. I mean the index number or position of the contig in the dictionary. Which is why the dictionaries must match.

On Jun 11, 2017 4:42 PM, "sooheelee" notifications@github.com wrote:

Come to think of it, for my test data, the indexes were generated exactly on each reference FASTA and are correct. My BAM doesn't have an index.

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/broadinstitute/picard/issues/833#issuecomment-307655488, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0utImOubOjlB63sKE2R6MrDpWkxUks5sDFFQgaJpZM4N1pPM .

sooheelee commented 7 years ago

There is no BAI. But YF is referring to the implicit index in the BAM header @SQ lines.