Open gbrandt6 opened 4 years ago
Hi, thanks for pointing this out. I actually looked at the SAM file:
samtools view test.bam| cut -f 3 | uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr20
chr19
chr22
chr21
and my reference file:
zcat hg19.fa.gz|grep '>'
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chrX
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr20
>chrY
>chr19
>chr22
>chr21
>chr6_ssto_hap7
>chr6_mcf_hap5
>chr6_cox_hap2
>chr6_mann_hap4
>chr6_apd_hap1
>chr6_qbl_hap6
>chr6_dbb_hap3
>chr17_ctg5_hap1
>chr4_ctg9_hap1
>chr1_gl000192_random
>chrUn_gl000225
>chr4_gl000194_random
>chr4_gl000193_random
>chr9_gl000200_random
>chrUn_gl000222
>chrUn_gl000212
>chr7_gl000195_random
>chrUn_gl000223
>chrUn_gl000224
>chrUn_gl000219
>chr17_gl000205_random
>chrUn_gl000215
>chrUn_gl000216
>chrUn_gl000217
>chr9_gl000199_random
>chrUn_gl000211
>chrUn_gl000213
>chrUn_gl000220
>chrUn_gl000218
>chr19_gl000209_random
>chrUn_gl000221
>chrUn_gl000214
>chrUn_gl000228
>chrUn_gl000227
>chr1_gl000191_random
>chr19_gl000208_random
>chr9_gl000198_random
>chr17_gl000204_random
>chrUn_gl000233
>chrUn_gl000237
>chrUn_gl000230
>chrUn_gl000242
>chrUn_gl000243
>chrUn_gl000241
>chrUn_gl000236
>chrUn_gl000240
>chr17_gl000206_random
>chrUn_gl000232
>chrUn_gl000234
>chr11_gl000202_random
>chrUn_gl000238
>chrUn_gl000244
>chrUn_gl000248
>chr8_gl000196_random
>chrUn_gl000249
>chrUn_gl000246
>chr17_gl000203_random
>chr8_gl000197_random
>chrUn_gl000245
>chrUn_gl000247
>chr9_gl000201_random
>chrUn_gl000235
>chrUn_gl000239
>chr21_gl000210_random
>chrUn_gl000231
>chrUn_gl000229
>chrM
>chrUn_gl000226
>chr18_gl000207_random
So no mapped read in the SAM file refers to a contig that is missing from the reference. However, I pre-filtered my SAM file using samtools, and the SAM header still lists some contigs that do not exist in the reference. So it seems that this message comes from the header? I can also use the original reference used for mapping, which also contains the E. coli genome.
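The comparison described above (contigs named in the header versus contigs present in the reference) can be sketched in Python. This is only an illustration: it assumes the header text was obtained with something like `samtools view -H test.bam` and the FASTA names with the `zcat hg19.fa.gz | grep '>'` command shown earlier. The function names and the contig `ecoli_example` are hypothetical and do not come from the thread.

```python
def sq_names(header_text):
    """Return contig names from the @SQ lines of a SAM header, in order."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def fasta_names(fasta_header_lines):
    """Return contig names from FASTA '>' lines (text up to the first whitespace)."""
    names = []
    for line in fasta_header_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.append(parts[0])
    return names


def header_only_contigs(header_text, fasta_header_lines):
    """List header contigs that are absent from the reference."""
    reference = set(fasta_names(fasta_header_lines))
    return [name for name in sq_names(header_text) if name not in reference]


# Illustrative input only; 'ecoli_example' is a made-up contig name.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr21\tLN:48129895\n"
    "@SQ\tSN:ecoli_example\tLN:1000\n"
)
fasta = [">chr1", ">chr21", ">chrM"]

print(header_only_contigs(header, fasta))
```

An empty result would mean every `@SQ` entry has a counterpart in the reference. A non-empty one points at header entries like those suspected here.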
@Zepeng-Mu You say the original mapping contained contigs not in your reference, but you filtered it with samtools to remove those contigs? I wonder if somehow some mappings to other contigs remain. I opened a PR (#6781) to improve the error message. If you want to debug it further (and have the time and inclination), you could build that commit and rerun with it to get the new error message.
Alternatives to proceed would be to use the original reference you mapped it with, or run HaplotypeCaller with an intervals file that only contains the contigs that match the hg19 reference.
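As a rough sketch of the intervals-file alternative, the Python below writes one contig name per line for the contigs that appear in both the BAM and the reference. It assumes that GATK accepts a plain-text file with a `.list` extension containing one interval per line, and that a bare contig name counts as an interval. The file name `hg19_contigs.list`, the function names, and the contig `ecoli_example` are hypothetical.

```python
import os
import tempfile


def matching_contigs(bam_contigs, reference_contigs):
    """Keep BAM contigs that also exist in the reference, preserving BAM order."""
    reference = set(reference_contigs)
    return [name for name in bam_contigs if name in reference]


def write_interval_list(contigs, path):
    """Write one contig name per line (assumed GATK-style interval list)."""
    with open(path, "w") as handle:
        for name in contigs:
            handle.write(name + "\n")


# Illustrative input only; 'ecoli_example' is a made-up contig name.
bam_contigs = ["chr1", "chr2", "ecoli_example"]
reference_contigs = ["chr1", "chr2", "chrX", "chrM"]

shared = matching_contigs(bam_contigs, reference_contigs)
list_path = os.path.join(tempfile.mkdtemp(), "hg19_contigs.list")
write_interval_list(shared, list_path)

with open(list_path) as handle:
    print(handle.read())
```

The resulting file would be passed to HaplotypeCaller with `-L`. Whether that avoids the error reported here is not confirmed in this thread.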
Hi, it is unlikely that any mappings to other contigs remain in the BAM file, since the command I used, samtools view test.bam| cut -f 3 | uniq, checks the chromosome entry of all the mapped reads.
I actually tried removing the contigs in the header section, and now it works fine. The end of the run looks like:
22:45:30.861 INFO ProgressMeter - chr21:48065662 88.1 10112630 114731.5
22:45:30.976 INFO HaplotypeCaller - 0 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
0 total reads filtered
22:45:30.976 INFO ProgressMeter - chr21:48129366 88.1 10112861 114731.7
22:45:30.976 INFO ProgressMeter - Traversal complete. Processed 10112861 total regions in 88.1 minutes.
22:45:31.288 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.864119336
22:45:31.288 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 115.66789462000001
22:45:31.288 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 90.73 sec
22:45:31.289 INFO HaplotypeCaller - Shutting down engine
[August 31, 2020 10:45:31 PM CDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 88.19 minutes.
Runtime.totalMemory()=2630352896
And now the header looks like:
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr20 LN:63025520
@SQ SN:chr19 LN:59128983
@SQ SN:chr22 LN:51304566
@SQ SN:chr21 LN:48129895
So I still think it is the header in BAM that is causing the error message. Thanks
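The thread does not say how the extra header entries were removed, so the following Python is only one possible sketch. It drops `@SQ` lines whose `SN:` name is not in a keep-list and leaves all other header lines untouched. The function name, the `@PG` line, and the contig `ecoli_example` are hypothetical.

```python
def drop_sq_lines(header_text, keep):
    """Return header text with @SQ lines removed unless their SN: name is in `keep`."""
    keep = set(keep)
    kept_lines = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            fields = line.split("\t")[1:]
            name = next((f[3:] for f in fields if f.startswith("SN:")), None)
            if name not in keep:
                continue  # skip contigs that are absent from the reference
        kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"


# Illustrative input only; 'ecoli_example' is a made-up contig name.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:ecoli_example\tLN:1000\n"
    "@PG\tID:example\n"
)

# Caution: BAM records refer to contigs by their position in the @SQ list,
# so removing entries that are not at the end of the list shifts those
# positions. Editing the header in SAM text form avoids that pitfall.
print(drop_sq_lines(header, ["chr1"]), end="")
```

The filtered text still has to be written back to the alignment file with a tool of your choice, and that step is outside this sketch.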
(related to Zendesk ticket #14162)
@Zepeng-Mu Interesting. Thank you for the follow up! That helps clarify what's happening.
Bug Report
Affected tool(s) or class(es)
HaplotypeCaller GVCF mode
Affected version(s)
GATK 4.1.8.0
Description
Discussed on the GATK forum: https://gatk.broadinstitute.org/hc/en-us/community/posts/360072760032-HaplotypeCaller-NullPointerException-Error
Command:
gatk --java-options "-Xmx4g" HaplotypeCaller -R hg19.fa.gz -I test.bam -O test.g.vcf.gz -ERC GVCF
Stack Trace