broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

Odd assembly behavior observed in GATK4.beta.6+ HC/M2 but not GATK3.7 HC #4073

Closed sooheelee closed 4 years ago

sooheelee commented 6 years ago

For @droazen on his request. Thank you for looking into this.

shlee_ref200.zip

Observations and commands are in file shlee_ref200_commands.txt and should be copy-pastable if you run them in the shlee_ref200 folder. Folder contains everything you need (reference set, BAMs for normal, tumorA, tumorB, and ref (this last has no variants)) plus the results.

The reference is 200bp long and has a fair amount of low complexity regions. I interspersed some different nucleotides here and there such that pretty much 99% of simulated reads (2x50 to the reference) map unambiguously. If you are wondering of sequence context biases for site 121, then I point out that GATK3.7 HaplotypeCaller has no problems calling all of the designed variants in the three samples. It is GATK4.beta.6 and GATK4.beta.6+ (master branch of 1/6 Saturday morning) that seem to be blind to the G allele there (no problems with C allele in same location).

I think perhaps something is not at parity in terms of the parameters between the two sets and any insight would be appreciated.

Used gatk4.beta.6 master branch of Saturday 1/6/2017 (gatk-4.beta.6-144-g54af6b4-SNAPSHOT).

sooheelee commented 6 years ago

I've updated the commands that now use kebab:

[4] tumorA: GATK4 HC NOT seeing the G allele at 121 (but sees the C).

gatk HaplotypeCaller \
    -I tumorA.bam \
    -R ref200.fasta \
    -O gatk4_hc_tumorA.vcf.gz \
    --max-assembly-region-size 40 \
    --assembly-region-padding 10 \
    --min-assembly-region-size 10 \
    --debug

[5] normal: GATK4 HC completely blind to the GG at 121

gatk HaplotypeCaller \
    -I normal.bam \
    -R ref200.fasta \
    -O gatk4_hc_normal.vcf.gz \
    --max-assembly-region-size 40 \
    --assembly-region-padding 10 \
    --min-assembly-region-size 10 \
    --debug

[6] tumorB: GATK4 HC has no problem seeing the C allele at 121

gatk HaplotypeCaller \
    -I tumorB.bam \
    -R ref200.fasta \
    -O gatk4_hc_tumorB.vcf.gz \
    --max-assembly-region-size 40 \
    --assembly-region-padding 10 \
    --min-assembly-region-size 10 \
    --debug
sooheelee commented 6 years ago

This is the bug I told you about @davidbenjamin. Speak to @droazen as he's looked into it.

droazen commented 6 years ago

I am still in the process of looking at it, but basically what I've found so far is that with certain assembly region boundaries the HC ends up preferring an indel haplotype over the haplotype with the SNPs that GATK3 calls.

sooheelee commented 6 years ago

I've been thinking about this literal edge case. We now have metagenomic pipelines that are meant to align data to presumably extremely small references (bacteria, infectious agents, e.g. viri). These organisms have a different expectation for mutation/variant rates that my synthetic data could represent. I am unfamiliar with the details of the metagenomics pipelines except that it aligns reads to a giant conglomerate of different organisms. I forget whether the pipeline actually produces an alignment BAM or just a list of organisms--perhaps @mwalker174 could inform us. On the forum, we've had a few cases where we encourage folks to use our tools even when they work in other nonmammalian organisms such as bacteria. However, knowing how our assembler handles data at the edges of contigs, and how variants that are close together trigger alternate assumptions, e.g. the presence of an indel as I learned from @droazen, then I'd like to know how I should actually be informing our nonmammalian researchers. Whether they should or should not consider assembly-based calling, whether there are certain parameters they could employ to ensure calling some variant (even if wrong) rather than no variant within the confines of a small genome, or whether I should point them to a pileup caller, etc.

davidbenjamin commented 4 years ago

We're overhauling assembly.