broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 588 forks source link

GATK User Unable to replicate the spanning deletions documentation #6588

Open GATKSupportTeam opened 4 years ago

GATKSupportTeam commented 4 years ago

This request was created from a contribution made by Saurabh Parikh on April 27, 2020 01:53 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/360061866552-Unable-to-replicate-the-spanning-deletions-documentation

--

Hello,

I am trying to understand GATKs behavior regarding spanning deletions. I went over the documentation page,  spanning or overlapping deletions (* allele) and wanted to replicate the behavior. I was unable to do so. Following is what I did.

GATK version used - 4.1.7.0

simulated reads to observe the spanning deletions. The reads were split into four samples, in the same manner as the documentation. Each sample had 50 identical read pairs with the following alleles at a certain position

The deletion started 2bp upstream of the concerned position. Here is the image from the article modified to explain this case.

  1. These reads were mapped using BWA using: bwa mem -C -M -t 10 /path/to/ref/hg19 R1.fastq R2.fastq > align.sam
  2. The sam file was split into multiple files, one per sample. For each sample sam file GATK HaplotypeCaller was run. The bamout option was used to analyze the realigned bam files. They look as expected except that instead of 25/25 reads for each allele for each sample there were 23/27 reads in all the samples. I am not sure why. gatk --java-options "-Xmx2g" HaplotypeCaller -R /path/to/ref -I /path/to/sam --emit-ref-confidence BP_RESOLUTION -L /path/to/bed_file -O /path/to/gvcf_file -bamout /path/to/sample.bam
  3. The four gvcf files were then merged to generate the vcf file using:

gatk GenomicsDBImport --arguments_file /file/containing/variants/arguments --genomicsdb-workspace-path /path/to/gendb/ --intervals /path/to/bed_file gatk GenotypeGVCFs -R /path/to/ref -V gendb://gendb/ -L /path/to/bed_file --all-sites --include-non-variant-sites -O /path/to/final_vcf_file

For the position of interest, the expected outcome according to the article linked above was:

However the observed calls were:

The GT for samples 2 and 4 was not as expected. Following are the other fields for all the samples (NOTE: The REF and ALT were changed to match my actual case):

Could someone please answer the following questions that come to my mind:

  1. Why did the DPs changed from 25/25 to 23/27 when the HaploytpeCaller realigned the reads
  2. More importantly, for sample 2 and sample 4, why did the GT not match the GT as shown in the article.

Please do let me know if any file need to be shared, and where I should share them.

 

Thank you,

Saurabh Parikh

(created from Zendesk ticket #5402)
gz#5402

bhanugandham commented 4 years ago

As discussed, creating a issue ticket to investigate why the user is seeing inaccurate genotype calls for spanning deletion. Synthetic bam files have been uploaded to the server with the name parikh_spanning_deletions_data.zip. Included in the folder:
./bam_per_RG/ - Contains the synthetic bam and their index files for each sample ./commands.txt - The commands I used to generate the final files ./other_files/gatk.bed - Used to generate output for the interval of interest ./other_files/gvcf_per_RG/ - The gvcf files and the bamout files generated per sample ./other_files/logs - Logs of HC, GenomicsDBImport, and GenotypeGVCFs ./other_files/samples.vcf - The final vcf file generated

User was able to replicate the issue using the standard fasta file found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/