Illumina / PlatinumGenomes

The Platinum Genomes Truthset
https://illumina.github.io/PlatinumGenomes

Two indels or one SNV #13

Closed AuroUTU closed 3 years ago

AuroUTU commented 3 years ago

Hi, I am a PhD student working on genomic variants. In the Platinum Genomes 2017 hg19 file NA12878.vcf.gz (https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/hg19/small_variants/NA12878/NA12878.vcf.gz), I found these two records:

chr19 36397290 . CA C . PASS MTD=bwa_platypus;KM=8.96;KFP=0;KFF=0 GT 0|1
chr19 36397299 . A AT . PASS MTD=bwa_platypus;KM=9.30;KFP=0;KFF=0 GT 0|1

However, when I checked the reference sequence around these two variants, it is:

CAAAAAAAAATTTTTTTTA

So I am confused about whether one deletion (an A is deleted) plus one insertion (a T is inserted after the last A) should instead be written as a single SNV (A changes to T).

I then used GATK (v4.1.9.0) to call variants on the BAM file downloaded from ENA (PRJEB1813). The result is:

19 36397299 . A T 645.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.782;DP=46;ExcessHet=3.0103;FS=1.177;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.04;ReadPosRankSum=1.153;SOR=0.458 GT:AD:DP:GQ:PL 0/1:20,26:46:99:653,0,519

I wonder why the VCF file contains two indels instead of one SNV. Since the genotypes of both indels are 0|1, I think they occur on the same haplotype.
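To make the comparison concrete, here is a minimal Python sketch that applies each representation to the reference stretch quoted above and compares the resulting haplotype sequences. It rests on two assumptions: that the quoted sequence starts at position 36397290 (inferred from the REF alleles of the records), and that both indels sit on the same haplotype. The helper `apply_variants` is a hypothetical name written for this illustration, not part of any variant caller or library.

```python
def apply_variants(ref, start, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants to ref.

    `start` is the 1-based genomic coordinate of ref[0] (an assumption here).
    """
    out = []
    cursor = 0  # index into ref of the first base not yet copied
    for pos, ref_allele, alt_allele in sorted(variants):
        i = pos - start
        if i < cursor:
            raise ValueError("overlapping variants")
        if ref[i:i + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at {pos}")
        out.append(ref[cursor:i])   # unchanged bases before the variant
        out.append(alt_allele)      # substitute ALT for REF
        cursor = i + len(ref_allele)
    out.append(ref[cursor:])        # unchanged tail
    return "".join(out)


# Reference stretch from the post: C, nine A's, eight T's, A.
# Assumed to begin at chr19:36397290.
REF = "C" + "A" * 9 + "T" * 8 + "A"
START = 36397290

# Platinum Genomes (Platypus) representation: two indels.
two_indels = [(36397290, "CA", "C"), (36397299, "A", "AT")]
# GATK representation: one SNV.
one_snv = [(36397299, "A", "T")]

hap_indels = apply_variants(REF, START, two_indels)
hap_snv = apply_variants(REF, START, one_snv)

print(hap_indels)
print(hap_snv)
print("same haplotype sequence:", hap_indels == hap_snv)
```

If the two indels were on different haplotypes, the comparison would not apply, because each haplotype would then carry only one of the two changes.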

Looking forward to your reply. Thank you very much.

eberle commented 3 years ago

Hello, and thanks for using the Platinum Genomes. You are correct that this could be represented in multiple ways, but we started with the calls made by the different software algorithms, and in this example the caller, Platypus, called this as two separate indels.

Cheers,

-Mike

AuroUTU commented 3 years ago

Thank you very much for this reply.

Best