chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/
409 stars 125 forks source link

Distance matrices meets multiallelic sites #214

Closed esctrionsit closed 2 years ago

esctrionsit commented 2 years ago

Hi,

When I'm calculating genetic distance between samples using plink, I found the result is weird to me.

The VCF file I used is :

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003 NA0004
20  1   .   A   G   50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    0/0:35:4    0/0:17:2    0/0:40:3    0/0:40:3
20  2   .   A   G   50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    0/0:35:4    0/0:17:2    0/0:40:3    0/0:40:3
20  3   .   A   G,C 50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    0/0:35:4    0/0:17:2    1/1:40:3    2/2:40:3
20  4   .   A   G   50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    0/0:35:4    0/0:17:2    0/0:40:3    0/0:40:3
20  5   .   A   G   50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    0/0:35:4    0/0:17:2    0/0:40:3    0/0:40:3
20  6   .   A   G   50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    0/0:35:4    0/0:17:2    0/0:40:3    0/0:40:3
20  7   .   A   G   50  PASS    NS=3;DP=9;AA=A  GT:GQ:DP    1/1:35:4    0/0:17:2    0/0:40:3    0/0:40:3

The result was:

0   2   4   2.33333
2   0   2   0
4   2   0   0
2.33333 0   0   0

What makes me confused is that why the distance between NA0004 and NA00001 is 2.3333 rather than 4, and the distance between NA0004 and NA00003 is 0 instead of 2?

The command I used is:

plink --vcf test.vcf \
      --allow-extra-chr \
      --distance square flat-missing \
      --out plinkdist

And my plink version is v1.90b6.10 64-bit (17 Jun 2019)

Could you please explain more details about the algorithm in calculating distance to me? Thanks so much!

chrchang commented 2 years ago

PLINK 1.9 does not support multiallelic variants. By default, --vcf imports the two most common alleles, and converts the rest to missing.

esctrionsit commented 2 years ago

Got it. Thanks for your reply!