BGI-shenzhen / VCF2Dis

VCF2Dis: A new simple and efficient software to calculate a p-distance matrix and construct a population phylogeny based on Variant Call Format
MIT License

Error in pairwise distance value by a factor of 10 #1

Closed rachitasrivastava closed 1 month ago

rachitasrivastava commented 3 years ago

I ran the following command:

/VCF2Dis-master/bin/VCF2Dis -InPut input.vcf.gz -OutPut output.mat

I generated this matrix for a dataset for which pairwise genetic distance values were already available, so I compared the VCF2Dis results to those values. The results seem to differ by a factor of 10: when I divide the VCF2Dis values by 10, they match the already available dataset. Can you please explain:

What happens to sites where data is missing in one of the two samples in a pair?

What is L in your formula? Is it the complete genome, or just the sites that are considered when calculating genetic distance?

hewm2008 commented 3 years ago

1. VCF2Dis is software that calculates a p-distance matrix, and I think p-distance is different from genetic distance.
2. If the genotype of either of the two samples is missing at a site, that site does not take part in the calculation.
3. L is the number of sites actually compared for the pair. For example, with 10 sites:

sample1:  A  A  A  A    -  -  A  A  A  A
sample2:  A  A  A  M    A  -  A  C  A  A
DiffA:    0  0  0  0.5  -  -  0  1  0  0    sum 1.5    Diff(1_2) is 1.5
VarL:     1  1  1  1    0  0  1  1  1  1    sum 8      L(1_2) is 8

Finally, p_dis(1_2) = Diff(1_2) / L(1_2) = 1.5/8 = 0.1875
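
The worked example can be reproduced with a minimal Python sketch. This is not VCF2Dis's actual code: the function names `site_diff` and `p_distance` are hypothetical, and the scoring rule (0 for identical genotypes, 0.5 when the two genotypes share one allele, 1 when they share none) is only inferred from the example, where `M` is read as the IUPAC code for an A/C heterozygote. Sites where either sample is `-` are skipped and do not count toward L.

```python
# Sketch only: scoring rule inferred from the 10-site example in this thread,
# not taken from the VCF2Dis source.

# IUPAC codes mapped to allele sets (assumed genotype encoding).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "M": {"A", "C"}, "R": {"A", "G"}, "W": {"A", "T"},
    "S": {"C", "G"}, "Y": {"C", "T"}, "K": {"G", "T"},
}
MISSING = "-"


def site_diff(g1, g2):
    """Return the difference at one site, or None if either genotype is missing."""
    if g1 == MISSING or g2 == MISSING:
        return None
    a, b = IUPAC[g1], IUPAC[g2]
    if a == b:
        return 0.0  # identical genotypes
    if a & b:
        return 0.5  # one shared allele (assumed rule, e.g. A vs M)
    return 1.0      # no shared allele


def p_distance(sample1, sample2):
    """Return (diff_sum, L, p_dis) for two equal-length genotype lists."""
    diff_sum, compared = 0.0, 0
    for g1, g2 in zip(sample1, sample2):
        d = site_diff(g1, g2)
        if d is None:
            continue  # missing in either sample: site is excluded from L
        diff_sum += d
        compared += 1
    p_dis = diff_sum / compared if compared else float("nan")
    return diff_sum, compared, p_dis


sample1 = "A A A A - - A A A A".split()
sample2 = "A A A M A - A C A A".split()
diff_sum, L, p_dis = p_distance(sample1, sample2)
print(diff_sum, L, p_dis)  # 1.5 8 0.1875
```

Because L counts only the sites compared for that pair, it can differ from pair to pair and is generally smaller than the genome length or the total number of sites in the VCF.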