possible error in tajima.test that causes miscalculation of the statistic

omys-omics commented 1 year ago

Hi, I've been trying to use the tajima.test function on a dataset that includes missing data, and I noticed that there is a huge difference in the value of Tajima's D that is returned depending on whether I include sites with missing data or get rid of them before calculating the statistic. Specifically, the issue is when there is missing data in at least one sequence at a segregating site. I looked at the calls within tajima.test to the functions dist.dna and seg.sites, and I think that's where the issue is. By default, dist.dna drops sites with missing data from its calculation of pi, while seg.sites doesn't, resulting in different sites being used for calculating of pi and S if there is missing data at a segregating site.

For example, with this toy dataset: A TAA B TAA C ATT D AAT The returned value of Tajima's D is 1.089763 (p=0.2758177)

But if there is missing data in one sequence at the first site: A TAA B TAA C ATT D NAT Then the returned value of Tajima's D is -2.598665 (p=0.00935871)

However, I think this problem can be solved by simply setting pairwise.deletion = T in the dist.dna function. This (counterintuitively) keeps sites with missing data in its calculation of pi, meaning the same sites are considered for both pi and S. For the toy dataset with missing data at one site, the returned value of Tajima's D is then -0.7544511 (p=0.4505784)

I wanted to bring this to your attention, because it seems that this issue can result in extremely significant values to be returned, but when the missing data is accounted for, the values may not be significant.

emmanuelparadis commented 1 year ago

Hi, Thanks a lot for pointing this out. I fixed it and pushed here. Best,

omys-omics commented 1 year ago

Great, thanks!

emmanuelparadis / pegas

possible error in tajima.test that causes miscalculation of the statistic #77