chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/
415 stars 126 forks source link

Two issues with IBS matrix calculation #285

Closed alexviiia closed 1 week ago

alexviiia commented 2 weeks ago

Dear Christopher,

I was looking at the IBS calculations produced by plink 1.9, and observed some problems relevant to my work. I created a minimal BED file with just one SNP and with 4 individuals with all possible genotypes (including missing), which allowed me to see the values that the IBS function generates for every pair of genotypes, and I got this. (I tried attaching my BED file but github isn't letting me, can share by email if you'd like.)

Command:

plink1 --bfile data --distance square ibs --out ibs
Output matrix, reformatted for clarity: 0 1 2 NA
0 1 0.5 0 nan
1 0.5 1 0.5 nan
2 0 0.5 1 nan
NA nan nan nan 1

There are two diagonal values in this table that I consider bugs (underlined above), the rest are perfect.

  1. The most obviously concerning one is that when both genotypes are NA the result should be treated as missing, not as matching (in the context of this specific data, the result should have been nan since there were no observations). I think this is a clear bug, though fortunately low impact assuming low missingness.

  2. The other issues is more subtle but very consequential for my work. Following the IBS formula of Astle and Balding (2009), particularly Eq. 2.3 and Table 1 (screenshots below), the IBS value between heterozygous genotypes (1 and 1) should be 0.5, not 1, since when you consider them as unordered haploids (0,1) and (0,1), all of those pairwise comparisons only match each other half of the time (i.e. 0-0 matches, 1-1 matches, but 0-1 and 1-0 do not match). I know there may be some concern with breaking backward compatibility here, so I'm not opposed to giving this estimator a different name if you think that is more acceptable. I noticed currently these IBS calculations are such that the diagonal always equals 1 no matter what, but I disagree with that behavior since according to the interpretation I present here, self IBS should be 1 only if an individual is completely homozygous at every locus.

astle-balding-2009-ibs-eq

astle-balding-2009-ibs-table

Link to full Astle and Balding (2009) paper: https://projecteuclid.org/journals/statistical-science/volume-24/issue-4/Population-Structure-and-Cryptic-Relatedness-in-Genetic-Association-Studies/10.1214/09-STS307.full

Lastly, as a related request, if you agree with my proposed changes, could you also port the IBS calculations to plink2? I'm also interested in the 1-IBS version.

Thanks!!! -Alejandro Ochoa, Assistant Professor of Biostatistics and Bioinformatics, Duke University

chrchang commented 1 week ago
alexviiia commented 1 week ago

Thank you for the prompt reply. Your points make sense of course, I understand breaking backward compatibility is a huge deal and in light of your explanations I don't really wish the --distance ibs behavior to change. In particular, now that I see the diagonal is always 1, I confirmed in a bigger example that NA-NA pairs in different individuals result in nan estimates, as desired, so you can completely disregard my suggestion that those were being assigned to 1, since they are not, just the diagonal is special here. As for my second issue, I did confirm that heterozygote pairs in different individuals are assigned to 1 too, rather than 0.5, so that case is not explained by the diagonal always equaling 1, here the behavior I do not desire extends to off-diagonal calculations too.

Anyway, rephrasing, I still need the Astle and Balding calculations for my research, and in fact I already have implementations in R, but they scale poorly in large datasets (n=10,000 individuals), while in my experience plink is blazingly fast so I was hoping I could get this to work. I will look at your last suggestion tomorrow to see if I can get that to work for my needs.

Lastly, I do want to clarify that the Astle and Balding formula is much more meaningful for my applications because, after an additional transformation, these estimates reflect population kinship, and the diagonal reflects population inbreeding. It does not bother me that their formula doesn't assign self individuals to 1, or conversely that 1-ibs would no longer be a distance. Of course, you presumably want a distance if you're using --distance, but I actually desire to estimate covariance (kinship), and IBS estimates have desirable properties over the standard GRM and other estimators like KING-robust, all of which I'm aware of and have used and studied, but are not what I need moving forward. My previous work was on analyzing estimator bias, and I derived an unbiased estimator called "popkin", which happens to be a scaled version of the Astle and Balding formula, hence this request. I'll leave it here for now but I hope to eventually persuade you or somebody else to write a more efficient version of my estimator, I would be using it a lot in my work and I hope many others will too.

Thanks again!

alexviiia commented 1 week ago

I tried plink2 --make-king-table cols=+ibs1 and I agree it provides all of the pairwise proportions I need to perform my calculation. The diagonal values need a separate calculation but that's doable with a different plink2 command too. So overall, while not ideal, this solves my immediate problems. Thanks!