LD Pruning of SNPs

kose-y commented 4 years ago

@Hua-Zhou at #58

We may open another issue regarding LD pruning of SNPs, so that the remaining SNPs are not highly correlated. A literature review of how PLINK does it would be helpful.

kose-y commented 4 years ago

The source code is here. https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_ld.cc ... but can we find a text description?

kose-y commented 4 years ago

Looks relevant: https://gsejournal.biomedcentral.com/articles/10.1186/s12711-018-0404-z

kose-y commented 4 years ago

https://zzz.bwh.harvard.edu/plink/summary.shtml

Linkage disequilibrium based SNP pruning

Sometimes it is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. This can be achieved via two commands: --indep, which prunes based on the variance inflation factor (VIF) by recursively removing SNPs within a sliding window; and --indep-pairwise, which is similar, except that it is based only on pairwise genotypic correlation.

Hint: The output of either of these commands is two lists of SNPs: those that are pruned out and those that are not. A separate command using the --extract or --exclude option is necessary to actually perform the pruning.

The VIF pruning routine is performed with:

    plink --file data --indep 50 5 2

which will create the files

    plink.prune.in
    plink.prune.out

Each is a simple list of SNP IDs; both of these files can subsequently be specified as the argument for a --extract or --exclude command.

The parameters for --indep are: the window size in SNPs (e.g. 50), the number of SNPs to shift the window at each step (e.g. 5), and the VIF threshold (e.g. 2). The VIF is 1/(1-R^2), where R^2 is the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously. That is, this considers the correlations between SNPs but also between linear combinations of SNPs. A VIF of 10 is often taken to represent near collinearity problems in standard multiple regression analyses (i.e. it implies an R^2 of 0.9). A VIF of 1 would imply that the SNP is completely independent of all other SNPs. Practically, values between 1.5 and 2 should probably be used; particularly in small samples, if this threshold is too low and/or the window size is too large, too many SNPs may be removed.

The second procedure is performed with:

    plink --file data --indep-pairwise 50 5 0.5

This generates the same output files as the first version; the only difference is that a simple pairwise threshold is used. The first two parameters (50 and 5) are the same as above (window size and step); the third parameter represents the r^2 threshold. Note: this now represents the pairwise SNP-SNP metric, not the multiple correlation coefficient. Also note that it is based on the genotypic correlation, i.e. it does not involve phasing.

To give a concrete example, the command above that specifies 50 5 0.5 would:

a) consider a window of 50 SNPs,
b) calculate LD between each pair of SNPs in the window,
c) remove one of a pair of SNPs if the LD is greater than 0.5,
d) shift the window 5 SNPs forward and repeat the procedure.

To make a new, pruned file, use something like the following (in this example, we also convert the standard PED fileset to a binary one):

    plink --file data --extract plink.prune.in --make-bed --out pruneddata
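The --indep-pairwise steps described above can be sketched in a few lines of Python. This is only an illustration of the quoted description, not PLINK's actual implementation: the function names `genotype_r2` and `prune_pairwise` are hypothetical, genotypes are assumed to be coded as 0/1/2 dosages with `None` for missing values, and the rule for which SNP of a correlated pair is dropped (here, the later one) is an assumption, since the quoted text does not specify it.

```python
from itertools import combinations


def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 dosages).

    Samples where either genotype is None (missing) are skipped.
    Returns 0.0 if fewer than two samples remain or either SNP has no variance.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def prune_pairwise(genotypes, window=50, step=5, r2_threshold=0.5):
    """Sliding-window pairwise pruning; returns (kept, pruned) SNP indices.

    genotypes: list of SNPs, each a list of per-sample dosages.
    """
    n_snps = len(genotypes)
    removed = set()
    start = 0
    while start < n_snps:
        stop = min(start + window, n_snps)
        # Compare every pair of SNPs in the current window that is still kept.
        for i, j in combinations(range(start, stop), 2):
            if i in removed or j in removed:
                continue
            if genotype_r2(genotypes[i], genotypes[j]) > r2_threshold:
                # Assumption: drop the later SNP of the pair.
                removed.add(j)
        if stop == n_snps:
            break
        start += step  # shift the window forward
    kept = [k for k in range(n_snps) if k not in removed]
    return kept, sorted(removed)
```

The two returned index lists play the same role as plink.prune.in and plink.prune.out. A real implementation would also need a deliberate rule for choosing which SNP of a pair to remove, and would avoid recomputing r^2 for pairs already seen in overlapping windows.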

kose-y commented 4 years ago

Source code for PLINK 1.07 is much more readable:

https://github.com/poulson/plink/blob/master/genome.cpp#L1172

OpenMendel / SnpArrays.jl

LD Pruning of SNPs #66