bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

Data filtering before analysis #57

Closed — aydaraliev closed this issue 3 years ago

aydaraliev commented 3 years ago

Hi everyone! For traditional PCA, populations should be filtered for MAF and HWE independently of each other, while LD pruning is commonly applied to the full set of samples (populations) analyzed. To avoid overestimated eigenvectors, PCA also requires removing outlier samples (e.g., by applying the 6-sigma rule). What about MAF, HWE, LD pruning, and the 6-sigma rule in pcadapt? Does it use the same approach or not?

privefl commented 3 years ago

There are parameters for LD pruning (clumping) and a MAF threshold in pcadapt. Any other QC you need to do yourself, e.g. with PLINK. Note that the 6-sigma rule is not really great for detecting outliers in PCA; we discussed this here: https://doi.org/10.1093/bioinformatics/btaa520.
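For reference, the two built-in filters correspond to the `min.maf` and `LD.clumping` arguments of `pcadapt()`. A minimal sketch (the file name, `K`, and thresholds below are placeholders, not recommendations):

```r
library(pcadapt)

# read a PLINK .bed file (hypothetical file name)
input <- read.pcadapt("mydata.bed", type = "bed")

# run the scan with a MAF threshold and LD clumping;
# both filters are applied to the whole (merged) dataset
res <- pcadapt(input, K = 10,
               min.maf = 0.05,
               LD.clumping = list(size = 200, thr = 0.1))
summary(res)
```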

akhrunin commented 3 years ago

Dear Florian, thanks for your answer and the useful article as well. Nevertheless, I would like to ask you for some details about how LD pruning (clumping?) and the MAF threshold work in pcadapt. In particular, how should the tested populations be pruned: separately, or pooled together (e.g., we would like to run pcadapt on 9 different populations at once)? If pruned separately, the populations will share few SNPs, so logically it would seem optimal to prune them together as a whole. For MAF filtering the situation looks different, as we would not expect to lose as many SNPs as with population-specific LD pruning. How do you filter the datasets for MAF? For example, in hapFLK the SNPs are filtered by MAF only after merging the genotypes of all tested population samples.

One more thing I would ask: how many populations can be analyzed with pcadapt in one run (are there any limitations)? I know of articles where populations were analyzed in a pairwise manner, with a tested population compared against one reference population per run (over several reference populations in total). How relevant are such approaches?

privefl commented 3 years ago

Everything is performed on the whole data, including pruning/clumping and MAF threshold.

I think you should run pcadapt on all populations at once to maximize power. But if you are interested in variants separating only two populations, you can run it pairwise (generating the subset datasets yourself, e.g. with PLINK).

akhrunin commented 3 years ago

Dear Florian, I appreciate your answer! Could we also briefly discuss HWE testing? I understand that a HWE test should not be applied to the whole dataset when it comes from different populations. But what do you think about applying it to a single population? One widely accepted reason to use a HWE test is to exclude SNPs with genotyping or genotype-calling errors. On the other hand, deviations from HWE may also be due to natural selection. So, to maximize power (in terms of detecting all potential signals of selection), should the HWE test not be used at all? Do you agree?

privefl commented 3 years ago

I'm no expert in how best to use HWE testing. I thought I remembered reading some paper that showed how to perform HWE testing while accounting for the structure of multiple populations, but I can't find it again (if it is not just my imagination). Sorry I can't be of more help.

akhrunin commented 3 years ago

Dear Florian, following your suggestion, we ran pcadapt on all our populations. First, we obtained both the scree plot and the loadings for the first four PCs on the initial full set of SNPs (512K SNPs with MAF > 0.05). The loadings did not clearly show (no large peaks) that any of the tested PCs captured LD structure. Just in case, we also performed the same checks on two subsets of thinned SNPs (LD.clumping = list(size = 200, thr = 0.1) and list(size = 500, thr = 0.2)). The resulting scree plots and loadings looked similar to the run using the full list of SNPs. So I would like to ask: if the loadings look OK, could we use the whole genotype matrix (the initial list of SNPs) to search for signals of natural selection, to maximize power?
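A minimal sketch of this kind of LD check, along the lines of the pcadapt vignette (file name, `K`, and clumping parameters are placeholders):

```r
library(pcadapt)

# hypothetical input file
input <- read.pcadapt("mydata.bed", type = "bed")
res <- pcadapt(input, K = 20,
               LD.clumping = list(size = 200, thr = 0.1))

# scree plot (Cattell's rule: keep the PCs before the flat part)
plot(res, option = "screeplot")

# loadings of the first 4 PCs: large localized peaks
# would indicate that a PC is capturing LD structure
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(res$loadings[, i], pch = 20, cex = 0.5,
       xlab = "SNP index", ylab = paste0("Loadings PC", i))
}
```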

privefl commented 3 years ago

At some point, I think adding more SNPs won't add power anymore. You're probably fine with what you have right now.

NB: You should still get the stats for the ones removed by clumping.

privefl commented 3 years ago

[attached image: clumping200-0 1_scree]

Given the scree plots that you sent, I would use K=9.

The first 4 loadings look fine; you can look at the next 5 as well if you use K=9.

akhrunin commented 3 years ago

Dear Florian, thanks for the comments! I would like to ask: 1) How can we get the stats for the results of clumping (the list of SNPs left after clumping)? Unfortunately, we could not find how to do that in the tutorial or manual. 2) Looking at the scree plot, we considered K=4 the optimal K for collecting the outlier SNPs (according to Cattell's rule). That is why we checked the loadings only for PC1 to PC4 (one of our colleagues even suggested considering K=2). Why do you consider the higher K value more reliable? Even ADMIXTURE analysis gave K=7 (the set of populations included 9 from Russia and 4 from the 1000 Genomes).

privefl commented 3 years ago
  1. These would be the ones for which the loadings are not missing I guess.

  2. I would have justified K=9 based on Cattell's rule. You could probably choose more clearly using a log scale (e.g. adding `+ ggplot2::scale_y_log10()`). You can also check the PC scores to see whether they separate your individuals into several groups.
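A sketch of both suggestions, assuming `res` is a pcadapt result fitted with `LD.clumping` (the `pop` label vector is a placeholder you would supply yourself):

```r
library(pcadapt)
library(ggplot2)

# 1) SNPs kept by clumping are, per the answer above, the ones
#    with non-missing loadings
kept <- which(!is.na(res$loadings[, 1]))

# 2) scree plot on a log scale, which can make the
#    Cattell's-rule elbow easier to read
plot(res, option = "screeplot") + scale_y_log10()

# PC scores colored by population, to check whether the
# top PCs actually separate the groups
plot(res, option = "scores", i = 1, j = 2, pop = pop)
```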

akhrunin commented 3 years ago

Thanks for the clue!

privefl commented 3 years ago

For the HWE in structured populations, there is this reference you can look at: https://doi.org/10.1111/1755-0998.13019