Closed BelenJM closed 4 years ago
My popgen knowledge is very limited, but I can surely help with the software and the statistical methods.
Why more outliers using more PCs?
The advantage of pcadapt over methods such as Fst is that pcadapt should have more power by using more PCs. So, if you have more than 1 PC associated with pop structure, then it should always be benificial to use more than 1 PC. But you should verify that you're not using PCs capturing some other structure, such as LD.
I don't know if it is common to have so few SNPs (I'm more used to human data), but 1800 is probably too small. In pcadapt, you need to have "null" variants to be able to capture outliers correctly. You can always add some fake SNPs for this purpose (cf. #56).
Anyway, you should check with the PC scores the structure you're capturing. And probably check that the p-values are okay.
Thanks for the detailed answer and for the suggestion on the fake SNPs. The reasoning for adding "null" SNPs -- is it related to the computation of the Mahalanobis distance, as the algorithm uses the mean overall SNPs in the dataset to compute the distance to determine if an outlier is such or not? From a stat point of view, would there be any potential bias by incorporating this "noise" into the dataset? Just as a follow up from your suggestion: after adding 200e3 null SNPs generated using the distribution of the heterozygosity in these two populations I am comparing, the Manhattan plot looks like this: , where the first 2000 correspond to my dataset. When exploring the outlier SNPs, most of them are among the "real" SNPs and only a few among the "fake" SNPs.
What is your threshold of significance that you're using? You can probably use 0.05 / 200e3, or trying some FDR tool.
I actually used 0.01 using a BH approach. I get around 91 outliers from the "real" SNPs, and 5 from the fake ones.
Is there still something bothering you?
I wonder if you could explain the reasoning for adding "null" SNPs -- is it related to the computation of the Mahalanobis distance, as the algorithm uses the mean overall SNPs in the dataset to compute the distance to determine if an outlier is such or not? From a stat point of view, would there be any potential bias by incorporating this "noise" into the dataset? Thanks a lot!
To detect outliers, you need a lot of inliers. I don't think it introduces any bias, and this is a good check that everything works fine. You can check the distribution of pvalues for the fake variants, and all variants as well. You should look at the PC scores and the scree plot to choose K.
Thanks a lot for the insights - really appreciated.
I am using PCAdapt to study some salmon populations, where we have temporal data from a capture seq experiment. We are investigating the change in local adaptation in these populations. And for this we are looking into seeing if we can see some outliers associated to PC1, where we have most of the variation observed (in general) between the 2 populations we are comparing in each case. I noticed that when I select K=1 in pcadapt main function, I get some specific SNPs – however, when I select more K’s, I get more SNPs associated to K=1 that were not associated when I used only K=1. And I am not sure I get the reason why.
Also, I am using a dataset for this analysis that I have filtered for a bunch of parameters. Basically we filtered out those SNPs that were non-biallelic, indels, had more than 80% of missing data, a minor allele frequency (MAF) < 0.05, a minor allele count (MAC) < 2, and a mean depth < 3. We filtered out loci for which the observed Heterozygosity was larger than 0.5. We kept this panel of SNPs as “dataset 1”, with around 2200 SNPs, spread over 29 chromosomes. We also performed a pruning step, where we selected 1 SNP every ~1000bp and took out SNPs with r2 < 0.25. This pruned dataset consisted of ~1800 SNPs. I am using for the analysis dataset1 but I am wondering if it is the best dataset to use in pcadapt.
Any insights are really welcome! Thanks a lot in advance for the help.