pcadapt returned a bit odd results, please help!

Mary-00 commented 5 years ago

Hi Michael and all,

I’m working with a list of SNP variants related to a complex disease. I would like to find which SNPs are outlier between my population and other populations, like 1000 genome project. To this end, I used vcftools to calculate Fst value between my population and each of 1000 genome populations (African, European, etc.). Then, I used “pcadapt” (version 4.0.3) for finding outlier SNPs, however, the results sound a bit odd. While pcadapt found 108 outlier SNPs between my population and African population, it founds 187 outlier SNPs between my population and European populations. We know that, the allele frequency of the SNP variants are much similar to European and differ from the African population, so I expected to have more outlier SNPs between my population and African than European. Also, I checked the Fst value obtained by vcftools for these outlier SNPs and found that just 17 SNPs from 187 outlier SNPs reported by pcadapt have the Fst value of equal to or greater than 0.1; so it’s a bit strange how many SNPs with Fst value of 0.00001 or 0.0019 were detected as outlier by pcadapt. Also, the score plot clearly show two distinct clusters of my population and African population, but no separated cluster for my population and European population. Could you please kindly share me your idea and any suggestions and comments?

Thanks in advance

privefl commented 5 years ago

Are those outlier SNPs in LD?

Mary-00 commented 5 years ago

I used LD.clumping within the pcadapt with thr=0.1. so, they shouldn't be in LD, yes?

privefl commented 5 years ago

It depends. Sometimes, clumping is not enough.

Could you link to your pcadapt object? Or even better, the data and code you used.

Mary-00 commented 5 years ago

You're right. clumping is not enough in my case. The print (outliers) showed many numbers that closely near each other, something like below: 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 .... so they are in LD, yes?

Could you please kindly give me your email address to send you the codes and related plots?

Thanks in advance

privefl commented 5 years ago

Just put them here. Or in some dropbox that I (and others) can access.

One possible solution would be to remove all the outliers from this run and rerun pcadapt using only non-outliers for the PCA step (but computing the p-values for everyone). Yet, this is not easy to do so with current code.

Mary-00 commented 5 years ago

I put the codes and the plots (loadings plot before and after LD.clumping as well as Manhattan plot) here:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno_pcadapt <- read.pcadapt ("file2.pcadapt", type="pcadapt")
x <- pcadapt (input = geno_pcadapt, K = 2)
plot (x, option = "screeplot")
par(mfrow = c(1, 2))
for (i in 1:2)plot(x$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
res <- pcadapt (geno_pcadapt, K = 2, LD.clumping = list(size=2887, thr = 0.1))
par(mfrow = c(1, 2))
for (i in 1:2)plot(res$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
p.adj <- p.adjust (res$pvalue, method=”BH”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

loadings_before_ld clumping loadings_after_ld clumping manhattan_plot

So, these outliers are still in LD. You mentioned that One possible solution would be to remove all the outliers from this run and rerun pcadapt using only non-outliers for the PCA step (but computing the p-values for everyone). Could you please kindly explain me more and tell me how it helps to solve the problem?

Thank you

privefl commented 5 years ago

Is it normal you have less than 3000 SNPs?

mblumuga commented 5 years ago

It is not a pbm to have outliers in LD. It is expected with dense data. pcadapt does similar thing that GWAS, it looks for SNPs related to a variable which is population structure for pcadapt. In GWAS, you have peaks, using pcadapt to detect outliers (i.e. SNPS excessively related to pop structure), you also expect peaks if you have dense data.

Did you look at the PC scores (PC2 vs PC1) by coloring points by population of origin to check that the score plot is relevant?

Mary-00 commented 5 years ago

Hi all, Regarding 3000 SNPs, these SNPs are related to a specific trait. Regarding the score plot, it clearly shows two distinct clusters of my population and African population, but no separated cluster for my population and European population as we expected (since we know that, the allele frequency of the SNP variants are much similar to European and differ from the African population), however, it is not matched with pcadapt results. mypop_afr mypop_eur

As I mentioned in the original post, pcadapt found 108 outlier SNPs between my population and African population, and 187 outlier SNPs between my population and European populations, which just 17 SNPs from 187 outlier SNPs have the Fst value (obtained by vcftools) of equal to or greater than 0.1 ; so it’s also a bit strange how several SNPs with Fst value of 0.00001 or 0.0019 were detected as outlier by pcadapt.

Could you please help me to correctly obtain the outlier SNPs?

Thanks

privefl commented 5 years ago

I think you should not used pcadapt on a subset of SNPs. It needs many SNPs to capture population structure and to identify outlier SNPs from the rest of "null" SNPs. See this other issue.

Mary-00 commented 5 years ago

Do you mean this subset of SNPs (about 2500 SNPs) is not enough dense to detect outlier? What about the Fst value? As I mentioned, I used vcftools to get the Fst value (on vcf file without LD clumping and MAF threshold) and got about 600 SNPs with Fst value of 0.1 or greater (that can be considered outlier, it’s also unusually large number of outlier) when comparing my population and African population, but it was just 17 SNPs with comparing my and European populations that is somewhat reasonable. However, the allele frequency of this subset of snps is significantly different among my population and population of 1000 genome (based on the Fisher’s test done on the count of these snps). So I would like to find the population structure relevant to these snps and the real outlier. Hope there is any way to solve this problem.

privefl commented 5 years ago

I fear that, when choosing only a subset of SNPs, you capture something else that population structure, and that you don't capture well the structure of the "null" SNPs (e.g. for estimating the covariance in the Mahalanobis distance but also to calibrate p-values). If you want to see what happens for those SNPs, just do pcadapt on all SNPs and see what happens for those ones.

Fst is a univariate measure, so it should not depend on other SNPs you test.

Mary-00 commented 5 years ago

Hi and thank you for your response; however, I’m a bit confused as I’m new and it’s my first experience in this field. You mentioned that the population structure doesn’t well capture using a subset of SNP. However, I think the PCA plot of this subset of snps sounds reasonable and consistent with the mean Fst value between my population and other populations. In the picture, blue color is my population and other is super populations of 1000 genome. Green and red colors are EAS and AFR populations, which the mean Fst value of them was 0.08 while it’s 0.02 for AMR population (Mustard color) and for other (EUR and SAS) was 0.01. Could you please kindly let me know your opinion about it, is this PCA plot correct?

pca

Also, the population structure for subset of variants related to CVD disease was shown in this paper (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0192446.

Regarding Fst value per each snp, I found that about 600 snps of 2500 snps have a large fst value (0.1 to even 0.6) between my pop and say AFR or EAS. Could you please let me know your idea about it, how it should be interpreted?

Actually, I wanted to use pcadapt to select outliers with more reliability than fst value. You suggested using pcadapt on all SNPs and see what happens for the subset of interest. My variants derived from whole-genome sequencing, so it’s a large number of snps. If I want to use all snps for pcadapt analysis, I have to merge the vcf files of each chromosome to have a single vcf file with all chromosomes for my population, and do the same task for 1000 genome vcf files, finally merged my vcf file with 1000 genome vcf file and feed to pcadapt as input and looking whether the snps of interest (subset of snps) detected as the outlier in the results, yes, is it your mean? If yes, if pcadapt can handle such a large vcf file as the input? I wish there is another solution. Please kindly correct me whenever I’m wrong.

Any comments and suggestion on this issue would be highly appreciated.

Many thanks in advance

privefl commented 5 years ago

I let @mblumuga answer for the interpretation.

For the feasibility, you could (should) use PLINK to do the merging/conversion/quality control.

Mary-00 commented 5 years ago

Hi @mblumuga, Could you please let me know your idea about my last comment? I'm waiting for an update from you.

Thanks

mblumuga commented 5 years ago

Africa Looking at the score plot, it seems to me that when using African pops and "your pop", PC1 discriminates your pop from African pops. In this case, you can use pcadapt whith K=1 and you should get correct outliers, which are the same as the SNPs with large Fst values. Europe If this the score plot, then again the rank of the chi-square statistic you get with pcadapt should be somewhat similar to the rank obtained when using Fst. Did you plot the chi-square stat as a function of Fst?

Mary-00 commented 5 years ago

Thanks for your feedback. I try pcadapt with considering your comments and get back to you.

Mary-00 commented 5 years ago

Hi Michael and all,

Regarding my original issue, I used pcadapt on my variants with K= 1 as you suggested in your last comment for detecting outlier SNPs between my population and another populations. Firstly, as privefl suggested in another issue (https://github.com/bcm-uga/pcadapt/issues/30), I used the below command for loading vcf file into pcadapt :

file <- read.pcadapt ("pop.vcf", type="vcf")
but got the following error:
Error in charSepXPtr (file, n = file.size(file), m = 1, r = 0):
Invalid argument

Finally, I load vcf file into pcadapt viavcf2pcadapt(data, output = "data.pcadapt", allele.sep = c("/", "|")) and performed the analysis with K=1, but the results didn’t improved. Actually, during the comparing my population with African and European populations, pcadapt returned 2 and 9 outlier SNPs, respectively. Again, the more outlier SNPs has been detected with European than African population, while we know that our allele frequency of variants is much similar to European population, so the fewer outlier expected when we compare our population with European. In the previous pcadapt analysis with K=2, pcadapt retuned 108 and 187 outlier SNPs during comparing my population with African and European populations, respectively. Now, with K=1, this result changed to 2 and 9 outliers. I really would like to use pcadapt for this kind of analysis, but sounds that something is still wrong. Here is the scores plot of my population with African:

afr-pop

And the scores plot of my population with European:

eur-pop

As you can see, these scores plots aren't matched with the reported outlier SNPs if I'm right. Could you please kindly tell me your suggestions and comments for doing the pcadapt analysis in the correct manner?

Thank you in advance

mblumuga commented 5 years ago

Dear @Mary-00,

it is not because differentiation between "your pop" and Europe is smaller than differentiation between "your pop" and Africa, that you expect less outliers. In a class with small kids (between 98cm and 102 cm), 3 kids measuring 105 cm are outliers. In a class with tall kids (between 150 and 170 cm), there might be not outliers at all!

Again, I suggest you plot -log10(pvalue) as a function of Fst to check that outliers are correct. To compute Fst, here is a suggestion. Hope this helps.

Mary-00 commented 5 years ago

Hi Michael,

I back to you with a bit delay as our server was under reconstruction. Please kindly share your opinion about the issue with me. I followed your advice in the previous posts, so I run pcadapt with K=1 for finding outlier SNPs between two populations and draw the plot of log10 (pvalue) as a function of Fst to check that outliers are correct. Here is the plot for my population vs African population (1000 genome); here, pcadapt reported just two outliers SNPs at FDR < 0.05 that is correct as they have large Fst value (about 0.6). however, my question is, why other SNPs that have also large Fst value has not been detected as the outlier by pcadapt?

fst_afr

The similar issue was also happened during comparing my population with East Asian population; the plot of log10 (pvalue) as a function of Fst for this comparison is below:

fst_eas

Here, pcadapt didn't return any outlier, although there are some SNPs with high Fst value.

Thanks in advance

mblumuga commented 5 years ago

To detect outliers, you have 3 techniques that I list below from the more to the less conservative one (see E. Choosing a cutoff for outlier detection in the pcadapt tutorial) Bonferroni Benjamini Hochberg qvalue

Depending on the approach you choose, you might end up with a different number of outliers.

You might also consider an other outlier approach to compare with pcadapt. I suggest the Fst approach called Outflank.

Mary-00 commented 5 years ago

Hi Michael, Thanks a lot for your response. Yes, I'm aware of various pvalue correction methods have different power. However, the p-value of my results (outlier SNPs) that I talked about them was adjusted by FDR approach that is less conservative as far as I know; I may be wrong.

As you showed that the performance of pcadapt is more than Outflank in your paper and also considering this issue with Outflank https://github.com/whitlock/OutFLANK/issues/12, I don’t think to get better results with Outflank. Considering these issues, could you please let me know if you still suggest using Ouflank or any alternative tool?

Thanks for all your help

mblumuga commented 5 years ago

Sure, If you want to compute fst between 2 pops (your pop vs Africa or your pops vs East Asia), I would recommend to compare results of OutFLANK and of pcadapt.

bcm-uga / pcadapt

pcadapt returned a bit odd results, please help! #31