Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

Inflated p-values from selection scan? #49

Open kellybarr opened 3 years ago

kellybarr commented 3 years ago

I'm running the selection scan in PCANGSD and getting really high pvalues. I've tried numerous options, but always end up with results like what I've attached here. Might anyone have any thoughts on why we're getting these results? I'll also attach the PCA. With significance at 10^-8, like the whole doggone genome is under selection.

man_plot_2K

image

Rosemeis commented 3 years ago

Hi, This definitely looks like something is wrong. Can you tell which selection statistic this is from, and how you have plotted it?

Best, Jonas

kellybarr commented 3 years ago

Yes, this one is with -pcadapt and with an -e 2.

I plotted in r using some custom script I found online (https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_Manhattan_Plots_in_R).

kellybarr commented 3 years ago

But I've also run -selection and get the same outcome.

Rosemeis commented 3 years ago

Okay, I think the problem is that you are plotting the selection statistics instead of the p-values.

For "-pcadapt" option and "-e 2", you would use 2 d.o.f. after running the R script for generating the selection statistics from z-scores (pcangsd/scripts/pcadapt.R): pvals <- pchisq(selection, 2, lower.tail=F)

And for the "-selection" option, you would use 1 d.o.f. for each PC independently directly from the PCAngsd output: pvals.pc1 <- pchisq(selection[,1], 1, lower.tail=F) pvals.pc2 <- pchisq(selection[,2], 1, lower.tail=F)

Hope this helps, Jonas

kellybarr commented 3 years ago

I wish that was the case! These are the p-values calculated from the chisq distribution as suggested in the r script provided.

head(sort(data$P)) [1] 3.988167e-306 9.431123e-296 1.175848e-278 2.054144e-272 [5] 2.399460e-264 4.171052e-251

kellybarr commented 3 years ago

Oh and here's the selection stat sorted:

head(sort(d2)) [1] 2.655629e-06 9.785684e-06 1.649725e-05 1.996042e-05 [5] 2.384299e-05 3.747775e-05

KHanghoj commented 3 years ago

Sorry for the late reply.

the pcadapt selection statistic returns very inflated p-values when the population structure is not continuous. Based on PC1-2, the samples do not show a continuous separation. see fig s5 in https://www.biorxiv.org/content/10.1101/2021.03.01.432540v1 where we test CEU,YRI, CHB. PCadapt returns very inflated p-values.

best, Kristian

akimmitt commented 1 year ago

Hello, I'm having a similar issue to @kellybarr, except the FastPCA and PcAdapt results are very different. My PcAdapt p-values are extremely small/inflated in some cases (e.g., 9.03173459559243e-304, sometimes it's even zero); this means that I have snps that are close to 200-300 when I use the -log10 conversion (whereas I think it's very typical of the y-value on Manhattan plots to be around 30-40 max); since some p-values are zero, they can't even be plotted. On the other hand, SNPs have p-values no smaller than 1e-7 when I use the FastPCA method.

I got these results using the script: pcangsd --selection --sites_save --snp_weights --pcadapt --minMaf 0.05 -b {beagle file from ANGSD} -o {header for all cov and npy files}

I then used the pcadapt.R script from this site to find my test statistics and p-values from the zscores produced by PCANGSD.

Do you know any reasons why these p-values would be so small for the PCAdapt method?