P-values equal to 0 from pcadapt

Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.

GNU General Public License v3.0

44 stars 11 forks source link

P-values equal to 0 from pcadapt #77

Open akimmitt opened 10 months ago

akimmitt commented 10 months ago

I am using --pcadapt to create selection scans for a comparative project. From my understanding, PCAdapt's algorithm is more robust than FastPCA's algorithm and preferred in the literature.

However, after using the pcadapt.R script to get p-values, some SNPs for many of my species have a p-value =0. These results can't be plotted as -log10(pvalue) because that produces infinite values.

I tried using a median sliding window approach, but am still getting a fair number of zero p-values.

I'm wondering why this might be occurring? Also, is there's an accepted conversion/transformation for this data rather than removing all values that are infinite?

Thanks! Abby

Rosemeis commented 10 months ago

Hi Abby,

Can you share how you generated the p-values as well as a bit of info on the data? :-)

Best, Jonas

akimmitt commented 10 months ago

Hi Jonas,

Sure! My data are lcWGS (3-5x coverage) from boreal songbirds with Appalachian ranges. We are observing both continuous and discrete population structure in our PCAs and admixture plots, in which the Appalachian populations are genetically distinct from the boreal population.

I first ran PCANGSD with the following options: pcangsd --selection --sites_save --snp_weights --pcadapt --minMaf 0.05 -b $BIN -o $OUTDIR/filename, where the input was a beagle file generated from ANGSD.

I then used the PCadapt zscores.npy to extract the p-values using the pcadapt.R script provided:

if (K == 1) { d2 <- (zscores - median(zscores))^2 } else { d2 <- dist_ogk(zscores) }

p.values <- pchisq(d2, df=K, lower.tail=F)

Thanks! Abby

Rosemeis commented 10 months ago

Thanks for the info. Can you also share the output from PCAngsd, if you saved it. :-) How many eigenvectors did PCAngsd use in the estimations?

Best, Jonas

akimmitt commented 10 months ago

Hi Jonas,

I assume you mean the npy file? I'm unable to attach that here because it is too large. If this is what you want to see, can you let me know how to share it with you?

The default was one eigenvector, however, I also forced it to produce two using --selection_e 2 In this case, I also saw that the first vectors didn't match between the default run and the --selection_e 2 run.

Thanks, Abby

Rosemeis commented 10 months ago

Hi Abby,

Sorry for the vague question. I meant more info regarding your run (potential log-file). As in info of number of samples and sites etc.. If you know to some extent what the number of eigenvectors should be to capture the population structure, then I would rather use "--n_eig / -e" instead of "--selection_e". See if that changes your output. And I would prefer "--selection" rather than "--pcadapt" myself. :-)

I have just released a major (v1.2) that shouldn't affect your results at all, but the flag "--minMaf" has been changed to "--maf", if you want to try the new update.

Best, Jonas

akimmitt commented 10 months ago

Hi Jonas,

Sorry for my misunderstanding– I've attached two example log files if it's still useful (I'm working on a comparative project so # inds and # sites vary). In this case, it is using only 1 PC (by default) but I am interested in what is happening in the second PC as well, so I will use the "--n_eig / -e"

I have switched over to using --selection rather than --pcadapt, because the results were more as expected, but I am curious why you prefer --selection over --pcadapt? Smagn_PCA_selection.59261490.out.txt Jhyem_PCA_selection.59261431.out.txt

Rosemeis commented 10 months ago

Thank you! I would definitely re-run PCAngsd with "--n_eig 2" in that case. :-) With only 1 PC in the modelling, that may cause the discrepancies in the p-values when trying to include 2. I like "--selection" a bit more since it computes test statistics per PC, however, they should produce similar results. You can definitely also use both and compare.

Best, Jonas