Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

selection and pcadapt output questions #90

Open lkrichardson opened 2 months ago

lkrichardson commented 2 months ago

Hi I have two questions one about --pcadapt and one about --selection.

  1. pcadapt- In the paper you describe adjusting the output by the genome inflation factor, how can I do that with my output? I was able to calculate the p-values based on the output with the associated script included in pcangsd.

  2. I also ran --selection without specifying the number of pcs, and got output for 3 pc axes which was what I expected given my admix results. I calculated p-values from the output for each pc axis using your recommendations in R: pvals.pc1 <- pchisq(selection[,1], 1, lower.tail=F) after applying Benjamini–Hochberg, the interesting thing is that none of the pvalues are significant along pc1 or pc3, and only ~22K are significant along pc2. I also looked at the distribution of p-values and they look kind of weird- along pc1 the histogram is hump shaped, and for the other two there is more of a U shape. This is in contrast to the pcadapt output which after Benjamini–Hochberg yielded ~360K significant SNPs from nearly 3 million that were input, and the histogram of pvalues is more uniform especially at the right side of the plot.

I'm wondering why the results between the two methods might be so different and if something strange is going on with one or the other method? Any insights would be helpful, hopefully these questions are appropriate to post here and if not I'm sorry and thanks!

Rosemeis commented 2 months ago

Hi,

Thank you for reaching out. The genomic inflation factor is calculated as follows in R: GIF=median(obs, dof)/qchisq(0.5, dof), where dof is the degrees of freedom.

Could you provide with some more information (sample size, number of SNPs) and maybe output log from pcangsd? :-) You can also attach the histograms if possible.

Best, Jonas

lkrichardson commented 2 months ago

Hi Jonas, Thanks so much for your response.

Just to follow up on the GIF and make sure I get the correct order of operations- I do the GIF calculation to the output of pcangsd pcadapt before calculating pvalues, and then I can still use the same script for pcadapt to calculate pvalues, and then I can control for false discovery rate, is that correct?

I guess I didn't save the output log unfortunately. Here was my logfile for pcangsd: PCAngsd v1.11 Time: 19/06/2024 09:26:57 Directory: /home/lkr Options: -beagle /home/lkr/jtgi/angsd/unlinkedSNPs/unlinkedSNPs.beagle.gz -out /home/lkr/jtgi/angsd/unlinkedSNPs/pcangsdSelection/pcSelmaf05 -selection -pcadapt -sites_save

The input file was beagle output from angsd that was LD-pruned to 6691644 SNPs (at minMaf 0.01). I didn't change the default for pcangsd, so it filtered down to 2942397 with minMaf 0.05, and this is for a sample size of 298 trees sampled across the range. Here are screenshots of the histograms of pvalues, the pvals.sel.pcX are the histograms of p-values associated with the --selection method, and in contrast you can see the histogram output from pcadapt. It also occurred to me to visualize the snp weights but I realized I didn't output them in a file.

For more general background info- this is low coverage whole genome data from a long-lived desert tree species. From visualizing the pca and admix results it looks like the first pc axis basically explains the east-west differentiation which indeed explains differences between the two closely related sister species, the eastern and western trees, pc2 seems to explain differences between the northern and southern populations of the western trees, and pc3 explains north/south differences in the eastern trees...

pvalsHist