Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

Selection scan - positions? #56

Open hyperoplus opened 2 years ago

hyperoplus commented 2 years ago

Hi everyone,

I have a very basic question about the selection scan from PCAngsd, but can't seem to work this out. I ran PCAngsd with the -selection flag as advised in the tutorial, which generated a pop.selection.npy file. I converted this numpy file into a regular column-based format with R, again following the suggested pipeline, but the output with p-values that is generated has no columns with chromosome and position information, only three columns with the statistic. The number of rows is not coincident with either the number of positions in the .sited file I provided to angsd when generating the beagle file, nor the number of rows in the beagle/maf output files from Angsd (although it is close to half of the number of sites).

How can I retrieve the positional information for the statistic that was generated with the -selection flag?

Rosemeis commented 2 years ago

Hi,

PCAngsd only outputs the selection statistics and not information of the sites. When generating genotype likelihoods in ANGSD, you should also get a "filename.mafs.gz", with all information of the sites. If you used -minMaf 0.05 in ANGSD, the sites will be identically with the selection statistics of PCAngsd (PCAngsd automatically filter MAF with a threshold of 0.05). Otherwise you can use -sites_save in PCAngsd to obtain a logical vector that you can use to extract the corresponding sites in the "filename.mafs.gz" file.

Best, Jonas

hyperoplus commented 2 years ago

Hi Jonas, thanks for the answer. I did not impose a minMaf when generating the beagle file, so that is probably the reason why the number of rows in the MAF file and the selection files do not match up. I will experiment with this and will let you know ASAP

zactobias44 commented 2 years ago

Dear Jonas,

I am experiencing a similar issue. When generating the beagle file, I specified a minMaf of 0.01. Obviously, when I run pcangsd with the default minMaf of 0.05, I get fewer sites in the end. However, the sum of the .sites output from pcangsd -selection -sites_save is different that the number of sites in .mafs.gz >= maf 0.05. It's not off by much, but was peculiar. Is there a difference in how maf is calculated from the beagle in pcangsd vs. angsd?

>zcat Bs_CPUPDP_broad_restricted.mafs.gz | tail -n +2 | awk '$7>=0.05' | wc -l
2126237

> awk '$1==1' Bs_CPUPDP_broad_restricted.sites | wc -l
2129436
Rosemeis commented 2 years ago

Hi,

It should be the same EM algorithm, so I imagine this is just a matter of different convergence threshold used. You can try and test it if you want to with the argument "-maf_tole" for your dataset. Default value in pcangsd is 1e-4, where as in ANGSD (http://www.popgen.dk/angsd/index.php/Allele_Frequencies), it seems to be 1e-3 if I'm reading the documentation correctly. But I think our initial vector guesses might differ as well, so it may be impossible to recreate. :-)

Best, Jonas

zactobias44 commented 2 years ago

Thanks Jonas! Since the difference is so small, I will not worry too much about it. I don't know much about how the algorithm works and was not aware it is non-deterministic, but thought you should know. Thanks for clearing that up!

Cheers,

Zac