Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0

output of -selection #70

Open shirasegoby opened 1 year ago

shirasegoby commented 1 year ago

Hello,

Thank you very much for this great tool. I used PCAngsd to find outlier SNP loci.

I used the "-selection" flag and obtained an output file. The output contained one column. Does this mean that only PC1 is significant and that only the selection statistics along PC1 were output?

DanielOsmond commented 1 year ago

To save adding another thread, I have a similar question here. I'm a little confused as to what the selection.npy file is actually reporting.

For context, I'm running PCAngsd with this command:

pcangsd --beagle {inputbeagle} --selection --minMaf 0.05 --threads 16 -o $BASEDIR'angsd/pcangsd'$PREFIX --sites_save

And get an output with this head:

D <- npyLoad("pcangsd_full_snp_selection.selection.npy") # Reads PC based selection statistics
View(D)
head(D)
            [,1]       [,2]       [,3]       [,4]         [,5]       [,6]       [,7]        [,8]
[1,] 0.550135195 0.41988349 2.86682153 0.01884845 0.1577135772 0.24476774 0.27481931 0.282718897
[2,] 0.004351228 1.18976772 0.01015188 0.01219513 0.2708614469 0.27879831 0.68683660 1.323182344
[3,] 1.946982026 0.75790089 0.32707891 0.01051806 0.0303267129 0.01896533 0.07251304 0.008901663
[4,] 0.112891927 3.26745486 1.69801426 0.40741089 0.0003703941 3.15582776 1.04348207 2.947828531
[5,] 0.001022269 0.05364039 0.03050057 0.28526935 0.6652516723 0.65088129 1.93980658 2.141073942
[6,] 0.449037045 0.23769799 0.14198528 1.01102042 0.0241993852 1.01173162 2.27705216 1.996245265

I presume each column is a statistic for a different PC axis, but what is this statistic? Does it represent PCAngsd-s1/s2? Sorry if this is a silly question, but I've scoured the paper, the supplementary materials and the postings here, and I'm still struggling to find an answer.

Thank you!

Dan

Rosemeis commented 1 year ago

Hi both of you! Sorry for the late response but I just came back from vacation. :-)

@shirasegoby There is only one column in the output because only one PC was detected as capturing population structure (or you might have manually set the number to 1), so only that PC is used to detect selection. In the newer version of PCAngsd, you can still perform selection scans for more PCs by using "--selection_e INT".

@DanielOsmond Yes, exactly: each column refers to the selection statistics of one PC that was detected as capturing population structure. The details of the selection scan are unfortunately not part of the original paper, but they are described in this one: https://doi.org/10.1186/s12859-021-04375-2 The selection statistics are chi-square distributed with 1 degree of freedom.
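Since the statistics are chi-square distributed with 1 degree of freedom, each value can be turned into an upper-tail p-value. Below is a minimal sketch in plain Python, not part of PCAngsd itself; the helper name `chi2_1df_pvalue` is hypothetical, and the input is just the first three rows and columns of the output posted above. It relies on the identity P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with 1 degree of freedom.

```python
import math


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    Hypothetical helper for illustration only (not a PCAngsd function).
    For X ~ chi2(1): P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


# First three sites (rows) and first three PCs (columns) of the posted output.
stats = [
    [0.550135195, 0.41988349, 2.86682153],
    [0.004351228, 1.18976772, 0.01015188],
    [1.946982026, 0.75790089, 0.32707891],
]

# One p-value per site and PC, same shape as the input.
pvalues = [[chi2_1df_pvalue(s) for s in row] for row in stats]

for row in pvalues:
    print("  ".join(f"{p:.4f}" for p in row))
```

For the full matrix, the same conversion should be available as `scipy.stats.chi2.sf(D, 1)` in Python or `pchisq(D, 1, lower.tail = FALSE)` in R, with `D` being the loaded selection array.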

Please feel free to ask more questions! :-)

Best, Jonas

shirasegoby commented 1 year ago

Hi Jonas,

I hope you had a good holiday. I understand now. Thank you very much!!

Best, Shotaro

DanielOsmond commented 1 year ago

Thanks for the reply Jonas, that's exactly what I was after. Thank you for helping!

akimmitt commented 11 months ago

Hello! I was similarly getting one column of output for my -pcadapt function when I did not specify "--selection_e". I added "--selection_e 2" to my script, and while I'm now getting two columns of data, the z-scores for PC1 differ from the original PC1 values (when --selection_e was not specified). Why would these values be different? Are the z-scores not calculated for each PC independently?