Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0

How is the K chosen in the program? #82

Closed by yksakana 3 months ago

yksakana commented 11 months ago

Hi, I apologize for the multiple posts. I am currently running an admixture analysis with pcangsd using the following command:

pcangsd -b genolike.beagle.gz --iter 2000 -e 10 -o pcangsd_out --admix

I'm a bit confused about how the program chooses K and how that choice relates to the -e parameter, which sets the number of eigenvalues.

Specifically, when I set -e to 10, the program sets K=11. Conversely, when -e is set to 1, the program sets K=2. Does this imply that K is set to e+1?

Additionally, I'm keen to compare the Frobenius error when K is 1 and 2. Even if I set -e to 0, the program defaults to K=2, preventing me from obtaining the error for K=1. I am aware of the option to manually select K using -admix_K, but as it's not recommended (please see here), I'm unsure how to obtain the Frobenius error when K=1.

Should I proceed by specifying the value of K using the -admix_K option, or are there alternative methods to obtain the Frobenius error for K=1?

Thank you for your assistance.

Rosemeis commented 11 months ago

Hi,

Yes, K = e + 1 follows directly: if you have K distinct populations, you only need e = K - 1 eigenvectors to model them. (https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190)
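The K = e + 1 relationship can be checked on a toy example. The sketch below is not PCAngsd code. It assumes idealized, unadmixed individuals whose allele frequencies equal those of their source population, and the names `matrix_rank`, `centered_rank` and `POP_FREQS` are made up for illustration. After subtracting the per-site mean, a matrix built from K distinct populations has rank K - 1, so K - 1 eigenvectors capture all of the structure.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank of a matrix via Gaussian elimination on Fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def centered_rank(pop_freqs, pop_sizes):
    """Rank of the mean-centered individual-by-site frequency matrix."""
    # One row per individual: each individual carries its population's frequencies.
    rows = [list(f) for f, n in zip(pop_freqs, pop_sizes) for _ in range(n)]
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - mu for x, mu in zip(row, means)] for row in rows]
    return matrix_rank(centered)


# Hypothetical allele frequencies at 4 sites for 3 populations.
POP_FREQS = [
    [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10), Fraction(3, 10)],
    [Fraction(4, 5), Fraction(1, 5), Fraction(2, 5), Fraction(3, 5)],
    [Fraction(1, 2), Fraction(7, 10), Fraction(1, 10), Fraction(9, 10)],
]

if __name__ == "__main__":
    for k in (1, 2, 3):
        e = centered_rank(POP_FREQS[:k], [3, 2, 4][:k])
        print(f"K = {k}: centered rank e = {e}")
```

With real low-depth data, noise and admixture blur this picture, so the rank argument only motivates the K = e + 1 convention and is not a way to estimate K.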

I'm not sure I understand the last question. It seems you want to estimate admixture proportions when only assuming 1 ancestral population?

Best, Jonas

yksakana commented 10 months ago

Hi,

Thank you very much for your reply, and sorry for my late response. I now understand that K is equal to e+1.

Regarding the last question: yes. I'd like to know which assumption explains my data better, 1 cluster or 2 clusters. In the ADMIXTURE software, for example, this can be judged by comparing cross-validation errors. How can I do the same in pcangsd?

Best, Yusuke

Rosemeis commented 10 months ago

Unfortunately, PCAngsd currently has no feature for performing CV for admixture estimation. If the "-e" argument is left at its default ("-e 0"), PCAngsd automatically infers the number of PCs needed for the iterative PCA approach, which can give a hint of the underlying K (assuming K = e+1). We use a minimum average partial (MAP) test to infer "e" (https://doi.org/10.1038/hdy.2011.26).

PCAngsd was developed for structured populations, so unfortunately we have not included the option to set K=1. In that case the individual allele frequencies would simply be the standard population allele frequencies.
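For a rough K=1 baseline computed outside PCAngsd, the statement above suggests a simple calculation: with a single ancestral population, every individual is modelled by the same per-site frequency vector, and the least-squares choice for that vector is the per-site mean. The sketch below is not part of PCAngsd. The function name `k1_baseline` and the input matrix `pi` (individuals by sites, holding individual allele frequencies) are hypothetical. Whether the result is comparable to the Frobenius error PCAngsd reports depends on how PCAngsd defines and normalizes that error, which is not verified here.

```python
import math


def k1_baseline(pi):
    """Return (site_means, frobenius_error) for a one-population model.

    pi is an n x m list of lists of individual allele frequencies
    (hypothetical input, e.g. loaded from a file by the caller).
    """
    n = len(pi)
    # Under K=1 the best constant fit per site is the mean over individuals.
    site_means = [sum(col) / n for col in zip(*pi)]
    # Frobenius norm of the residual between pi and the rank-one fit.
    sq_err = sum(
        (x - mu) ** 2 for row in pi for x, mu in zip(row, site_means)
    )
    return site_means, math.sqrt(sq_err)


if __name__ == "__main__":
    toy = [[0.2, 0.4], [0.4, 0.8]]
    means, err = k1_baseline(toy)
    print(means, err)
```

Such a baseline would only be meaningful next to a K=2 error computed with the same definition on the same matrix.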

yksakana commented 10 months ago

Hi, Thank you very much for your response. I appreciate your detailed explanation!

Best, Yusuke