Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

Genotype likelihood matrix format inside pcangsd (L) #51

Closed alxsimon closed 3 years ago

alxsimon commented 3 years ago

Hi, I am trying to use internal functions of pcangsd as I already have a numpy matrix of genotype likelihoods with shape (m, n*3), where m is the number of sites and n the number of individuals.

I looked through the code but I think I am missing something, I don't really understand how you switch from the beagle format (m, n3) to the format of the L array (m, n2). Is one of the genotype likelihood dropped and which one? I could not wrap my head around the C code in reader_cy.pyx.

Thanks for your help.

Thanks for your help. Alexis

PS: this is for a small project I have of implementing local PCA for low-coverage data using pcangsd, I plan to contact you at some point but I would like to have a prototype working before that.

Rosemeis commented 3 years ago

Hi Alexis,

Yeah sorry, in the newest version (1.02) I dropped the third genotype likelihood (P(X|G) = 2) for each individual to save some memory. If you don't wanna deal with changing your dataset, you can download the 1.01 version here: https://github.com/Rosemeis/pcangsd/tree/c43b949b05e2e16872ea385eb30f48d6b71246f4

It sounds very interesting with the local PCA! Looking forward to hear more :-)

Best, Jonas

alxsimon commented 3 years ago

Thanks for the quick answer Jonas. Great, now I understand the C code. I think I will stick with the newer version for easier compatibility and for memory saving as well.

My main objective is to be able to try local PCA as done in the package lostruct (they even started a python implementation), but they start from called genotypes in vcf format. Not ideal when working with low-coverage data. I am not sure I will be able to do this quickly but if I manage to get this working it would be nice to have this implemented, either inside PCAngsd or as a standalone package.

More to come at some point. Best, Alexis