cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

Request: Add Projection to PCA function #323

Open Hjorvik opened 4 years ago

Hjorvik commented 4 years ago

Hi! I'm starting to use scikit-allel, and I'm really enjoying it. However, there is one feature that I believe is missing: the ability to compute the PCs from a subset of the samples and then project the remaining samples into that space. This is especially useful when working with data that have a lot of missing variants (for example, aDNA data), and it would be a nice addition to the toolkit. A popular example of this is SmartPCA: https://github.com/chrchang/eigensoft/blob/master/POPGEN/lsqproject.pdf
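To make the request concrete, here is a minimal NumPy sketch of a least-squares projection for one sample with missing genotypes. It is only loosely in the spirit of the document linked above and is not EIGENSOFT's or scikit-allel's code. The function name `lsq_project`, the use of `np.nan` to mark missing calls, and the assumption that genotypes are already centred and scaled like the reference panel are all illustrative choices.

```python
import numpy as np


def lsq_project(z, loadings):
    """Project one sample onto existing PCs using only its observed variants.

    z        : (n_variants,) centred/scaled genotypes, np.nan where missing
    loadings : (n_variants, k) PC axes fitted on the reference samples
    """
    observed = ~np.isnan(z)
    # Solve min_a || z_obs - V_obs a ||^2 restricted to the observed variants
    coords, *_ = np.linalg.lstsq(loadings[observed], z[observed], rcond=None)
    return coords


# Toy check: a noise-free sample with known coordinates and 30% missing calls
rng = np.random.default_rng(0)
loadings, _ = np.linalg.qr(rng.normal(size=(100, 2)))  # orthonormal columns
true_coords = np.array([1.5, -0.5])
z = loadings @ true_coords
z[rng.choice(100, size=30, replace=False)] = np.nan

recovered = lsq_project(z, loadings)
# Plain projection with missing values set to zero, for comparison
naive = loadings.T @ np.nan_to_num(z)
```

In this toy setup the least-squares solution recovers the true coordinates, while the plain projection with missing values set to zero is pulled towards the origin, which is the kind of shrinkage that makes projection of sparse samples awkward.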

alimanfoo commented 4 years ago

Hi @Hjorvik, apologies for slow reply. IIRC this is already supported to some extent, e.g., if you create a PCA on one array gn1:

coords1, model = allel.pca(gn1)

...then you can use the model to transform a different array gn2, e.g.:

coords2 = model.transform(gn2)

Would this suffice, or do you also need to be able to persist the model somehow so you can run the initial PCA and then do the projection in different sessions?
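For anyone wondering what the fit-then-transform pattern above amounts to, here is a rough NumPy-only sketch. It is an illustration under stated assumptions, not the scikit-allel implementation: `ReferencePCA` is a hypothetical class, the scaling is a Patterson-style sqrt(p * (1 - p)) for diploid alternate-allele counts, and the inputs are assumed to have no missing calls.

```python
import numpy as np


class ReferencePCA:
    """Hypothetical helper: fit PCs on reference samples, project others."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, gn):
        # gn: (n_variants, n_samples) alternate-allele counts (0, 1, 2)
        x = np.asarray(gn, dtype=float).T  # samples as rows
        mean = x.mean(axis=0)
        p = mean / 2  # allele frequency, assuming diploids
        std = np.sqrt(p * (1 - p))
        self.keep_ = std > 0  # drop variants that cannot be scaled
        self.mean_ = mean[self.keep_]
        self.std_ = std[self.keep_]
        z = (x[:, self.keep_] - self.mean_) / self.std_
        # Rows of vt are the principal axes in variant space
        _, _, vt = np.linalg.svd(z, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, gn):
        # Reuse the reference centring and scaling, then project
        x = np.asarray(gn, dtype=float).T[:, self.keep_]
        return ((x - self.mean_) / self.std_) @ self.components_.T


rng = np.random.default_rng(0)
gn1 = rng.integers(0, 3, size=(200, 20))  # reference samples
gn2 = rng.integers(0, 3, size=(200, 5))   # samples to project

model = ReferencePCA(n_components=2).fit(gn1)
coords1 = model.transform(gn1)
coords2 = model.transform(gn2)
```

The key point is that gn2 is centred and scaled with the statistics learned from gn1, so both sets of coordinates live in the same space. Both arrays must cover the same variants in the same order.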

alimanfoo commented 4 years ago

(Adding link to the allel.stats.decomposition source code if anyone is wondering how this works.)