erdogant / pca

pca: A Python Package for Principal Component Analysis.
https://erdogant.github.io/pca
MIT License

Question: plans to extend this to principal coordinates analysis? #2

Closed by jolespin 4 years ago

jolespin commented 4 years ago

Do you have plans to generalize these methods to principal coordinates analysis (PCoA) so we can use non-Euclidean distances? That would be absolutely incredible, and I would use this for all of my projects.

erdogant commented 4 years ago

Dear Jolespin,

Thank you for using the pca library! This version of pca can handle one-hot data (with SparsePCA) and sparse matrices through the truncated SVD integration. What kind of data distribution (and metric) did you have in mind? Methods like MDS, LDA, SVD, and UMAP can be helpful for dimensionality reduction too.

jolespin commented 4 years ago

I use custom distances as input. It would be cool to visualize the loadings of these. For example,

df_dism = ...  # Custom distance matrix, shape (n, n)

Then I use df_dism as input to PCoA. If I use Euclidean distance as the distance function, the result is the same as PCA. Let me know if you want me to elaborate. The docs on that link help explain a little better. The problem with MDS in sklearn is that it's stochastic and depends on a random seed.
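To make the PCoA/PCA relationship above concrete, here is a minimal sketch of classical PCoA (classical MDS) written with NumPy only. The function name `pcoa` and the toy data are assumptions for illustration; this is not the pca library's API or scikit-bio's implementation. It double-centers the squared distances and takes an eigendecomposition, so there is no random seed involved, and with Euclidean distances the coordinates should match the PCA scores up to a sign flip per axis.

```python
import numpy as np


def pcoa(dist, n_components=2):
    """Classical PCoA on a symmetric (n, n) distance matrix (hypothetical helper)."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centering of the squared distances: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Eigendecomposition of the symmetric matrix B; deterministic, no seed
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean distances can yield negative eigenvalues; this sketch
    # simply clips them to zero before scaling the eigenvectors
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals


# Toy data (assumption): 20 samples, 5 features
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))

# Euclidean distance matrix between all pairs of samples
diff = x[:, None, :] - x[None, :, :]
d_euc = np.sqrt((diff ** 2).sum(axis=-1))
coords, eigvals = pcoa(d_euc, n_components=2)

# PCA scores from the SVD of the centered data
xc = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(xc, full_matrices=False)
pca_scores = u[:, :2] * s[:2]

# Compare up to the arbitrary sign of each axis
same_as_pca = np.allclose(np.abs(coords), np.abs(pca_scores), atol=1e-6)
```

Swapping `d_euc` for any other symmetric distance matrix (for example `df_dism.values`) gives the non-Euclidean case, where the equivalence with PCA no longer holds.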

jolespin commented 4 years ago

There is a really useful package for bioinformatics called scikit-bio. It has PCoA methods and a way to calculate loadings from the resulting objects. I did a quick and dirty adaptation of your biplot plotting function to use custom loadings. It's all detailed here: https://github.com/biocore/scikit-bio/issues/1710
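As a rough illustration of what "loadings" can mean when the ordination comes from a distance matrix, one common proxy is the correlation of each original feature with each ordination axis. The sketch below assumes that approach; it is not necessarily what scikit-bio computes, and the helper name `feature_axis_correlations` and the stand-in coordinates are hypothetical.

```python
import numpy as np


def feature_axis_correlations(features, coords):
    """Pearson correlation of each feature column with each ordination axis."""
    f = features - features.mean(axis=0)
    c = coords - coords.mean(axis=0)
    # Scale every column to unit length so the dot product is a correlation
    f = f / np.linalg.norm(f, axis=0)
    c = c / np.linalg.norm(c, axis=0)
    return f.T @ c  # shape (n_features, n_axes)


# Toy data (assumption): 30 samples, 4 features
rng = np.random.default_rng(1)
features = rng.normal(size=(30, 4))

# Stand-in ordination (assumption): axis 0 follows feature 0 and axis 1
# follows feature 1 with its sign flipped. In practice these would be the
# sample coordinates returned by a PCoA.
coords = np.column_stack([features[:, 0], -features[:, 1]])

arrows = feature_axis_correlations(features, coords)
```

Each row of `arrows` can then be drawn as a biplot arrow from the origin, in the same spirit as plotting PCA loadings.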

I guess PCoA might be out of scope for your pca project, but I was curious about how biplots work, and the combination of your source code and the issue above was very helpful for understanding them.

I'm going to close this unless you think otherwise. Cheers