erdogant / pca

pca: A Python Package for Principal Component Analysis.
https://erdogant.github.io/pca
MIT License

Logistic PCA and PPMI-based methods? #36

Open BradKML opened 1 year ago

BradKML commented 1 year ago

I am currently waiting on datasets in a "liked items by user" format, in which certain items are similar in nature. There are a few ways of reducing dimensionality:

What are the trade-offs and characteristics of each method? Are there other methods for a large number of binary data columns?

erdogant commented 1 year ago

Which method to use depends on the research question. But if you are starting with exploration, an unsupervised approach is always a good starting point. Try the clusteval package. Make sure to use an appropriate metric, such as Hamming distance.
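To illustrate why the metric matters for binary data, here is a minimal sketch of the Hamming distance on a toy "liked items by user" table. It uses only the standard library and does not call clusteval; the user names and like vectors are made up for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Fraction of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical data: one row per user, one column per item (1 = liked).
likes = {
    "user_a": [1, 1, 0, 0, 1, 0],
    "user_b": [1, 1, 0, 0, 0, 0],
    "user_c": [0, 0, 1, 1, 0, 1],
    "user_d": [0, 0, 1, 1, 1, 1],
}

# Pairwise distances between all users.
distances = {
    (u, v): hamming(likes[u], likes[v]) for u, v in combinations(likes, 2)
}

# The pair with the smallest distance shares the most like/dislike answers.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

A distance matrix like this can then be passed to whichever clustering method is being evaluated, assuming that method accepts precomputed distances.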

Or you can use hypergeometric tests to find significantly overlapping features. In this case, try the HNet library. More details can be found in this blog.
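As a rough sketch of the underlying idea (not HNet's API), the hypergeometric test asks how likely it is that two items are liked by at least the observed number of common users if likes were assigned independently. The function name and the counts below are hypothetical.

```python
from math import comb


def hypergeom_overlap_pvalue(n_users, n_like_a, n_like_b, n_like_both):
    """P(overlap >= n_like_both) under independence of the two items.

    n_users:     total number of users
    n_like_a:    users who liked item A
    n_like_b:    users who liked item B
    n_like_both: users who liked both A and B (the observed overlap)
    """
    max_overlap = min(n_like_a, n_like_b)
    total = comb(n_users, n_like_b)
    # Upper tail: sum the probabilities of every overlap at least as large
    # as the observed one.
    tail = sum(
        comb(n_like_a, k) * comb(n_users - n_like_a, n_like_b - k)
        for k in range(n_like_both, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 users, 5 like A, 5 like B, 4 like both.
p = hypergeom_overlap_pvalue(20, 5, 5, 4)
print(f"p-value: {p:.4f}")
```

When many item pairs are tested this way, the p-values would need a multiple-testing correction before calling any overlap significant.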

Perhaps SVD is more appropriate than PCA (it is available as an option in the pca library). Or indeed your suggestion, logistic PCA.
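One reason SVD can suit binary like-matrices is that it can be applied without mean-centering the columns, so zeros stay zeros. The sketch below uses plain NumPy rather than the pca library's own interface, and the matrix is made-up example data.

```python
import numpy as np

# Hypothetical user-by-item matrix (1 = liked); not data from this thread.
X = np.array(
    [
        [1, 1, 0, 0, 1, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 1],
        [0, 0, 1, 1, 1, 1],
    ],
    dtype=float,
)

# Thin SVD with no mean-centering: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                        # number of components to keep (an assumption)
scores = U[:, :k] * s[:k]    # user coordinates in the reduced space
loadings = Vt[:k].T          # item loadings on each component
X_approx = scores @ loadings.T  # rank-k reconstruction of X

# Share of the total squared singular values captured by the first k components.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(scores.round(2))
print(f"share captured by {k} components: {explained:.2f}")
```

For a large sparse matrix, a truncated solver that works on sparse input would be the more practical choice than the dense decomposition shown here.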

BradKML commented 1 year ago

For clarity: I tried running Logistic PCA using the Python implementation, but it crashed twice while I was working with the VES performance vs personality study, which has Yes/No personality questions. Maybe the native implementation uses too much memory. Finding "significant overlapping features" is one of the things I am looking for from PCA-like methods, but the data is strictly binary.

Q: Why is cluster evaluation useful in a task that combines dimensionality reduction, feature selection, and regression on binary data?

BradKML commented 1 year ago

Also, secondary discovery: