Logistic PCA and PPMI-based methods?

erdogant / pca

pca: A Python Package for Principal Component Analysis.

https://erdogant.github.io/pca

MIT License

284 stars, 42 forks

Logistic PCA and PPMI-based methods? #36

Open BradKML opened 1 year ago

BradKML commented 1 year ago

I am waiting on datasets in a "liked items by user" format, where certain items are similar in nature. As far as I can tell, there are a few ways of reducing the dimensionality of such data:

Logistic PCA, which uses logistic (logit) curves so that binary data, coded as either +1 or -1, can be treated much like scalar data https://github.com/brudfors/logistic-PCA-Tipping/blob/main/pca.py#L6
PPMI-based methods, which use the co-occurrence of tags or words within images or sentences https://aclanthology.org/L18-1156.pdf https://github.com/Bollegala/svdmi
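
As a rough illustration of the PPMI route on "liked items by user" data, the sketch below builds an item-item co-occurrence matrix, converts it to positive pointwise mutual information, and takes a truncated SVD. It is not the svdmi code or the method from the linked paper; the function name `ppmi_embed` and the toy `likes` matrix are made up for the example.

```python
import numpy as np

def ppmi_embed(X, n_components=2):
    """Embed items from a binary user-by-item matrix via PPMI + truncated SVD."""
    X = np.asarray(X, dtype=float)
    # Item-item co-occurrence: number of users who liked both item i and item j.
    C = X.T @ X
    np.fill_diagonal(C, 0.0)  # ignore an item co-occurring with itself
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    # PMI = log(p(i, j) / (p(i) * p(j))); zero counts give -inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C * total) / (row * col))
        # Keep only finite, positive PMI values (the "positive" in PPMI).
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    # The leading singular vectors give low-dimensional item vectors.
    U, s, _ = np.linalg.svd(ppmi)
    return ppmi, U[:, :n_components] * np.sqrt(s[:n_components])

# Hypothetical toy data: 6 users, 4 items; items 0/1 and items 2/3 tend to co-occur.
likes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])
ppmi, item_vecs = ppmi_embed(likes, n_components=2)
```

For a large number of items the dense SVD here would need to be swapped for a sparse or randomized one; that choice is left open in this sketch.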

What are the trade-offs and characteristics of each method? Are there other methods for data with a large number of binary columns?

erdogant commented 1 year ago

Which method to use depends on the research question. But if you start with exploration, an unsupervised approach is always a good starting point. Try the package clusteval. Make sure to use an appropriate metric, such as Hamming distance.
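
To make the metric suggestion concrete, here is a minimal sketch of pairwise Hamming distance on binary rows, written with plain NumPy rather than clusteval (whose API is not shown in this thread). The `hamming_matrix` helper and the `answers` array are hypothetical.

```python
import numpy as np

def hamming_matrix(X):
    """Pairwise Hamming distance: fraction of columns in which two rows differ."""
    X = np.asarray(X, dtype=bool)
    # Broadcasting compares every row against every other row.
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)

# Hypothetical Yes/No answers: 4 respondents, 5 questions.
answers = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
])
D = hamming_matrix(answers)
```

The resulting matrix can be passed to any clustering routine that accepts precomputed distances. Note that the broadcast allocates an n x n x d array, so it only suits small inputs.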

Or you can use hypergeometric tests to find significantly overlapping features. In that case, try the HNet library. More details can be found in this blog.
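
For reference, the underlying test can be sketched in a few lines of standard-library Python. This only shows the hypergeometric upper tail for one pair of binary columns; it is not HNet's API, and it applies no multiple-testing correction. The name `overlap_pvalue` and the example columns are assumptions for illustration.

```python
from math import comb

def overlap_pvalue(a, b):
    """Return (observed overlap, P(overlap >= observed)) for two binary columns."""
    n = len(a)                      # number of rows (population size)
    k_a = sum(a)                    # rows where column a is 1
    k_b = sum(b)                    # rows where column b is 1
    k_ab = sum(1 for x, y in zip(a, b) if x and y)  # rows where both are 1
    denom = comb(n, k_b)
    # Upper tail of the hypergeometric distribution.
    tail = sum(
        comb(k_a, k) * comb(n - k_a, k_b - k)
        for k in range(k_ab, min(k_a, k_b) + 1)
    )
    return k_ab, tail / denom

col_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
col_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # overlaps col_a completely
col_c = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]  # overlaps col_a in one row only

overlap_ab, p_ab = overlap_pvalue(col_a, col_b)
overlap_ac, p_ac = overlap_pvalue(col_a, col_c)
```

A small p-value, as for `col_a` and `col_b`, suggests the two features co-occur more often than chance would explain.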

Perhaps SVD analysis is more appropriate than PCA (it is available as an option in the pca library). Or indeed your suggestion, logistic PCA.

BradKML commented 1 year ago

To clarify: I tried running logistic PCA with the Python implementation, but it crashed twice while I was experimenting with a VES performance vs. personality study, which consists of Yes/No personality questions. Maybe the native implementation eats up too much memory. Finding "significant overlapping features" is one of the things I am after with PCA-like methods, but the data is strictly binary.

Q: Why is cluster evaluation useful in a task that combines dimensionality reduction on binary data, feature selection, and regression?

BradKML commented 1 year ago

Also, secondary discovery:

Some say that MCA (multiple correspondence analysis) is well suited to binary and categorical data, but if so, how can it be turned into a factor model? https://github.com/MaxHalford/prince
"Correlation Explanation" (CorEx) has been used for bioinformatics data, which is often binary. However, it tends to behave like ICA, albeit similar to PCA in that it does not assume data independence. https://github.com/gregversteeg/CorEx
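
On the MCA question, one way to see the connection to factor-style scores is to write correspondence analysis out as an SVD. The sketch below is a plain-NumPy approximation of MCA on Yes/No data, not the prince API: it one-hot encodes each question into a "yes" and a "no" column, then decomposes the standardized residuals. Row coordinates play the role of factor scores and column coordinates the role of loadings, under that loose analogy. The function name `mca_binary` and the data are hypothetical.

```python
import numpy as np

def mca_binary(X, n_components=2):
    """Correspondence analysis of the yes/no indicator matrix built from X."""
    X = np.asarray(X, dtype=float)
    # Indicator matrix: one "yes" and one "no" column per question.
    Z = np.hstack([X, 1.0 - X])
    Z = Z[:, Z.sum(axis=0) > 0]          # drop categories nobody chose
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]   # respondent scores
    col_coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]   # category coordinates
    return row_coords, col_coords, s[:k] ** 2               # principal inertias

# Hypothetical answers: two groups of respondents with opposite patterns.
answers = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
row_coords, col_coords, inertia = mca_binary(answers, n_components=2)
```

The row coordinates could then be fed into a downstream regression in place of the raw binary columns, which is one reading of "turning MCA into a factor model".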