ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

consider incremental pca as default for plotting large datasets #134

Closed andrewheusser closed 7 years ago

andrewheusser commented 7 years ago

_Incremental principal components analysis (IPCA). Linear dimensionality reduction using Singular Value Decomposition of centered data, keeping only the most significant singular vectors to project the data to a lower dimensional space. Depending on the size of the input data, this algorithm can be much more memory efficient than a PCA. This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory. The computational overhead of each SVD is O(batch_size * n_features^2), but only 2 * batch_size samples remain in memory at a time. There will be n_samples / batch_size SVD computations to get the principal components, versus 1 large SVD of complexity O(n_samples * n_features^2) for PCA._

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA
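
For reference, here is a minimal sketch (plain scikit-learn, not hypertools internals) of the batched fitting the docs describe, where only about batch_size samples sit in memory at a time; the random array is just a stand-in for a large dataset or np.memmap:

```python
# Minimal sketch using scikit-learn directly: fit IncrementalPCA one batch
# at a time so only ~batch_size samples are held in memory at once.
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features, batch_size = 100_000, 50, 1_000
data = np.random.randn(n_samples, n_features)  # stand-in for a large array / np.memmap

ipca = IncrementalPCA(n_components=3)
for start in range(0, n_samples, batch_size):
    ipca.partial_fit(data[start:start + batch_size])  # one small SVD per batch

reduced = ipca.transform(data)  # project onto the top 3 components
print(reduced.shape)            # (100000, 3)
```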

jeremymanning commented 7 years ago

This sounds good to me-- so the new version of reduce would work something like:

1) use PPCA to fill in missing data
2) if method=None (the default), default to PCA if the dataset (collectively) has fewer than our threshold number of observations (10K? 100K?) and to IncrementalPCA if it has more than that number (see the sketch below)
3) otherwise, default to whatever was specified via method.
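
A rough sketch of that selection logic; choose_reducer and OBS_THRESHOLD are assumed names for illustration, not the actual hypertools implementation:

```python
# Rough sketch of the proposed default selection (assumed names, not hypertools code).
from sklearn.decomposition import PCA, IncrementalPCA

OBS_THRESHOLD = 100_000  # the threshold is still undecided above (10K? 100K?)

def choose_reducer(n_observations, method=None):
    """Return a reduction model following steps 2-3 above."""
    if method is not None:
        return method  # step 3: honor an explicitly specified method
    if n_observations < OBS_THRESHOLD:
        return PCA(n_components=3)           # smaller datasets: plain PCA
    return IncrementalPCA(n_components=3)    # larger datasets: IncrementalPCA
```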

jeremymanning commented 7 years ago

An alternative (also mentioned in the pull request) would be to simply default to IncrementalPCA, unless the user explicitly asks for PCA.

andrewheusser commented 7 years ago

Starting with 0.3.0, IncrementalPCA is now the default!
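
A hypothetical usage sketch: the reduce keyword reflects the current hypertools docs and is an assumption here, so the exact parameter name may differ in older releases such as 0.3.0:

```python
# Hypothetical usage sketch -- the `reduce` keyword is assumed from the current
# hypertools docs and may differ in older releases such as 0.3.0.
import numpy as np
import hypertools as hyp

data = np.random.randn(5000, 10)

hyp.plot(data)                # uses the IncrementalPCA default described above
hyp.plot(data, reduce='PCA')  # explicitly request standard PCA instead
```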