ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

consider incremental pca as default for plotting large datasets #134

Closed andrewheusser closed 7 years ago

andrewheusser commented 7 years ago

_Incremental principal components analysis (IPCA). Linear dimensionality reduction using Singular Value Decomposition of centered data, keeping only the most significant singular vectors to project the data to a lower dimensional space. Depending on the size of the input data, this algorithm can be much more memory efficient than a PCA. This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory. The computational overhead of each SVD is O(batch_size * n_features^2), but only 2 * batch_size samples remain in memory at a time. There will be n_samples / batch_size SVD computations to get the principal components, versus 1 large SVD of complexity O(n_samples * n_features^2) for PCA._

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA
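
For reference, here is a minimal sketch (plain scikit-learn, not hypertools internals) of the batched fitting the docs describe, where only about batch_size samples sit in memory at a time; the random array is just a stand-in for a large dataset or np.memmap:

```python
# Minimal sketch using scikit-learn directly: fit IncrementalPCA one batch
# at a time so only ~batch_size samples are held in memory at once.
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features, batch_size = 100_000, 50, 1_000
data = np.random.randn(n_samples, n_features)  # stand-in for a large array / np.memmap

ipca = IncrementalPCA(n_components=3)
for start in range(0, n_samples, batch_size):
    ipca.partial_fit(data[start:start + batch_size])  # one small SVD per batch

reduced = ipca.transform(data)  # project onto the top 3 components
print(reduced.shape)            # (100000, 3)
```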

jeremymanning commented 7 years ago

This sounds good to me-- so the new version of reduce would work something like:

1) use PPCA to fill in missing data
2) if method=None (the default), default to PCA if the dataset (collectively) has fewer than our threshold number of observations (10K? 100K?) and to IncrementalPCA if it has more than that number (see the sketch below)
3) otherwise, default to whatever was specified via method.
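
A rough sketch of that selection logic; choose_reducer and OBS_THRESHOLD are assumed names for illustration, not the actual hypertools implementation:

```python
# Rough sketch of the proposed default selection (assumed names, not hypertools code).
from sklearn.decomposition import PCA, IncrementalPCA

OBS_THRESHOLD = 100_000  # the threshold is still undecided above (10K? 100K?)

def choose_reducer(n_observations, method=None):
    """Return a reduction model following steps 2-3 above."""
    if method is not None:
        return method  # step 3: honor an explicitly specified method
    if n_observations < OBS_THRESHOLD:
        return PCA(n_components=3)           # smaller datasets: plain PCA
    return IncrementalPCA(n_components=3)    # larger datasets: IncrementalPCA
```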

jeremymanning commented 7 years ago

An alternative (also mentioned in the pull request) would be to simply default to IncrementalPCA, unless the user explicitly asks for PCA.

andrewheusser commented 7 years ago

Starting with 0.3.0, IncrementalPCA is now the default!
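
A hypothetical usage sketch: the reduce keyword reflects the current hypertools docs and is an assumption here, so the exact parameter name may differ in older releases such as 0.3.0:

```python
# Hypothetical usage sketch -- the `reduce` keyword is assumed from the current
# hypertools docs and may differ in older releases such as 0.3.0.
import numpy as np
import hypertools as hyp

data = np.random.randn(5000, 10)

hyp.plot(data)                # uses the IncrementalPCA default described above
hyp.plot(data, reduce='PCA')  # explicitly request standard PCA instead
```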