This sounds good to me -- so the new version of reduce would work something like:
1) use PPCA to fill in missing data
2) if method=None (default), we'd default to PCA if the dataset (collectively) had fewer than our threshold number of observations (10K? 100K?) and IncrementalPCA if the dataset (collectively) had more than that number of observations
3) otherwise we'd default to whatever was specified via method.

If method=PCA, use PCA if the number of samples is small, and IncrementalPCA if the number of samples is large. If method=ICA, use ICA if the number of samples is small, and IncrementalICA if the number of samples is large. In other words, automatically default to a close approximation of what the user specifies that will still run quickly(ish). An alternative (also mentioned in the pull request) would be to just default to IncrementalPCA, unless the user explicitly asks for PCA. A rough sketch of that dispatch logic follows below.
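Here is a minimal sketch of the proposed size-based dispatch, written against plain scikit-learn rather than the actual hypertools internals. The function name pick_reducer, the N_OBS_THRESHOLD cutoff, and the list-of-arrays input are all hypothetical illustration; the PPCA imputation step from (1) is omitted, and since scikit-learn has no incremental ICA, FastICA is returned for ICA regardless of size:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, IncrementalPCA

# Hypothetical observation-count cutoff; the thread leaves the exact
# threshold (10K? 100K?) open.
N_OBS_THRESHOLD = 100000

def pick_reducer(data, method=None, n_components=3):
    """Sketch of the proposed dispatch: small datasets get the exact model,
    large ones get an incremental variant when one exists.

    `data` is assumed to be a list of 2D arrays, so the observation count
    is taken collectively across all of them.
    """
    n_obs = sum(x.shape[0] for x in data)
    if method is None or method == 'PCA':
        if n_obs > N_OBS_THRESHOLD:
            return IncrementalPCA(n_components=n_components)
        return PCA(n_components=n_components)
    if method == 'ICA':
        # scikit-learn has no incremental ICA, so FastICA is used either way.
        return FastICA(n_components=n_components)
    raise ValueError('unsupported method: %s' % method)
```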
Starting with 0.3.0, IncrementalPCA is now the default!
_Incremental principal components analysis (IPCA). Linear dimensionality reduction using Singular Value Decomposition of centered data, keeping only the most significant singular vectors to project the data to a lower dimensional space. Depending on the size of the input data, this algorithm can be much more memory efficient than a PCA. This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory. The computational overhead of each SVD is O(batch_size * n_features ** 2), but only 2 * batch_size samples remain in memory at a time. There will be n_samples / batch_size SVD computations to get the principal components, versus 1 large SVD of complexity O(n_samples * n_features ** 2) for PCA._
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA
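For reference, here is a small example of the batched workflow the docs describe, using scikit-learn's IncrementalPCA with partial_fit; the array shape, batch size, and component count are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Toy data: 10,000 samples with 50 features.
X = np.random.rand(10000, 50)

# Keep the top 3 components; process ~1,000 samples per batch so only
# on the order of batch_size rows are needed for each partial SVD.
ipca = IncrementalPCA(n_components=3, batch_size=1000)

# Feed the data in chunks; partial_fit updates the components incrementally.
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (10000, 3)
```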