cognoma / machine-learning

Machine learning for Project Cognoma

Integrating dimensionality reduction into the pipeline #43

Closed: htcai closed this issue 6 years ago

htcai commented 8 years ago

It would benefit all of us if dimensionality reduction could be integrated into the pipeline.

Moreover, it seems necessary to place dimensionality reduction after preliminary feature selection (keeping 5000?); otherwise, our computers are likely to run out of memory.

dhimmel commented 8 years ago

Thanks for posting this issue @htcai.

To recap for those who weren't at the meetup last night: we have an expression matrix with 7,306 samples (rows) × 20,530 genes (columns). We want to reduce the dimensionality of the genes, using a technique such as PCA. However, we were running into memory issues when using the algorithms in sklearn.

Tagging @gheimberg who has experience with applying these methods to gene expression datasets. @gheimberg and others, is the best solution to reduce the memory issue to perform feature selection before applying feature extraction?

yl565 commented 8 years ago

Which class has been tried? RandomizedPCA should use less memory than PCA.

dhimmel commented 8 years ago

@yl565, I don't remember anyone trying sklearn.decomposition.RandomizedPCA, which looks like it's designed to solve this problem. For reference, sklearn cites the following two studies: Halko et al 2011 and Martinsson et al 2011.

So I guess we should compare the performance of classifier pipelines that use (sketched below):

  1. an approximate decomposition
  2. feature selection followed by an exact decomposition
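
A minimal sketch of the two candidates, purely for illustration: the classifier, component counts, and variable names below are placeholders rather than the notebook's actual settings, and RandomizedPCA is the class available in scikit-learn at the time (it was later folded into PCA(svd_solver='randomized')).

from sklearn.decomposition import PCA, RandomizedPCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Option 1: approximate (randomized) decomposition on all 20,530 genes
approx_pipeline = make_pipeline(
    RandomizedPCA(n_components=500),
    SGDClassifier(loss='log'))

# Option 2: univariate feature selection first, then an exact decomposition
select_then_pca_pipeline = make_pipeline(
    SelectKBest(k=5000),    # keep the 5,000 highest-scoring genes
    PCA(n_components=500),
    SGDClassifier(loss='log'))

Either pipeline could then be fit and evaluated the same way as the current one, e.g. approx_pipeline.fit(X_train, y_train) on the training split.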
dhimmel commented 8 years ago

Also tagging @NabeelSarwar.

htcai commented 8 years ago

@dhimmel @yl565 Thank you for the references and discussion! Should everyone claim a dimensionality reduction algorithm (including a choice between options 1 and 2)?

dhimmel commented 8 years ago

Should everyone claim a dimensionality reduction algorithm (including a choice between options 1 and 2)?

Great idea! Let people know which one you choose below. So that we're all on the same page, make sure you're using the latest data retrieved by 1.download.ipynb. I recommend starting with algorithms/SGDClassifier-master.ipynb.

It may also be nice to print out max memory usage at the end of the script (not sure if this will work on all OSes):

import resource

# Get peak memory usage (kilobytes on Linux, bytes on macOS)
# https://docs.python.org/3/library/resource.html#resource.getrusage
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
beelze-b commented 8 years ago

I will work on factor analysis.

htcai commented 8 years ago

I would like to try Linear Discriminant Analysis (LDA) after feature selection. I will look for other commands that can report max memory usage if the one above does not work.

Also, maybe we should select a uniform number of features. For instance, select 5,000 features and then reduce the dimensionality to 2,000 or 500.
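
As a rough sketch of that two-stage idea (the selector, classifier, and counts below are placeholders; note that scikit-learn's LinearDiscriminantAnalysis, used as a transformer, yields at most n_classes - 1 components, so a binary mutation label gives a single dimension):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

lda_pipeline = make_pipeline(
    SelectKBest(k=5000),           # preliminary feature selection
    LinearDiscriminantAnalysis(),  # supervised projection, at most n_classes - 1 dims
    SGDClassifier(loss='log'))     # placeholder downstream classifier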

beelze-b commented 8 years ago

I suggest 2000, to keep at least 10% of the features. This could be fine-tuned using algorithms that report the information contained in each component, but I think we should err on the side of more features.

yl565 commented 8 years ago

I tried PCA; it seems to run fine on my computer:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    PCA(n_components=500),
    StandardScaler(),  # feature scaling
    clf_grid)
yl565 commented 8 years ago

Peak memory is about 9 GB on Ubuntu.

dhimmel commented 8 years ago

@yl565 is it important to standardize before performing PCA?

IncrementalPCA may also circumvent memory issues.
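
A hedged sketch of that substitution, reusing the clf_grid object from the snippet above and with an arbitrary batch size:

from sklearn.decomposition import IncrementalPCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    IncrementalPCA(n_components=500, batch_size=1000),  # fits in minibatches
    StandardScaler(),  # feature scaling
    clf_grid)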

beelze-b commented 8 years ago

Most of these algorithms will also do whitening and build the covariance matrix for you, or so I thought.

yl565 commented 8 years ago

From the PCA source code, it seems the data is demeaned but not standardized. Standardizing may help improve classification performance.
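
If standardizing first turns out to help, a minimal sketch would just move the scaler ahead of the decomposition (same placeholder settings as above):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),       # standardize each gene before the decomposition
    PCA(n_components=500),
    clf_grid)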

The following figure shows the memory cost of the three algorithms; either RandomizedPCA or IncrementalPCA (I used n_batch=1000) should work fine. All three produce a classification test AUROC of 0.93. For minimizing memory cost, IncrementalPCA is best, at the cost of longer computation time.

[figure_1-3: memory profiles of PCA, RandomizedPCA, and IncrementalPCA]

beelze-b commented 8 years ago

I believe we only tried Factor Analysis and LDA. I ran out of memory using FactorAnalysis inside the pipeline with the randomized solver. This was without selecting features beforehand. I will try to get some updates on the memory usage before the weekend.
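
For reference, a sketch of what adding feature selection beforehand might look like; the values and the randomized SVD solver below are illustrative, not what was actually run:

from sklearn.decomposition import FactorAnalysis
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline

# A classifier would follow as the final pipeline step, as in the other sketches.
fa_pipeline = make_pipeline(
    SelectKBest(k=5000),  # preliminary feature selection
    FactorAnalysis(n_components=500, svd_method='randomized'))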

dhimmel commented 8 years ago

@yl565 really cool analysis. Can you link to the source code? If you just need a quick place to upload a file, you can check out GitHub Gists.

So here is my interpretation of your plot. It looks like loading the data peaks at ~4.5 GB of memory and stabilizes around 4 GB -- hence 32-bit systems run into a memory error. PCA appears to require an additional ~5 GB of memory, RandomizedPCA ~2 GB, and IncrementalPCA ~1.8 GB.

PCA and RandomizedPCA took about the same runtime, while IncrementalPCA took ~30% longer.

According to the sklearn docs:

The IncrementalPCA object uses a different form of processing and allows for partial computations which almost exactly match the results of PCA while processing the data in a minibatch fashion.

Depending on the extent of "almost exactly match", I think a good option is to use PCA/IncrementalPCA if we expect there to be a memory issue. However, it's also worth noting that the peak memory usage of 9 GB can be handled by many systems. Therefore, I still think it makes sense to try algorithms without an out-of-core (partial_fit) implementation.
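
For context, "out-of-core" here refers to partial_fit-style fitting, where the decomposition only processes a chunk of samples at a time. A minimal sketch, assuming X is the expression matrix already loaded in memory and using an illustrative chunk count:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=500)
# The decomposition works on one row chunk at a time, keeping its
# internal working memory small (X itself is still fully loaded here).
for chunk in np.array_split(X, 10):
    ipca.partial_fit(chunk)
X_reduced = ipca.transform(X)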

mans2singh commented 8 years ago

@yl565 - Are you working with PCA or IncrementalPCA? I started on PCA, but if you are working on it, I can try IncrementalPCA.

yl565 commented 8 years ago

@dhimmel the source code: https://gist.github.com/yl565/caf34bce62cb0fb4fa0c1a26a298e1d6

Use memory_profiler to run the code from the command line:

mprof run test_PCA_peak_memory.py
mprof plot

@mans2singh I'm not currently working on either. They should produce the same (or very close) results. You could try PCA first, and if you run out of memory, try IncrementalPCA instead.