Closed · andrewheusser closed this issue 7 years ago
proposed API:

```python
reduced_data = hyp.tools.reduce(data, method='TSNE')
# or
hyp.plot(data, method='TSNE')
```
planning to include all manifold learning algorithms supported by scikit-learn: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold
we could consider other decomposition methods as well (like ICA):
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition
pseudo-code:

```python
# first, interpolate missing data if there is any
if missing_data:
    m = PPCA(n_components=data.shape[1])
    data = m.fit_transform(data)

# dictionary of models
models = {
    'PCA': sklearn.decomposition.PCA,
    'TSNE': sklearn.manifold.TSNE,
    ...
}

# then, reduce
return models[method](n_components=ndims).fit_transform(data)
```
@jeremymanning let me know what you think
actually, `method` should probably be `model`
we should also allow the user to change defaults, which are specific to each reduction model. perhaps:

```python
params = {
    'perplexity': 40
}
reduced_data = hyp.tools.reduce(data, model='TSNE', model_params=params)
# and
hyp.plot(data, model='TSNE', model_params=params)
```
i like the perplexity idea-- e.g. have some way of defining the number of dimensions based on some other stopping criteria. this would be easy for some dimensionality reduction algorithms (e.g. PCA, where we can simply take the top `k` factors as a `k`-dimensional reduction of the data) and harder for others (e.g. for MDS and tSNE, the solution completely changes and needs to be re-fit for different numbers of dimensions). so i'd say this idea needs more discussion in order to decide on an efficient/reasonable way of implementing it.
adding link to this comment here-- thoughts on reduction API
to expand on my most recent comment-- that link is to a proposed API for data reduction. it's along the lines of what you proposed, with some additional thoughts. @andrewheusser can you take a look at what i wrote here and then we can stabilize on an API?
@jeremymanning - re your perplexity comment above - To clarify, my suggestion wasn't about perplexity specifically, but rather about an easy way to pass parameters to each individual decomposition model. Each scikit-learn model has its own set of predefined parameters (see http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition). So, a dictionary seems like a good way to do it.
Regarding the reduction API, I like the idea of having a flag to return the model. To clarify, all of scikit-learn's decomposition algs (PCA, NMF, etc.) have the same basic API (`fit`, `transform`, `fit_transform`). All of the manifold learning algs (tSNE, MDS, etc.) have the same basic API, but lack the `transform` method. Maybe I'm mistaken, but possibly this is because you can't have a transform function for an algorithm that is specific to the dataset to which it is fit. For example, MDS computes a lower-dimensional embedding for the specific data you pass it, so it's not clear to me what a transform function would look like.
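The API difference described above can be checked directly against scikit-learn (a quick sketch; it just probes which methods each class exposes):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# decomposition models expose the full fit / transform / fit_transform API
print(hasattr(PCA, 'fit'), hasattr(PCA, 'transform'), hasattr(PCA, 'fit_transform'))

# manifold models like TSNE offer fit and fit_transform, but (at least in the
# scikit-learn versions discussed here) no standalone transform method
print(hasattr(TSNE, 'fit'), hasattr(TSNE, 'fit_transform'), hasattr(TSNE, 'transform'))
```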
Finally, re implementing the algs ourselves....I'm thinking it may be best to rely on scikit-learn as much as we can. They have a massive community supporting the project, and any improvements/enhancements made to the package will directly benefit us. In my mind, the real value of our package is to make it super easy to visualize high dimensional data, not necessarily to be a one-stop shop for dimensionality reduction algorithms, because scikit already does that really well.
I understood the distinction between the APIs for decomposition vs. manifold learning algorithms in scikit-learn. There may be good reasons for that in scikit-learn, but I don't think we want that design for hypertools.
For PCA and ICA, what the `fit` function learns is a mapping matrix that takes a set of points in `n`-dimensional space and maps it onto a `k`-dimensional space using a matrix multiplication. MDS and tSNE also need to have this mapping function, but those mapping functions are non-linear...but we can still apply those same transformations to new data. It's true that the solutions might not be optimal for new data, but that's not something unique to MDS or tSNE-- the same is true for PCA, ICA, etc.
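As a concrete illustration of re-using a learned mapping on new data (a sketch with scikit-learn's PCA; the random data is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 10))   # data the mapping is learned from
new = rng.normal(size=(5, 10))      # later data, mapped with the same matrix

pca = PCA(n_components=3).fit(train)  # learn the n -> k projection
projected = pca.transform(new)        # apply it to points never seen in fitting
print(projected.shape)  # (5, 3)
```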
In terms of implementing the algorithms ourselves, clearly if we can write wrappers for scikit-learn code, then we should do that. However, if the scikit-learn code doesn't have the functionality we want/need, we'll need to adapt accordingly (and either find another package, like we did for PPCA, or write new code, like we did for hyperalignment).
Regarding the "one stop shop for dimensionality reduction algorithms" comment-- I agree that that's not all hypertools should do, but I don't think the functionality I'm proposing is redundant with scikit-learn. For example:

```python
reduced = hyp.tools.reduce(data, method='PCA', align=True)
```

So while we should leverage existing code and the excellent work of other toolboxes where possible, hypertools is more about user convenience than replicating functionality in scikit-learn (or any other package).
also: thanks for clarifying the perplexity idea. i think your design is potentially good...are you thinking we'll handle those arguments with keyword arguments like we do for plot parameters, and pass along any non-hypertools arguments to the appropriate sklearn functions?
:) cool, I get what you are saying now. Thanks for explaining the decomposition/manifold implementation details
It does look like there is discussion around building transform functions for tSNE (https://github.com/scikit-learn/scikit-learn/issues/5361#issuecomment-147552086) and maybe the other manifold learning algs as well. Not sure what stage it's at, though. We should take a look at these discussions.
I do see the added functionality we offer over scikit (supporting lists, PPCA, hyperalignment, super easy API) and agree that these are very useful tools to have.
I'm wondering whether it would be (most) beneficial to the open science community if we tried to merge some of these features into scikit-learn (in particular PPCA and hyperalignment), so that they can be used with scikit directly, and then call them into our package. We could keep them local to our package for now, and then if they are eventually merged in, we could switch over.
For the model parameters (perplexity etc.), I was thinking that we would initialize a new kwarg in the `reduce` and `plot` functions called `model_params`:

```python
params = {
    'perplexity': 40
}
reduced_data = hyp.tools.reduce(data, model='TSNE', model_params=params)
# and
hyp.plot(data, model='TSNE', model_params=params)
# not
hyp.plot(data, model='TSNE', perplexity=40)
```

since the 'left over' kwargs to the plot function currently get passed on to matplotlib, the last line would be hard to parse internally (we'd have to account for all possible non-matplotlib kwargs and process them, or an error would be thrown).

Internally, `model_params` would be unpacked when creating a scikit-learn model object:

```python
m = PCA(**model_params)
```
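For example (a small sketch; `whiten` is just one of PCA's own keyword arguments, chosen here for illustration):

```python
from sklearn.decomposition import PCA

model_params = {'n_components': 3, 'whiten': True}
m = PCA(**model_params)  # the dict unpacks into PCA's keyword arguments
print(m.n_components, m.whiten)  # 3 True
```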
ideally we could find already-implemented versions of all of these algorithms, and either call scikit-learn or someone else's code to implement our versions (for example, by including the code under `_externals`, depending on the license).

your `params` idea looks reasonable to me...
I'm almost done with a PR adding additional reduction models to hypertools. There are quite a few techniques that use the `n_components` syntax: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition - do we want to support all of them, or a subset? The rationale to include all of them is that we don't know ahead of time what people will want to use (although we can guess what is popular). In addition to these decomposition models, there are embedding techniques that we could support as well: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold including TSNE, MDS, etc. Note: this implementation does not extend the API to include `fit`, `transform`, etc., but just adds support for data reduction using additional models, e.g.

```python
hyp.plot(data, model='TSNE')
# or
reduced_data = hyp.tools.reduce(data, model='TSNE', model_params={'perplexity': 40}, ndims=3)
```

Here's a full list of models that use the `n_components` syntax:
PCA, FastICA, IncrementalPCA, ProjectedGradientNMF, KernelPCA, FactorAnalysis, TruncatedSVD, NMF, SparsePCA, MiniBatchSparsePCA, DictionaryLearning, MiniBatchDictionaryLearning, TSNE, MDS, SpectralEmbedding, LocallyLinearEmbedding, Isomap
so we could easily include all of them.
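One way to support the whole list without hand-maintaining a dictionary entry per model is to resolve the class by name from the two sklearn modules (a hypothetical helper, not hypertools' actual implementation):

```python
import sklearn.decomposition
import sklearn.manifold

def get_model(name):
    """Resolve a model name like 'PCA' or 'TSNE' to its sklearn class."""
    for module in (sklearn.decomposition, sklearn.manifold):
        if hasattr(module, name):
            return getattr(module, name)
    raise ValueError('unsupported model: %s' % name)
```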
errors (such as passing a matrix with negative values to non-negative matrix factorization) would be handled by scikit-learn
It would be fantastic to include all of these algorithms.
We could also add a "someday" issue to compute some quick stats on the data initially and then use some sort of heuristic to select which algorithm to us based on the properties of the data. But that's definitely outside of the scope of this issue.
Re: returning the model, this issue could be expanded to return a model object that allows the user to examine the model parameters and apply the same transformations to new data. This would be important for (a) mapping the reduced data back onto the original feature space (e.g. to facilitate decoding or interpreting) and (b) supporting these models with streaming data (where the transform is computed on early data and then applied to new data).
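A sketch of what that expanded `reduce` could look like (the `return_model` flag is hypothetical; PCA stands in for whichever model is requested):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce(data, ndims=3, return_model=False):
    m = PCA(n_components=ndims)
    reduced = m.fit_transform(data)
    # optionally hand back the fitted model so the same transformation can be
    # applied to streaming data, or inverted back to the original feature space
    return (reduced, m) if return_model else reduced

X = np.random.default_rng(1).normal(size=(30, 8))
reduced, model = reduce(X, return_model=True)
later = model.transform(np.random.default_rng(2).normal(size=(4, 8)))
recovered = model.inverse_transform(reduced)  # map back to feature space
print(later.shape, recovered.shape)  # (4, 3) (30, 8)
```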
👍 - all but two of these are working: ProjectedGradientNMF and NMF - ProjectedGradientNMF is being deprecated, and NMF (apparently) requires a sparse (and non-negative) matrix as input, which is a dtype we currently don't support. @jeremymanning: should I try to add support for the sparse matrix dtype, or remove NMF from supported dim reduction techniques?
Let's keep things simple and get rid of both of those (ProjectedGradientNMF and NMF).
(also: awesome!)
👍
included in 0.3.0 release
note: do PPCA first (to fill in missing data), then call reduction model