ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

enh: add support for all scikit-learn reduction methods #106

Closed · andrewheusser closed this 7 years ago

andrewheusser commented 7 years ago

note: do PPCA first (to fill in missing data), then call reduction model

andrewheusser commented 7 years ago

proposed API:

reduced_data = hyp.tools.reduce(data, method='TSNE')
#or
hyp.plot(data, method='TSNE')

planning to include all manifold learning algorithms supported by scikit-learn:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold

we could consider other decomposition methods as well (like ICA):

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition

pseudo-code:

# first interpolate missing data if there is any
if missing_data:
  m = PPCA(n_components=data.shape[1])
  data = m.fit_transform(data)

# dictionary of models
models = {
  'PCA': sklearn.decomposition.PCA,
  'TSNE': sklearn.manifold.TSNE,
  ...
}

# then, reduce
return models[method](n_components=ndims).fit_transform(data)
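
in runnable form, that might look something like this (a sketch: names are illustrative, the model dictionary is truncated, and simple column-mean imputation stands in for the PPCA step):

import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE, MDS

# dictionary of models (truncated sketch)
models = {
    'PCA': PCA,
    'FastICA': FastICA,
    'TSNE': TSNE,
    'MDS': MDS,
}

def reduce(data, method='PCA', ndims=3):
    data = np.asarray(data, dtype=float)

    # first interpolate missing data if there is any
    # (simple column-mean imputation standing in for PPCA here)
    if np.isnan(data).any():
        col_means = np.nanmean(data, axis=0)
        data = np.where(np.isnan(data), col_means, data)

    # then, reduce
    return models[method](n_components=ndims).fit_transform(data)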

@jeremymanning let me know what you think

andrewheusser commented 7 years ago

actually method should probably be model

andrewheusser commented 7 years ago

we should also allow the user to change defaults, which are specific to each reduction model. perhaps:

params = {
  'perplexity':40
}
reduced_data = hyp.tools.reduce(data, model='TSNE', model_params=params)
#and
hyp.plot(data, model='TSNE', model_params=params)

jeremymanning commented 7 years ago

i like the perplexity idea-- e.g. have some way of defining the number of dimensions based on some other stopping criteria. this would be easy for some dimensionality reduction algorithms (e.g. PCA, where we can simply take the top k factors as a k-dimensional reduction of the data) and harder for others (e.g. for MDS and tSNE, the solution completely changes and needs to be re-fit for different numbers of dimensions). so i'd say this idea needs more discussion in order to decide on an efficient/reasonable way of implementing it.
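
for example, sketching the PCA case (toy random data, illustrative only):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)

# a k-dimensional PCA reduction is just the top k columns of a fuller fit,
# so changing the number of dimensions requires no refit
full = PCA(n_components=10).fit_transform(X)
top3 = PCA(n_components=3).fit_transform(X)
assert np.allclose(np.abs(full[:, :3]), np.abs(top3))  # equal up to sign

# MDS and tSNE solutions don't nest this way: a 2-d embedding is not the
# first two columns of a 3-d embedding, so each dimensionality is a new fit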

jeremymanning commented 7 years ago

adding link to this comment here-- thoughts on reduction API

jeremymanning commented 7 years ago

to expand on my most recent comment-- that link is to a proposed API for data reduction. it's along the lines of what you proposed, with some additional thoughts. @andrewheusser can you take a look at what i wrote here and then we can stabilize on an API?

andrewheusser commented 7 years ago

@jeremymanning - re your perplexity comment above - To clarify, my suggestion wasn't about perplexity specifically, but rather about an easy way to pass parameters to each individual decomposition model. Each scikit-learn model has its own set of predefined parameters (see http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition). So, a dictionary seems like a good way to do it.

Regarding the reduction API, I like the idea of having a flag to return the model. To clarify, all of scikit-learn's decomposition algs (PCA, NMF, etc.) have the same basic API (fit, transform, fit_transform). All of the manifold learning algs (tSNE, MDS, etc.) have the same basic API, but lack the transform method. Maybe I'm mistaken, but this may be because you can't have a transform function for an algorithm that is specific to the dataset to which it is fit. For example, MDS computes a lower-dimensional embedding for the specific data you pass it, so it's not clear to me what a transform function would look like.
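
To illustrate the difference with toy data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

X = np.random.rand(100, 10)

pca = PCA(n_components=3).fit(X)
low = pca.transform(X)      # decomposition models expose transform()

mds = MDS(n_components=3)
emb = mds.fit_transform(X)  # manifold models like MDS only embed the data they were fit to
# mds.transform(X)          # would raise AttributeError: MDS has no transform method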

Finally, re implementing the algs ourselves... I'm thinking it may be best to rely on scikit-learn as much as we can. They have a massive community supporting the project, and any improvements/enhancements made to the package will directly benefit us. In my mind, the real value of our package is to make it super easy to visualize high dimensional data, not necessarily to be a one stop shop for dimensionality reduction algorithms, because scikit already does that really well.

jeremymanning commented 7 years ago

I understood the distinction between the APIs for decomposition vs. manifold learning algorithms in scikit-learn. There may be good reasons for that in scikit-learn, but I don't think we want that design for hypertools.

For PCA and ICA, what the fit function learns is a mapping matrix that takes a set of points in n-dimensional space and maps it onto a k-dimensional space using a matrix multiplication. MDS and tSNE also need to have this mapping function, but those mapping functions are non-linear...but we can still apply those same transformations to new data. It's true that the solutions might not be optimal for new data, but that's not something unique to MDS or tSNE-- the same is true for PCA, ICA, etc.
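
Concretely, for the linear case (a sketch with random data):

import numpy as np
from sklearn.decomposition import PCA

train = np.random.rand(50, 10)
new = np.random.rand(20, 10)

pca = PCA(n_components=3).fit(train)

# the learned mapping is just centering plus a matrix multiplication,
# so it can be applied to data the model has never seen
manual = (new - pca.mean_) @ pca.components_.T
assert np.allclose(manual, pca.transform(new))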

In terms of implementing the algorithms ourselves, clearly if we can write wrappers for scikit-learn code, then we should do that. However, if the scikit-learn code doesn't have the functionality we want/need, we'll need to adapt accordingly (and either find another package, like we did for PPCA, or write new code, like we did for hyperalignment).

Regarding the "one stop shop for dimensionality reduction algorithms" comment-- I agree that that's not all hypertools should do, but I don't think the functionality I'm proposing is redundant with scikit-learn. For example:

So while we should leverage existing code and the excellent work of other toolboxes where possible, hypertools is more about user convenience than replicating functionality in scikit-learn (or any other package).

jeremymanning commented 7 years ago

also: thanks for clarifying the perplexity idea. i think your design is potentially good...are you thinking we'll handle those arguments with keyword arguments like we do for plot parameters, and pass along any non-hypertools keyword arguments to the appropriate sklearn functions?

andrewheusser commented 7 years ago

:) cool, I get what you are saying now. Thanks for explaining the decomposition/manifold implementation details

It does look like there is discussion around building transform functions for tSNE (https://github.com/scikit-learn/scikit-learn/issues/5361#issuecomment-147552086) and maybe the other manifold learning algs as well. Not sure what stage it's at, though. We should take a look at these discussions.

I do see the added functionality we offer over scikit (supporting lists, PPCA, hyperalignment, super easy API) and agree that these are very useful tools to have.

I'm wondering whether it would be (most) beneficial to the open science community if we tried to merge some of these features into scikit-learn (in particular PPCA and hyperalignment), so that they can be used with scikit directly, and then call them from our package. We could keep them local to our package for now, and then switch over if they are eventually merged in.

For the model parameters (perplexity etc), I was thinking that we would initialize a new kwarg in the reduce and plot function called model_params:

params = {
  'perplexity':40
}
reduced_data = hyp.tools.reduce(data, model='TSNE', model_params=params)
#and
hyp.plot(data, model='TSNE', model_params=params)
#not
hyp.plot(data, model='TSNE', perplexity=40)

since the 'left over' kwargs to the plot function currently get passed on to matplotlib, the last line would be hard to parse internally (we'd have to account for all possible non-matplotlib kwargs and process them, or an error would be thrown).

Internally, model_params would be unpacked when creating a scikit-learn model object: m = PCA(**model_params)
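
e.g., something like this (a sketch; the helper name is just illustrative):

from sklearn.manifold import TSNE

def build_model(model_class, ndims, model_params=None):
    # merge the required dimensionality with any user-supplied parameters
    params = dict(model_params or {})
    params.setdefault('n_components', ndims)
    return model_class(**params)

m = build_model(TSNE, ndims=3, model_params={'perplexity': 40})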

jeremymanning commented 7 years ago

ideally we could find already-implemented versions of all of these algorithms, and either call scikit-learn or use someone else's code to implement our versions. for example:

your params idea looks reasonable to me...

andrewheusser commented 7 years ago

I'm almost done with a PR adding additional reduction models to hypertools. There are quite a few techniques that use the n_components syntax: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition - do we want to support all of them, or a subset? The rationale for including all of them is that we don't know ahead of time what people will want to use (although we can guess what will be popular). In addition to these decomposition models, there are embedding techniques that we could support as well: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold including TSNE, MDS, etc. Note: this implementation does not extend the API to include fit, transform, etc., but just adds support for data reduction using additional models, e.g.

hyp.plot(data, model='TSNE')
#or
reduced_data = hyp.tools.reduce(data, model='TSNE', model_params={'perplexity':40}, ndims=3)

andrewheusser commented 7 years ago

Here's a full list of models that use the n_components syntax:

PCA, FastICA, IncrementalPCA, ProjectedGradientNMF, KernelPCA, FactorAnalysis, TruncatedSVD, NMF, SparsePCA, MiniBatchSparsePCA, DictionaryLearning, MiniBatchDictionaryLearning, TSNE, MDS, SpectralEmbedding, LocallyLinearEmbedding, Isomap

so we could easily include all of them.
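
e.g., the dictionary could be built straight from the two sklearn modules (a sketch; ProjectedGradientNMF and NMF come up below and are left out here):

from sklearn import decomposition, manifold

models = {
    'PCA': decomposition.PCA,
    'FastICA': decomposition.FastICA,
    'IncrementalPCA': decomposition.IncrementalPCA,
    'KernelPCA': decomposition.KernelPCA,
    'FactorAnalysis': decomposition.FactorAnalysis,
    'TruncatedSVD': decomposition.TruncatedSVD,
    'SparsePCA': decomposition.SparsePCA,
    'MiniBatchSparsePCA': decomposition.MiniBatchSparsePCA,
    'DictionaryLearning': decomposition.DictionaryLearning,
    'MiniBatchDictionaryLearning': decomposition.MiniBatchDictionaryLearning,
    'TSNE': manifold.TSNE,
    'MDS': manifold.MDS,
    'SpectralEmbedding': manifold.SpectralEmbedding,
    'LocallyLinearEmbedding': manifold.LocallyLinearEmbedding,
    'Isomap': manifold.Isomap,
}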

andrewheusser commented 7 years ago

errors (such as passing a matrix with negative values to non-negative matrix factorization) would be handled by scikit-learn
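
e.g. (toy sketch):

import numpy as np
from sklearn.decomposition import NMF

X = np.random.randn(10, 5)  # standard normal draws, so some values are negative
try:
    NMF(n_components=2).fit_transform(X)
except ValueError as err:
    print(err)  # scikit-learn's own input check reports the negative values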

jeremymanning commented 7 years ago

It would be fantastic to include all of these algorithms.

We could also add a "someday" issue to compute some quick stats on the data initially and then use some sort of heuristic to select which algorithm to use based on the properties of the data. But that's definitely outside the scope of this issue.

Re: returning the model, this issue could be expanded to return a model object that allows the user to examine the model parameters and apply the same transformations to new data. This would be important for (a) mapping the reduced data back onto the original feature space (e.g. to facilitate decoding or interpretation) and (b) supporting these models with streaming data (where the transform is computed on early data and then applied to new data).
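
A sketch of what that might look like (hypothetical signature; PCA for concreteness):

import numpy as np
from sklearn.decomposition import PCA

def reduce(data, ndims=3, return_model=False):  # hypothetical flag
    model = PCA(n_components=ndims)
    reduced = model.fit_transform(data)
    return (reduced, model) if return_model else reduced

early = np.random.rand(100, 20)
later = np.random.rand(10, 20)

reduced, model = reduce(early, return_model=True)
recon = model.inverse_transform(reduced)  # (a) map back to the original feature space
stream = model.transform(later)           # (b) apply the same transform to streaming data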

andrewheusser commented 7 years ago

👍 - all but two of these are working: ProjectedGradientNMF and NMF. ProjectedGradientNMF is being deprecated, and NMF (apparently) requires a sparse (and non-negative) matrix as input, which is a dtype we don't currently support. @jeremymanning: should I try to add support for the sparse matrix dtype, or remove NMF from the supported dim reduction techniques?

jeremymanning commented 7 years ago

Let's keep things simple and get rid of both of those (ProjectedGradientNMF and NMF).

jeremymanning commented 7 years ago

(also: awesome!)

andrewheusser commented 7 years ago

👍

andrewheusser commented 7 years ago

included in 0.3.0 release