ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

extend DataGeometry objects to provide access to "transform" and "inverse" methods #161

Open jeremymanning opened 6 years ago

jeremymanning commented 6 years ago

The new DataGeometry objects allow users to apply the same sequence of operations (that were applied to old data) to new data. However, the parameters of those transformations are currently re-fit on the new data. So there is still no clean way to map new data onto the same space as the old data without refitting the model.

The fundamental challenge is that it is not always straightforward to fit the transformation parameters once and then map new data into a lower-dimensional space. For example, the scikit-learn dimensionality reduction functions (which hypertools.reduce wraps) don't have fully consistent APIs.

The two basic functions that would be useful to expose are:

- transform: map new data into the low-dimensional space that was fit to the original data
- inverse: map low-dimensional data back into the original high-dimensional space

In the long run, it would be nice to support both of these operations for all of the dimensionality reduction models we support in hypertools.reduce. However, as a compromise that might be relatively easy to implement, I suggest that we extend the DataGeometry object to include transform and inverse functions. We should (initially) only support the simplest cases where:

- the underlying scikit-learn model already implements transform and/or inverse_transform

I've compiled a list of which functions support the transform and inverse_transform methods: [LINK]
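A minimal sketch of the "simplest case" idea, using PCA (which supports both methods) as a stand-in for whatever reducer a DataGeometry object wraps; the variable names are hypothetical, not hypertools API:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
old_data = rng.normal(size=(100, 10))
new_data = rng.normal(size=(20, 10))

# Fit the reduction once on the old data...
pca = PCA(n_components=3).fit(old_data)

# ...then map new data into the SAME low-D space without refitting,
low_d = pca.transform(new_data)           # shape (20, 3)

# and map low-D coordinates back up to the original high-D space.
recovered = pca.inverse_transform(low_d)  # shape (20, 10)

# TSNE, by contrast, only offers fit_transform -- there is no
# out-of-sample transform, which is the API inconsistency above.
assert not hasattr(TSNE, "transform")
```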

Eventually we can also support normalized data (by saving the normalization parameters) and aligned data (by saving the alignment parameters). Essentially we need to save enough state to invert those transformations. I don't think this would be fundamentally difficult, but we just need to think through the right way to implement it.
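For the normalization case, scikit-learn's StandardScaler already does exactly this bookkeeping: fitting stores the parameters (mean_, scale_), which is what makes the transformation invertible later. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

# Fitting saves the normalization parameters (scaler.mean_,
# scaler.scale_) -- enough state to invert the transformation.
scaler = StandardScaler().fit(data)
normalized = scaler.transform(data)

# Round-trip: inverse_transform recovers the original data.
restored = scaler.inverse_transform(normalized)
print(np.allclose(restored, data))  # True
```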

jeremymanning commented 6 years ago

This paper introduces "parametric t-SNE" that can be used to transform new data (using the same transformation that was learned from training data). There's no scikit-learn implementation, but the paper includes a link to a MATLAB implementation (that we could potentially translate into Python): http://ticc.uvt.nl/~lvdrmaaten/tsne

andrewheusser commented 6 years ago

looks like he has a python implementation on this page: https://lvdmaaten.github.io/tsne/

jeremymanning commented 6 years ago

I can't see how to transform new data (in either the python or matlab versions), but I haven't gone through the paper in detail.

I found the paper above through this discussion: https://www.researchgate.net/post/How_to_integrate_new_data_in_a_TSNE_map

One suggestion from that discussion is to "train a multivariate regressor to predict the map location from the input data." That's at least straightforward to implement...we could see how it works in practice.
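A toy sketch of that regressor idea, using a k-nearest-neighbors regressor (my choice for illustration; any multivariate regressor would do) to predict t-SNE map locations for out-of-sample points:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
train = rng.normal(size=(60, 10))
new = rng.normal(size=(5, 10))

# 1. Embed the training data with t-SNE (no out-of-sample transform).
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(train)

# 2. Train a regressor to predict embedding coordinates from raw data.
mapper = KNeighborsRegressor(n_neighbors=5).fit(train, embedding)

# 3. Approximate where new points would land in the learned map.
new_coords = mapper.predict(new)
print(new_coords.shape)  # (5, 2)
```

How faithful the approximation is would depend on the regressor and the data, so (as suggested above) we'd want to see how it works in practice.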

BTW, solving the transform issue is going to be important for supporting streaming data-- I think we'll end up wanting to train the mapping onto a low-D space using some initial data, and then we can apply that mapping to new data.
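For the streaming case, scikit-learn's IncrementalPCA is one existing model that fits this pattern: it learns the mapping from batches via partial_fit, then the frozen mapping can be applied to each new batch as it arrives. A minimal sketch:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(3)

# Train the mapping onto a low-D space using some initial data,
# consumed in batches...
ipca = IncrementalPCA(n_components=3)
for _ in range(5):
    ipca.partial_fit(rng.normal(size=(40, 10)))

# ...then apply that (now fixed) mapping to newly arriving data.
stream_batch = rng.normal(size=(8, 10))
low_d = ipca.transform(stream_batch)
print(low_d.shape)  # (8, 3)
```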

And solving the inverse issue will be important for doing analyses like the Raiders analysis from our paper entirely within the toolbox-- i.e. applying a (different) set of transformations to two datasets, and then using the inverse of one of those sets of transformations to map one dataset back onto the original high-dimensional space of the other dataset.
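A toy version of that round-trip, with plain PCA standing in for the full transformation sequences (the real analysis would also involve normalization and alignment steps): reduce dataset A with its own model, then push the low-D coordinates through dataset B's inverse to land in B's original high-D space.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
dataset_a = rng.normal(size=(100, 20))
dataset_b = rng.normal(size=(100, 20))

# Apply a (different) transformation to each dataset.
pca_a = PCA(n_components=5).fit(dataset_a)
pca_b = PCA(n_components=5).fit(dataset_b)
low_a = pca_a.transform(dataset_a)

# Invert dataset B's transformation to map dataset A's low-D
# coordinates into dataset B's original high-dimensional space.
a_in_b_space = pca_b.inverse_transform(low_a)
print(a_in_b_space.shape)  # (100, 20)
```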