Open jeremymanning opened 6 years ago
This paper introduces "parametric t-SNE" that can be used to transform new data (using the same transformation that was learned from training data). There's no scikit-learn implementation, but the paper includes a link to a MATLAB implementation (that we could potentially translate into Python): http://ticc.uvt.nl/~lvdrmaaten/tsne
Looks like he has a Python implementation on this page: https://lvdmaaten.github.io/tsne/
I can't see how to transform new data (in either the Python or MATLAB versions), but I haven't gone through the paper in detail.
I found the paper above through this discussion: https://www.researchgate.net/post/How_to_integrate_new_data_in_a_TSNE_map
One suggestion from that discussion is to "train a multivariate regressor to predict the map location from the input data." That's at least straightforward to implement...we could see how it works in practice.
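The regressor idea from that discussion could be sketched roughly like this (a minimal proof-of-concept, not part of hypertools; the choice of `RandomForestRegressor` and all parameter values are just illustrative assumptions):

```python
# Sketch of the regressor idea: fit t-SNE on training data only, then
# train a multivariate regressor to predict map locations from the raw
# features, so new points can be placed without refitting t-SNE.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
train = rng.randn(100, 10)  # high-dimensional training data
new = rng.randn(5, 10)      # new data we want to place in the same map

# Learn the 2D map from the training data only
embedding = TSNE(n_components=2, random_state=0).fit_transform(train)

# Multivariate regressor: features -> map location
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(train, embedding)

# Approximate "transform" for the new data
new_embedding = regressor.predict(new)
print(new_embedding.shape)  # (5, 2)
```

How faithful the regressor's placements are to a true re-embedding is an empirical question, so trying it on real data first seems worthwhile.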
BTW, solving the `transform` issue is going to be important for supporting streaming data -- I think we'll end up wanting to train the mapping onto a low-D space using some initial data, and then we can apply that mapping to new data.
And solving the `inverse` issue will be important for doing analyses like the Raiders analysis from our paper entirely within the toolbox -- i.e. applying a (different) set of transformations to two datasets, and then using the inverse of one of those sets of transformations to map one dataset back onto the original high-dimensional space of the other dataset.
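That inverse-based workflow can be sketched with PCA (one of the scikit-learn models that already supports both `transform` and `inverse_transform`); this is just an illustration of the idea, not the Raiders analysis itself:

```python
# Fit separate reduction models to two datasets, then use one model's
# inverse to express the other dataset's low-D representation in the
# first dataset's original high-dimensional space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data_a = rng.randn(50, 20)  # first dataset (defines the target space)
data_b = rng.randn(50, 20)  # second dataset

pca_a = PCA(n_components=3).fit(data_a)
pca_b = PCA(n_components=3).fit(data_b)

# Reduce the second dataset with its own model...
low_b = pca_b.transform(data_b)

# ...then invert the *first* model to map it back into dataset A's
# original high-dimensional feature space
b_in_a_space = pca_a.inverse_transform(low_b)
print(b_in_a_space.shape)  # (50, 20)
```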
The new `DataGeometry` objects allow users to apply the same sequences of operations (that were applied to old data) to new data. However, the parameters of those transformations are currently fit anew for the new data, so there is still no elegant way to map new data onto the same space as old data without refitting the model.

The fundamental challenge is that it is not always straightforward to fit the transformation parameters and then map new data onto a lower-dimensional space. For example, the scikit-learn dimensionality reduction functions (which `hypertools.reduce` wraps) don't have fully consistent APIs.

The two basic functions that would be useful to expose are:
- `model.transform`: use a fitted model to map new data onto a lower-dimensional space
- `model.inverse`: take a low-dimensional representation and map it back onto the original high-dimensional feature space

In the long run, it would be nice to support both of these operations for all of the dimensionality reduction models we support in `hypertools.reduce`. However, as a compromise that might be relatively easy to implement, I suggest that we extend the `DataGeometry` object to include `transform` and `inverse` functions. We should (initially) only support the simplest cases, where:
- `transform` is set to either `model.transform` (if the `transform` function is supported for that model in scikit-learn) or a null "identity" function that just returns whatever data it's passed and outputs a warning message (if the `transform` function is unsupported for that model in scikit-learn, or if normalization or alignment have been applied to the data)
- `inverse` is set to either `model.inverse_transform` (if supported for that model in scikit-learn) or the null identity function (if the `inverse_transform` function is unsupported for that model in scikit-learn, or if normalization or alignment have been applied to the data)

I've compiled a list of which functions support the `transform` and `inverse_transform` methods: [LINK]

Eventually we can also support normalized data (by saving the normalization parameters) and aligned data (by saving the alignment parameters). Essentially we need to save enough so that we can invert those transformations. I don't think this would be fundamentally difficult, but we just need to think through the right way to implement it.