Open jeremymanning opened 6 years ago
This paper introduces "parametric t-SNE" that can be used to transform new data (using the same transformation that was learned from training data). There's no scikit-learn implementation, but the paper includes a link to a MATLAB implementation (that we could potentially translate into Python): http://ticc.uvt.nl/~lvdrmaaten/tsne
Looks like he has a Python implementation on this page: https://lvdmaaten.github.io/tsne/
I can't see how to transform new data (in either the Python or MATLAB versions), but I haven't gone through the paper in detail.
I found the paper above through this discussion: https://www.researchgate.net/post/How_to_integrate_new_data_in_a_TSNE_map
One suggestion from that discussion is to "train a multivariate regressor to predict the map location from the input data." That's at least straightforward to implement...we could see how it works in practice.
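The regressor idea from that discussion could be sketched roughly like this (a minimal proof-of-concept, not part of hypertools; the choice of `RandomForestRegressor` and all parameter values are just illustrative assumptions):

```python
# Sketch of the regressor idea: fit t-SNE on training data only, then
# train a multivariate regressor to predict map locations from the raw
# features, so new points can be placed without refitting t-SNE.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
train = rng.randn(100, 10)  # high-dimensional training data
new = rng.randn(5, 10)      # new data we want to place in the same map

# Learn the 2D map from the training data only
embedding = TSNE(n_components=2, random_state=0).fit_transform(train)

# Multivariate regressor: features -> map location
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(train, embedding)

# Approximate "transform" for the new data
new_embedding = regressor.predict(new)
print(new_embedding.shape)  # (5, 2)
```

How faithful the regressor's placements are to a true re-embedding is an empirical question, so trying it on real data first seems worthwhile.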
BTW, solving the `transform` issue is going to be important for supporting streaming data -- I think we'll end up wanting to train the mapping onto a low-D space using some initial data, and then we can apply that mapping to new data.
And solving the `inverse` issue will be important for doing analyses like the Raiders analysis from our paper entirely within the toolbox -- i.e. applying a (different) set of transformations to two datasets, and then using the inverse of one of those sets of transformations to map one dataset back onto the original high-dimensional space of the other dataset.
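That inverse-based workflow can be sketched with PCA (one of the scikit-learn models that already supports both `transform` and `inverse_transform`); this is just an illustration of the idea, not the Raiders analysis itself:

```python
# Fit separate reduction models to two datasets, then use one model's
# inverse to express the other dataset's low-D representation in the
# first dataset's original high-dimensional space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data_a = rng.randn(50, 20)  # first dataset (defines the target space)
data_b = rng.randn(50, 20)  # second dataset

pca_a = PCA(n_components=3).fit(data_a)
pca_b = PCA(n_components=3).fit(data_b)

# Reduce the second dataset with its own model...
low_b = pca_b.transform(data_b)

# ...then invert the *first* model to map it back into dataset A's
# original high-dimensional feature space
b_in_a_space = pca_a.inverse_transform(low_b)
print(b_in_a_space.shape)  # (50, 20)
```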
The new `DataGeometry` objects allow users to apply the same sequences of operations (that were applied to old data) to new data. However, the parameters of those transformations are currently fit anew for the new data, so there is still no elegant way to map new data onto the same space as old data without refitting the model.

The fundamental challenge is that it is not always straightforward to fit the transformation parameters and then map new data onto a lower-dimensional space. For example, the scikit-learn dimensionality reduction functions (which `hypertools.reduce` wraps) don't have fully consistent APIs.

The two basic functions that would be useful to expose are:
- `model.transform`: use a fitted model to map new data onto a lower-dimensional space
- `model.inverse`: take a low-dimensional representation and map it back onto the original high-dimensional feature space

In the long run, it would be nice to support both of these operations for all of the dimensionality reduction models we support in `hypertools.reduce`. However, as a compromise that might be relatively easy to implement, I suggest that we extend the `DataGeometry` object to include `transform` and `inverse` functions. We should (initially) only support the simplest cases, where:
- `transform` is set to either `model.transform` (if the `transform` function is supported for that model in scikit-learn) or a null "identity" function that just returns whatever data it's passed and outputs a warning message (if the `transform` function is unsupported for that model in scikit-learn, or if normalization or alignment have been applied to the data)
- `inverse` is set to either `model.inverse_transform` (if supported for that model in scikit-learn) or the null identity function (if the `inverse_transform` function is unsupported for that model in scikit-learn, or if normalization or alignment have been applied to the data)

I've compiled a list of which functions support the `transform` and `inverse_transform` methods: [LINK]

Eventually we can also support normalized data (by saving the normalization parameters) and aligned data (by saving the alignment parameters). Essentially we need to save enough so that we can invert those transformations. I don't think this would be fundamentally difficult, but we just need to think through the right way to implement it.