ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License
1.83k stars 160 forks

change procrustes and reduce API to look like scikit learn #98

Closed andrewheusser closed 7 years ago

andrewheusser commented 7 years ago

Since many are familiar with the scikit-learn fit/transform design, we could change the procrustes and reduce APIs to have a similar design. This would also let transforms be fit on one dataset and applied to another.

jeremymanning commented 7 years ago

@andrewheusser can you clarify what help you're looking for here?

andrewheusser commented 7 years ago

Essentially, we would extend the procrustes and reduce (and maybe align?) APIs to return a "fit model". Scikit-learn is set up like this:

# generic scikit-learn pattern
m = Model()
m.fit(data)
transformed_data = m.transform(data)
# or, equivalently, in one step:
transformed_data = m.fit_transform(data)

This lets you pass new data to a model fit on another dataset. Allowing behavior like this would help in cases where we want to apply a precomputed model to held-out data for cross-validation, or for other purposes.
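The fit-on-one-dataset, transform-another pattern can be sketched without scikit-learn itself. Below is a minimal, hypothetical `MiniPCA` (not hypertools or scikit-learn code) that stores what `fit` learns and reuses it in `transform`:

```python
import numpy as np

class MiniPCA:
    """Tiny PCA-like estimator illustrating the scikit-learn
    fit/transform pattern. Hypothetical sketch, not a real library class."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # principal axes come from the SVD of the centered training data
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:self.n_components]
        return self

    def transform(self, X):
        # project (possibly new) data onto the axes learned in fit
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 5))
test = rng.normal(size=(10, 5))

model = MiniPCA(n_components=2)
model.fit(train)                   # fit on one dataset...
projected = model.transform(test)  # ...apply to another
print(projected.shape)             # (10, 2)
```

The key design point is that `fit` stores state (`mean_`, `components_`) on the object, so `transform` can be called later on any data with matching columns.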

One idea would be to keep the API as we have it currently, but extend its functionality:

from hypertools import tools
reduced_data = tools.reduce(data)         # same as before
fit_model = tools.reduce.fit(data)        # new: returns a fitted model
reduced_data = fit_model.transform(data)  # apply the fitted model

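One way to support both call styles is to make `reduce` an instance of a callable class that also exposes `.fit`. This is a hypothetical sketch (the class and method names are assumptions, and a trivial PCA stands in for the real reduction backends), not the hypertools implementation:

```python
import numpy as np

class _Reduce:
    """Hypothetical sketch: callable like the current tools.reduce,
    but also exposing .fit to return a reusable fitted model."""

    def __call__(self, data, ndims=3):
        # one-shot reduction, matching the existing behavior
        return self.fit(data, ndims=ndims).transform(data)

    def fit(self, data, ndims=3):
        data = np.asarray(data, dtype=float)
        mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - mean, full_matrices=False)

        class FitModel:
            def transform(self, new_data):
                return (np.asarray(new_data, dtype=float) - mean) @ vt[:ndims].T

        return FitModel()

reduce = _Reduce()  # module-level instance standing in for tools.reduce

data = np.random.rand(30, 10)
reduced = reduce(data)             # same as before: one-shot reduction
fit_model = reduce.fit(data)       # new: reusable fitted model
also_reduced = fit_model.transform(data)
print(reduced.shape)               # (30, 3)
```

Because `fit` is deterministic here, calling the object directly and going through `fit`/`transform` produce identical results.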
jeremymanning commented 7 years ago

That sounds great to me!

andrewheusser commented 7 years ago

Now that I'm in the weeds here, this is actually trickier than I thought. It looks like all of the scikit-learn decomposition algorithms (PCA, FastICA, NMF, ...) have fit, transform, and fit_transform methods. However, the manifold learning algorithms (TSNE, MDS, ...) have only fit and fit_transform (no standalone transform). Thus, the transform method would only work for the decomposition-style algorithms.
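The asymmetry is easy to verify directly against scikit-learn, by checking which estimator classes expose a standalone `transform`:

```python
# Which scikit-learn reducers expose a standalone transform method?
from sklearn.decomposition import PCA, FastICA, NMF
from sklearn.manifold import TSNE, MDS

for cls in (PCA, FastICA, NMF, TSNE, MDS):
    print(cls.__name__, hasattr(cls(), "transform"))
```

The decomposition classes report `True`; TSNE and MDS report `False`, since their embeddings are computed jointly over the fitted data and cannot be applied out-of-sample.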

jeremymanning commented 7 years ago

Perhaps we could provide a standard interface for these functions, even if scikit-learn doesn't. This could be really useful. What I'm thinking is:

reduced = hyp.tools.reduce(data, method='PCA', ndims=3) returns the reduced data. method can be one of: PCA, PPCA, ICA, NMF, MDS, or tSNE.

xform = hyp.tools.reduce(data, method='PCA', ndims=3, return_xform=True) returns a transform object, fit using data, that can be applied to any new dataset of the same shape as data (if data is a single matrix/dataframe) or of the same shape as any element of data (if data is a list of arrays/dataframes).

Then, given xform, we could get the reduced data using reduced = xform.apply(new_data), where new_data could be a list of matrices, a single matrix, etc.
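The list-or-matrix dispatch in `apply` could look like the following. This is a hypothetical sketch (the `Xform` class and its attributes are assumptions, not hypertools API), again using a trivial PCA-style projection as the stand-in transform:

```python
import numpy as np

class Xform:
    """Hypothetical transform object whose .apply accepts either
    a single matrix or a list of matrices."""

    def __init__(self, mean, components):
        self.mean = mean
        self.components = components

    def _apply_one(self, x):
        return (np.asarray(x, dtype=float) - self.mean) @ self.components.T

    def apply(self, new_data):
        # dispatch on input type: list of matrices vs. single matrix
        if isinstance(new_data, list):
            return [self._apply_one(x) for x in new_data]
        return self._apply_one(new_data)

# fit a toy projection, then wrap it in an Xform
data = np.random.rand(15, 6)
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
xform = Xform(mean, vt[:3])

single = xform.apply(data)           # single matrix in -> single matrix out
several = xform.apply([data, data])  # list in -> list out
print(single.shape, len(several))    # (15, 3) 2
```

Mirroring the input container in the output keeps the round trip predictable for callers who pass lists of subjects' data.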

We will probably need to manually define all of these functions (i.e., we can't just use a common interface to scikit-learn), since it sounds like they're all implemented differently. We may also need to find other existing libraries that provide these algorithms and/or implement some ourselves.
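One way to organize those manually-defined wrappers is a registry that maps method names to fit functions, each returning an object with a uniform `transform`. A hypothetical sketch (names are assumptions; only a toy PCA wrapper is shown):

```python
import numpy as np

def _fit_pca(data, ndims):
    # toy PCA wrapper; real wrappers would delegate to scikit-learn etc.
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)

    class Fitted:
        def transform(self, x):
            return (np.asarray(x, dtype=float) - mean) @ vt[:ndims].T

    return Fitted()

# ICA, NMF, MDS, tSNE wrappers would be registered here as well
REDUCERS = {"PCA": _fit_pca}

def fit_reducer(data, method="PCA", ndims=3):
    if method not in REDUCERS:
        raise ValueError(f"unknown method: {method}")
    return REDUCERS[method](np.asarray(data, dtype=float), ndims)

model = fit_reducer(np.random.rand(12, 8), method="PCA", ndims=3)
out = model.transform(np.random.rand(5, 8))
print(out.shape)  # (5, 3)
```

The registry keeps the public interface identical across algorithms even when the underlying implementations come from different libraries, which is the main point of the proposal above.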

benefits of this design:

other considerations:

jeremymanning commented 7 years ago

Ah... I think there's some mix-up in these comments with issue #106. This issue is about hyperalignment; #106 is about data reduction.

jeremymanning commented 7 years ago

This issue is now redundant with this one. Closing...