Currently, if you want to repeatedly transform text samples with hypertools.tools.format_data() using the same parameters, the function re-fits both the vectorizer and the text model on every call. This is inefficient, and for expensive or numerous operations it makes working directly with the underlying sklearn classes the better option.
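For reference, here is a rough sketch of the workaround I mean: fit the sklearn objects once, then only call transform on new samples. It assumes CountVectorizer and LatentDirichletAllocation as stand-ins for the vectorizer and text model; the documents and n_components=2 are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely illustrative
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied after the report",
]
new_docs = ["my cat chased the dog", "the report moved the market"]

# Assumed stand-ins for the vectorizer and text model
vectorizer = CountVectorizer()
text_model = LatentDirichletAllocation(n_components=2, random_state=0)

# Fit both models once on the training samples
train_topics = text_model.fit_transform(vectorizer.fit_transform(train_docs))

# Later calls reuse the fitted models: transform only, no re-fitting
new_topics = text_model.transform(vectorizer.transform(new_docs))
```

This works, but you have to keep track of two fitted objects and call them in the right order yourself.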
We could add an argument to return the fitted models for reuse, but a really nice feature would be something like a scikit-learn Pipeline object that you could create, fit, save, and reuse to perform various processing steps with a single call. This would also be a very attractive feature for hypertools, since it could additionally implement methods like .plot() and .describe().
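To make the idea concrete, this is roughly what the create/fit/save/reuse workflow looks like with a plain scikit-learn Pipeline today. It is only a sketch of the behavior I have in mind, not a proposed hypertools API: the step names, the toy documents, and the file name are all hypothetical, and it does not include anything like .plot() or .describe().

```python
import os
import tempfile

import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus, purely illustrative
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied after the report",
]
new_docs = ["my cat chased the dog", "the report moved the market"]

# Create: chain the vectorizer and text model (step names are hypothetical)
text_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("text_model", LatentDirichletAllocation(n_components=2, random_state=0)),
])

# Fit once
text_pipeline.fit(train_docs)

# Save the fitted pipeline to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "text_pipeline.joblib")
joblib.dump(text_pipeline, path)
reloaded = joblib.load(path)

# Reuse: a single call runs every step without re-fitting
topics = reloaded.transform(new_docs)
```

A hypertools version could wrap an object like this and add its own methods on top.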