Closed cbarrick closed 6 years ago
What docs
are you referring to? I think I can do documentation?
The Preprocess
class just needs documentation about how it works. I should probably do that since I wrote the code. I may make some more tweaks anyway before I put it up for review.
This is really Pipeliney :laughing:
If you're going to support arbitrary estimators / transformers then Preprocess
could be abused as a full pipeline.
What do you think about moving most of this logic to a Pipeline
base class?
Preprocess
would extend Pipeline
and add the shortcut functions ngram
, tf
, idf
, ...
We'll probably wind up adding a Postprocess
class as part of #18 and it could operate in a similar manner.
One problem is that this class is designed to manipulate a single 'features' column in-place. My thought was that this could help free memory sooner. This is fine for preprocess type estimators, but things like Naive Bayes spit out multiple columns.
Honestly, this may be overkill. It's getting so similar to ml.Pipeline
that we may want to use that instead. I think the only way this can be useful over pipelines is to have a targeted purpose, like preprocessing. I don't know if pipelines allow us to drop intermediate features to save memory. I also don't know if dropping intermediate features helps.
Maybe we should just use pipelines. I'll test out that workflow.
I put this up for review. We need to think carefully if we want to use this or ml.Pipeline
.
Whoops. I think I broke it in f953418. Will fix.
After looking at Pipelines for a bit, I think I prefer this API. Things like optional idf is more convoluted than it should be with Pipelines, but pretty easy with this.
I also like the split between the preprocess pipeline and the model itself. That's the point at which you want the most flexibility anyway. So embracing that split and focusing on preprocessing gives us a chance to make a more specialized and easy-to-use API.
As part of the fix, I'm also trying to slim it down so we have less code to maintain.
I agree. I think we should keep this.
I added a few reexports to save some typing.
I added a Preprocessor class with n-gram support. It ended up being a lot like
ml.Pipeline
:/At least it gives us a place to collect all of our preprocess logic.
It needs docs before I can put it up for review.
TODO