dsp-uga / elizabeth

Scalable malware detection
MIT License
0 stars 0 forks source link

A Preprocessor class #21

Closed cbarrick closed 6 years ago

cbarrick commented 6 years ago

I added a Preprocessor class with n-gram support. It ended up being a lot like ml.Pipeline :/

At least it gives us a place to collect all of our preprocess logic.

It needs docs before I can put it up for review.

TODO

whusym commented 6 years ago

What docs are you referring to? I think I can do documentation?

cbarrick commented 6 years ago

The Preprocess class just needs documentation about how it works. I should probably do that since I wrote the code. I may make some more tweaks anyway before I put it up for review.

zachdj commented 6 years ago

This is really Pipeliney :laughing: If you're going to support arbitrary estimators / transformers then Preprocess could be abused as a full pipeline.

What do you think about moving most of this logic to a Pipeline base class?

Preprocess would extend Pipeline and add the shortcut functions ngram, tf, idf, ... We'll probably wind up adding a Postprocess class as part of #18 and it could operate in a similar manner.

cbarrick commented 6 years ago

One problem is that this class is designed to manipulate a single 'features' column in-place. My thought was that this could help free memory sooner. This is fine for preprocess type estimators, but things like Naive Bayes spit out multiple columns.

Honestly, this may be overkill. It's getting so similar to ml.Pipeline that we may want to use that instead. I think the only way this can be useful over pipelines is to have a targeted purpose, like preprocessing. I don't know if pipelines allow us to drop intermediate features to save memory. I also don't know if dropping intermediate features helps.

Maybe we should just use pipelines. I'll test out that workflow.

cbarrick commented 6 years ago

I put this up for review. We need to think carefully if we want to use this or ml.Pipeline.

cbarrick commented 6 years ago

Whoops. I think I broke it in f953418. Will fix.

After looking at Pipelines for a bit, I think I prefer this API. Things like optional idf is more convoluted than it should be with Pipelines, but pretty easy with this.

I also like the split between the preprocess pipeline and the model itself. That's the point at which you want the most flexibility anyway. So embracing that split and focusing on preprocessing gives us a chance to make a more specialized and easy-to-use API.

As part of the fix, I'm also trying to slim it down so we have less code to maintain.

zachdj commented 6 years ago

I agree. I think we should keep this.

cbarrick commented 6 years ago

I added a few reexports to save some typing.