
dsp-uga / elizabeth

Scalable malware detection

MIT License


Naive bayes #12

Closed cbarrick closed 6 years ago

cbarrick commented 6 years ago

This is a simple naive Bayes implementation based on pyspark.ml.

The entry point is very basic and needs further improvement, but those improvements are not specific to naive Bayes, so I'm OK leaving them for a future PR. As it stands, it is convenient for testing.

TODO:

[x] Implement Naive Bayes with TF-IDF.
[x] Test on GCP.
[x] Add an entry point for the submit script.
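
For readers unfamiliar with pyspark.ml, a naive Bayes classifier over TF-IDF features could be wired up roughly as below. This is a minimal sketch, not the code in this PR: the function name `build_nb_pipeline`, the column names `text` and `label`, whitespace tokenization, and the feature count are all assumptions made for illustration.

```python
def build_nb_pipeline(text_col="text", label_col="label", num_features=1 << 16):
    """Return an unfitted Tokenizer -> HashingTF -> IDF -> NaiveBayes pipeline.

    Hypothetical helper. It needs an active SparkSession when called, and it
    assumes labels are numeric class indices (0.0, 1.0, ...).
    """
    # Imported lazily so this module still loads where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    # Split each document on whitespace into a list of tokens.
    tokenizer = Tokenizer(inputCol=text_col, outputCol="tokens")
    # Hash tokens into a fixed-size vector of term frequencies.
    tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=num_features)
    # Rescale term frequencies by inverse document frequency.
    idf = IDF(inputCol="tf", outputCol="features")
    # Multinomial naive Bayes accepts the non-negative TF-IDF weights.
    nb = NaiveBayes(
        featuresCol="features",
        labelCol=label_col,
        smoothing=1.0,
        modelType="multinomial",
    )
    return Pipeline(stages=[tokenizer, tf, idf, nb])
```

Under these assumptions, `build_nb_pipeline().fit(train_df)` returns a fitted model, and calling `transform(test_df)` on it adds a `prediction` column.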

zachdj commented 6 years ago

I can give this a run later this eve. Can you run the test set locally or no?

zachdj commented 6 years ago

I'm probably going to create a new PR for the third TODO. I might as well go ahead and write a full-blown driver.

cbarrick commented 6 years ago

We've already done a fair amount of testing, but we should double-check that it runs as-is with this command:

$ ./scripts/submit.sh nb \
    gs://my_bucket/X_tiny_train.txt \
    gs://my_bucket/y_tiny_train.txt \
    gs://my_bucket/X_tiny_test.txt

or locally with:

$ python -m elizabeth nb \
    --base ./data \
    ./data/X_tiny_train.txt \
    ./data/y_tiny_train.txt \
    ./data/X_tiny_test.txt

Local testing might run out of memory. GCP should not.
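
The local command above implies an `nb` subcommand that takes an optional `--base` flag and three positional paths. A minimal argparse sketch of such an entry point follows. It is not the project's actual driver: the argument names `x_train`, `y_train`, and `x_test` are hypothetical, and since the meaning of `--base` is not stated in this thread, it is only stored as a string.

```python
import argparse


def build_parser():
    """Build a parser shaped like `python -m elizabeth nb ...` (hypothetical)."""
    parser = argparse.ArgumentParser(prog="elizabeth")
    subcommands = parser.add_subparsers(dest="command", required=True)

    nb = subcommands.add_parser("nb", help="train and apply naive Bayes")
    # Optional here because the submit.sh invocation above omits it.
    nb.add_argument("--base", default=None, help="base path (semantics assumed)")
    nb.add_argument("x_train", help="path to the training documents")
    nb.add_argument("y_train", help="path to the training labels")
    nb.add_argument("x_test", help="path to the test documents")
    return parser


# Parse the same arguments as the local command, passed as an explicit list.
args = build_parser().parse_args([
    "nb",
    "--base", "./data",
    "./data/X_tiny_train.txt",
    "./data/y_tiny_train.txt",
    "./data/X_tiny_test.txt",
])
print(args.command, args.base, args.x_train, args.y_train, args.x_test)
```

Using subparsers leaves room for other model commands alongside `nb` without changing how the shared arguments are parsed.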