Closed cbarrick closed 6 years ago
I can give this a run later this eve. Can you run the test set locally or no?
I'm probably going to create a new PR for the third TODO. Might as well go ahead and write a full-blown driver.
We've already done a fair amount of testing, but we should double-check that it runs as-is with this command:
$ ./scripts/submit.sh nb \
    gs://my_bucket/X_tiny_train.txt \
    gs://my_bucket/y_tiny_train.txt \
    gs://my_bucket/X_tiny_test.txt
or locally with:
$ python -m elizabeth nb \
    --base ./data \
    ./data/X_tiny_train.txt \
    ./data/y_tiny_train.txt \
    ./data/X_tiny_test.txt
Local testing might run out of memory. GCP should not.
This is a simple naive Bayes implementation based on pyspark.ml. The entry point is super basic and needs further improvements. But those improvements are not specific to Naive Bayes, so I'm OK leaving those for a future PR. The way it is now is convenient for testing.
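For reference, here is a rough sketch of the command-line shape that the local invocation implies: an `nb` subcommand, an optional `--base`, and three positional paths. This is not the actual `elizabeth` entry point. The function name `build_parser`, the argument names `train_x`, `train_y`, and `test_x`, and the help strings are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a hypothetical parser mirroring `python -m elizabeth nb ...`."""
    parser = argparse.ArgumentParser(prog="elizabeth")
    # One subcommand per model; only `nb` appears in this PR.
    subcommands = parser.add_subparsers(dest="command", required=True)

    nb = subcommands.add_parser("nb", help="naive Bayes")
    # The exact meaning of --base is assumed: a base path for the data.
    nb.add_argument("--base", default=None, help="base path for the data (assumed)")
    # Positional names are illustrative, not taken from the real code.
    nb.add_argument("train_x", help="training inputs")
    nb.add_argument("train_y", help="training labels")
    nb.add_argument("test_x", help="test inputs")
    return parser


# Parse the same arguments as the local command shown earlier.
args = build_parser().parse_args([
    "nb",
    "--base", "./data",
    "./data/X_tiny_train.txt",
    "./data/y_tiny_train.txt",
    "./data/X_tiny_test.txt",
])
print(args.command, args.base, args.train_x, args.train_y, args.test_x)
```

Using subparsers keeps the shared options in one place, so a future driver could add other models next to `nb` without touching the naive Bayes arguments.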
TODO: