Closed cbarrick closed 6 years ago
I should note that this changes the `??` behavior (#6). Now `??` tokens are dropped individually rather than the entire line being dropped. This may be unwanted, since bytes that don't actually occur next to each other will appear as if they do.
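The difference between the two behaviors can be sketched like this. This is a minimal illustration, not the PR's actual code; it assumes each line is a sequence of byte tokens with `??` marking unreadable bytes, and the function names are made up:

```python
# Sketch of the two '??' handling strategies. Assumes each line is a
# list of hex-byte tokens, with '??' marking bytes that could not be read.

def skip_tokens(lines):
    """New behavior: drop '??' tokens individually."""
    return [[tok for tok in line if tok != '??'] for line in lines]

def skip_lines(lines):
    """Old behavior (#6): drop any line containing a '??'."""
    return [line for line in lines if '??' not in line]

lines = [['0A', '??', '0B'], ['0C', '0D']]
# skip_tokens(lines) makes 0A and 0B adjacent even though they were not:
# [['0A', '0B'], ['0C', '0D']]
# skip_lines(lines) keeps only fully-readable lines:
# [['0C', '0D']]
```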
I added code to make IDF optional. Without IDF the predicted output for the tiny set seems reasonable. The labels are still wrong, but there is a 1-to-1 mapping between the predicted labels and true labels.
The mapping seems to be indices into the sorted set of true labels, starting at 0. Since the true labels start at 1, and the small set contains instances of every label, the mapping from predicted to true label is simply an off-by-one. I'll add a commit to change that. Though the tiny set will still give a low accuracy, because it does not contain instances of each label.
| True | Predicted |
|------|-----------|
| 6    | 2         |
| 3    | 1         |
| 1    | 0         |
| 7    | 3         |
| 1    | 0         |
| 1    | 0         |
| 6    | 2         |
| 3    | 1         |
| 3    | 1         |
| 7    | 3         |
I think I saw somewhere in the `pyspark.ml.NaiveBayes` docs that it assumes the class labels are integers beginning with 0. However, the class labels in our dataset begin at 1. That's probably why they're off by one.
Actually, that feels kind of hacky to me. We could use `StringIndexer` to assign numeric indices to the provided labels. It might be overkill for this project, since the labels are numeric anyway. What do you think?
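For reference, what `StringIndexer` plus `IndexToString` would buy us is a fit/transform/invert round trip over the labels. Here's a pure-Python sketch of that idea (not the Spark API itself; note that Spark's `StringIndexer` actually defaults to ordering by descending frequency, so you'd pass `stringOrderType='alphabetAsc'` to get the sorted behavior shown here):

```python
# Pure-Python sketch of the StringIndexer/IndexToString round trip:
# fit an index over the training labels, transform them to 0-based
# indices, and invert predictions back to the original labels on output.
class LabelIndexer:
    def fit(self, labels):
        # Indices assigned in sorted order of the distinct labels
        # (like StringIndexer with stringOrderType='alphabetAsc').
        self.labels_ = sorted(set(labels))
        self.index_ = {lab: i for i, lab in enumerate(self.labels_)}
        return self

    def transform(self, labels):
        return [self.index_[lab] for lab in labels]

    def inverse_transform(self, indices):
        return [self.labels_[i] for i in indices]

idx = LabelIndexer().fit([6, 3, 1, 7])
assert idx.transform([1, 3, 6, 7]) == [0, 1, 2, 3]
assert idx.inverse_transform([2, 1, 0, 3]) == [6, 3, 1, 7]
```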
I don't think the use of `StringIndexer` should block this PR, though I wouldn't object to a future PR adding it. But you're right that it's probably overkill for this project.
We should probably be fixing the labels on the preprocessor side, though. I'll quickly move that logic over there. It's just one line :p
Nevermind. We need to fix it on the output side to appease AutoLab.
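The output-side fix is just shifting the 0-based predictions back to 1-based labels before printing. A minimal sketch (variable names are assumptions, not the PR's code):

```python
# Output-side fix: shift the 0-based predictions from NaiveBayes back
# to 1-based labels just before printing, so the submitted file matches
# the labels AutoLab expects.
predictions = [2, 1, 0, 3]               # 0-based indices from the model
shifted = [p + 1 for p in predictions]   # 1-based labels: [3, 2, 1, 4]
for label in shifted:
    print(label)
```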
This seems to solve our performance problems without forcing us to drop down to RDDs. This is the first time I've been able to run a fit locally! This also removes way more code than it adds :)
When testing against the tiny set, all of the outputs are class 0. The problem is that class 0 never appears in the actual labels. So Spark is clearly relabeling the classes. If it's just turning class labels into indices in sorted order, we'll be fine. We should test on the small set to be sure.
This PR also implements proper output for Naive Bayes. If you call it with 4 arguments (`train_x`, `train_y`, `test_x`, and `test_y`) it will print an accuracy score. If you call it with 3 arguments (i.e. without `test_y`) it will print predicted labels.

Fixes #14