Switch to pyspark.ml.feature.RegexTokenizer

dsp-uga / elizabeth

Scalable malware detection

MIT License

Switch to pyspark.ml.feature.RegexTokenizer #17

Closed cbarrick closed 6 years ago

cbarrick commented 6 years ago

This seems to solve our performance problems without forcing us to drop down to RDDs. This is the first time I've been able to run a fit locally! This also removes way more code than it adds :)

When testing against the tiny set, all of the outputs are class 0. The problem is that class 0 never appears in the actual labels. So Spark is clearly relabeling the classes. If it's just turning class labels into indices in sorted order, we'll be fine. We should test on the small set to be sure.

This PR also implements proper output for Naive Bayes. If you call it with 4 arguments (train_x, train_y, test_x, and test_y) it will print an accuracy score. If you call it with 3 arguments (i.e. without test_y) it will print predicted labels.

Fixes #14

cbarrick commented 6 years ago

I should note that this changes the '??' behavior (#6). Now each '??' token is ignored individually, rather than the entire line being dropped. This may be unwanted, since bytes that are not actually adjacent will appear as if they are.
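To make the concern concrete, here is a minimal plain-Python sketch of the two '??' policies. It is not the PR's code: the line format, the pattern, and the helper names are assumptions for illustration, and it uses Python's `re` module rather than `RegexTokenizer` itself.

```python
import re

# Assumed token pattern: exactly two hex digits standing alone.
# '??' placeholders never match it, so they are simply skipped.
BYTE = re.compile(r"\b[0-9A-Fa-f]{2}\b")


def tokenize_per_token(line):
    """New behavior: drop each '??' individually, keep the other bytes."""
    return BYTE.findall(line)


def tokenize_per_line(line):
    """Old behavior: drop the whole line if it contains any '??'."""
    return [] if "??" in line else BYTE.findall(line)


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical line of bytes with two unknown bytes in the middle.
line = "56 8D ?? ?? 44 24"

print(tokenize_per_token(line))           # ['56', '8D', '44', '24']
print(bigrams(tokenize_per_token(line)))  # includes ('8D', '44')
print(tokenize_per_line(line))            # []
```

The pair `('8D', '44')` is the artifact in question: those two bytes are separated by two unknown bytes in the input, but they look adjacent once the '??' tokens are dropped. This only matters if features are built from adjacent tokens, such as n-grams.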

cbarrick commented 6 years ago

I added code to make IDF optional. Without IDF the predicted output for the tiny set seems reasonable. The labels are still wrong, but there is a 1-to-1 mapping between the predicted labels and true labels.

The mapping seems to be indices for the sorted set of true labels, starting at 0. Since the true labels start at 1, and the small set contains instances of every label, the mapping from predicted to true label is simply an off-by-one. I'll add a commit to change that. The tiny set will still give a low accuracy, though, because it does not contain instances of every label.

True    Predicted
6       2
3       1
1       0
7       3
1       0
1       0
6       2
3       1
3       1
7       3
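The table above can be checked against the sorted-index hypothesis in a few lines of plain Python. This is only a sanity check on the ten rows shown, not the PR's code, and the variable names are made up for illustration.

```python
# The ten (true, predicted) rows from the table above.
true_labels = [6, 3, 1, 7, 1, 1, 6, 3, 3, 7]
predicted = [2, 1, 0, 3, 0, 0, 2, 1, 1, 3]

# Hypothesis: each label is replaced by its index in the sorted set of labels.
index_of = {label: i for i, label in enumerate(sorted(set(true_labels)))}
reproduced = [index_of[label] for label in true_labels]
print(reproduced == predicted)  # True

# The planned fix: add one to every prediction.
off_by_one = [p + 1 for p in predicted]
matches = sum(t == p for t, p in zip(true_labels, off_by_one))
print(matches, "of", len(true_labels))  # 3 of 10
```

On these rows only the class 1 samples come out right after adding one, since the tiny set contains just the labels 1, 3, 6, and 7. On a set containing every label from 1 upward, the sorted index of each label is the label minus one, so the same fix would be exact.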

zachdj commented 6 years ago

I think I saw somewhere in the pyspark.ml.classification.NaiveBayes docs that it assumes the class labels are integers beginning with 0. However, the class labels in our dataset begin at 1. That's probably why they're off by one.

zachdj commented 6 years ago

Actually that feels kind of hacky to me. We could use StringIndexer to assign numeric indices to the provided labels. It might be overkill for this project, since the labels are numeric anyway. What do you think?
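The idea of an explicit label index, instead of a hard-coded off-by-one, can be sketched in plain Python. This is a stand-in for the fit / transform / inverse round trip, not the pyspark API: `LabelIndexer` and its methods are hypothetical names. It orders labels by sorted value to match the mapping observed in this thread, whereas `StringIndexer`, as far as I know, orders by descending frequency by default.

```python
class LabelIndexer:
    """Hypothetical helper: map arbitrary labels to 0..k-1 and back."""

    def fit(self, labels):
        # Sorted order is an assumption chosen to match the table above.
        self.labels_ = sorted(set(labels))
        self.index_ = {label: i for i, label in enumerate(self.labels_)}
        return self

    def transform(self, labels):
        """Original labels -> contiguous indices."""
        return [self.index_[label] for label in labels]

    def inverse(self, indices):
        """Contiguous indices -> original labels."""
        return [self.labels_[i] for i in indices]


# Labels from the tiny set, which skips several classes.
train_y = [6, 3, 1, 7, 1, 1, 6, 3, 3, 7]
indexer = LabelIndexer().fit(train_y)

indexed = indexer.transform(train_y)
print(indexed)                   # [2, 1, 0, 3, 0, 0, 2, 1, 1, 3]
print(indexer.inverse(indexed))  # [6, 3, 1, 7, 1, 1, 6, 3, 3, 7]
```

Because the inverse mapping is learned from the data, it recovers the original labels even when some classes are missing, which the fixed off-by-one cannot do.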

cbarrick commented 6 years ago

I don't think the use of StringIndexer should block this PR, but I wouldn't object to a future PR. You're right that it's probably overkill for this project, though.

We should probably be fixing the labels on the preprocessor side, though. I'll quickly move that logic over there. It's just one line :p

cbarrick commented 6 years ago

Never mind. We need to fix it on the output side to appease AutoLab.