dsp-uga / elizabeth

Scalable malware detection
MIT License

Memory Performance #14

Closed cbarrick closed 6 years ago

cbarrick commented 6 years ago

The current preprocessor uses an extreme amount of memory when loading data.

This is a tracking issue for how we might get memory usage down.

cbarrick commented 6 years ago

Streaming is not the solution we were looking for. I'm repurposing the issue for general memory performance.

cbarrick commented 6 years ago

@zachdj made great progress in #15, at the cost of the API.

  1. Those changes remove the notion of order among the tokens, reducing us to a bag of words. This is fine for Naive Bayes but is bad for LSTMs.
  2. They require some funky workarounds involving dropping down to the RDD level when doing things like TF-IDF.

So far it's our best option.
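
To make the first trade-off concrete, here is a minimal sketch in plain Python (not the project's code) of why a bag of words suits a count-based model but not a sequence model. The token values and the helper name `to_bag_of_words` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def to_bag_of_words(tokens):
    """Collapse a token sequence into token counts, discarding order."""
    return Counter(tokens)


# Two hypothetical token sequences with the same tokens in a different order.
seq_a = ["push", "mov", "call"]
seq_b = ["call", "mov", "push"]

bag_a = to_bag_of_words(seq_a)
bag_b = to_bag_of_words(seq_b)

# The sequences differ, which is the information an LSTM would rely on.
print(seq_a == seq_b)
# The bags are identical, which is all a count-based Naive Bayes model sees.
print(bag_a == bag_b)
```

Once the data is stored as counts, the original order cannot be recovered, so any order-sensitive model would need the tokens kept in sequence somewhere upstream.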

cbarrick commented 6 years ago

It appears that the UDFs were a big part of the problem. The built-in RegexTokenizer seems to handle whole text files without a problem.
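
A rough sketch of the two approaches being compared, assuming a PySpark DataFrame with a string column. The column names `text` and `tokens`, the whitespace pattern, and all three function names are assumptions for illustration, not taken from the project. A Python UDF sends every row from the JVM to a Python worker and back, which is one plausible source of the extra memory, while `RegexTokenizer` runs inside the JVM.

```python
def python_tokenize(text):
    """Split on runs of whitespace in plain Python."""
    return text.split()


def tokenize_with_udf(df, input_col="text", output_col="tokens"):
    """Tokenize with a Python UDF; each row is serialized to a Python worker."""
    # Imported locally so the plain-Python helper above works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    tokenize = udf(python_tokenize, ArrayType(StringType()))
    return df.withColumn(output_col, tokenize(df[input_col]))


def tokenize_with_builtin(df, input_col="text", output_col="tokens"):
    """Tokenize with the built-in RegexTokenizer, which stays inside the JVM."""
    from pyspark.ml.feature import RegexTokenizer

    tokenizer = RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern=r"\s+",      # assumed delimiter: runs of whitespace
        gaps=True,           # the pattern matches separators, not tokens
        toLowercase=False,   # keep tokens as written
    )
    return tokenizer.transform(df)
```

Both functions return the input DataFrame with an added array-of-strings column, so one can be swapped for the other without touching downstream stages.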

zachdj commented 6 years ago

I'll see if I can run the small dataset locally.