dsp-uga / elizabeth

Scalable malware detection

Scalable nb #15

Closed zachdj closed 6 years ago

zachdj commented 6 years ago

The current code for naive bayes works on whole files. I think this is preventing Spark from creating suitably small partitions, which is blowing up both memory consumption and runtime.

I have modified our naive bayes code to work on files that are split into lines. Based on my experiments running locally, this substantially reduces memory usage (from 12 GB down to ~1.5 GB) and runtime on a tiny test dataset. Incidentally, the code also works when given whole files.
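For context, the core of the change is the difference between these two loading strategies (a rough illustration; the path is made up):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Whole-file loading: one (filename, contents) record per file, so a
# large file becomes a single record that Spark cannot split further.
whole = sc.wholeTextFiles('data/bytes/*.bytes')

# Line-based loading: one record per line, so Spark can pack lines
# into many small partitions and keep per-task memory low.
lines = sc.textFile('data/bytes/*.bytes')
```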

I've added a function to the preprocessor that loads data from a manifest file into a DataFrame of tokenized lines. I combined data loading and tokenization into one step, as discussed in #13. I also added a function to core that computes the sum of two SparseVectors.
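The vector-sum helper is essentially this (a minimal sketch; the actual name and implementation in core may differ):

```python
from pyspark.ml.linalg import SparseVector

def sparse_add(a, b):
    """Element-wise sum of two SparseVectors of the same length."""
    assert a.size == b.size
    totals = dict(zip(a.indices.tolist(), a.values.tolist()))
    for i, v in zip(b.indices.tolist(), b.values.tolist()):
        totals[i] = totals.get(i, 0.0) + v
    return SparseVector(a.size, totals)
```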

More work is required to compute TF-IDF scores in naive_bayes, since we have to compute term frequencies for each line and then sum over all the lines in a document. I was unable to find an efficient way to mimic reducing by key using DataFrames, so I had to revert to RDD operations when summing frequency vectors.
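Concretely, the summing step looks something like this (column names are illustrative, and sparse_add is the helper from core):

```python
# `tf` is a DataFrame with one row per line and columns (doc_id, features).
doc_tf = (
    tf.rdd
      .map(lambda row: (row['doc_id'], row['features']))
      .reduceByKey(sparse_add)       # sum the line vectors per document
      .toDF(['doc_id', 'features'])  # back to a DataFrame for TF-IDF
)
```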

This probably isn't production-ready yet, but I want to get the PR started and solicit feedback.

zachdj commented 6 years ago

I should note that I tried doing the reduceByKey operation within the DataFrame API, but found it very difficult to accomplish without extremely expensive joins. If someone can suggest a more DataFrame-y alternative to what I'm doing, that might be an improvement from an engineering POV.

whusym commented 6 years ago

I don't know how expensive it would be, but what about using VectorSlicer to slice the elements of the vector into separate columns, and then, after we add them all together, using VectorAssembler to assemble those elements back into a vector on each row?

Or maybe using a user-defined function?
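Something like this untested sketch (column names are made up; I used a per-element UDF instead of VectorSlicer for the extraction, since VectorSlicer yields vector-typed columns that sum() can't aggregate):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler

n = 512  # current feature vector length

def element_at(i):
    # UDF that pulls element i of a vector out as a double.
    return F.udf(lambda v: float(v[i]), DoubleType())

# One scalar column per vector element.
for i in range(n):
    tf = tf.withColumn('f%d' % i, element_at(i)('features'))

# Plain numeric aggregation now works per document.
summed = tf.groupBy('doc_id').sum(*['f%d' % i for i in range(n)])

# Reassemble the summed scalars back into one vector per row.
assembler = VectorAssembler(
    inputCols=['sum(f%d)' % i for i in range(n)], outputCol='features')
doc_tf = assembler.transform(summed)
```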

zachdj commented 6 years ago

Hmm, that's an interesting idea. It might work for now, since we're dealing with small (length 512) feature vectors, but I'm skeptical about how well it would scale if we start using n-grams.

Even with just bigrams, it would involve creating more than 65,000 new columns in the DataFrame.

Feel free to give it a try if you'd like. However, if it results in a big performance hit, then I think we should just tolerate a bit of RDD ugliness.

Based on my (limited) knowledge, UDFs are helpful for map-like operations on a DataFrame, but I don't know if we can get reduce-like behavior with them.


zachdj commented 6 years ago

Apparently, it is possible to create user-defined aggregate functions, but you have to write the UDAF in Scala and then wrap it in Python. I think that would be some serious over-engineering.
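For the record, the Python side would look roughly like this (sketch only: the Scala class is hypothetical, and it needs Spark 2.3's registerJavaUDAF plus a JAR shipped with --jars):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Register a hypothetical Scala UDAF class under a SQL name.
spark.udf.registerJavaUDAF('vector_sum', 'com.example.VectorSum')

# The aggregation then stays in the DataFrame API.
doc_tf = tf.groupBy('doc_id').agg(F.expr('vector_sum(features)').alias('features'))
```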

cbarrick commented 6 years ago

Will streaming help?

I'd rather fix performance at the DataFrame layer than at the model layer.

We shouldn't need to drop down to RDDs for every model.

zachdj commented 6 years ago

@cbarrick My intuition on this is poor, so I'm not sure, but I think our problem is deeper than that. We should be able to train a model on 10 input files without using 12 GB of memory.

zachdj commented 6 years ago

This has been resolved in a superior way by #17.