dsp-uga / elizabeth

Scalable malware detection
MIT License

Merge split_bytes and split_asm into load_data? #13

Closed cbarrick closed 6 years ago

cbarrick commented 6 years ago

I think we'll always do tokenization. So it makes sense to fold the tokenization functions (split_bytes and split_asm) into the load_data function.

This change would turn data loading from a two-line operation into a one-liner.

I'll work on this tomorrow unless y'all object.
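A minimal sketch of the proposed change, in plain Python. The signatures here are assumptions made for illustration and are not taken from the repository: `load_raw` is a hypothetical stand-in for the current loader, `split_bytes` is reduced to whitespace splitting, and the sample lines are made up.

```python
# Hypothetical sketch: none of these signatures come from the repository.

def split_bytes(line):
    """Stand-in tokenizer (assumption): split a line on whitespace."""
    return line.split()


def load_raw(lines):
    """Stand-in loader (assumption): strip each line and drop blank ones."""
    return [line.strip() for line in lines if line.strip()]


def load_data(lines, tokenizer=split_bytes):
    """Load and tokenize in a single call."""
    return [tokenizer(line) for line in load_raw(lines)]


# Made-up sample input, for illustration only.
sample = ["56 8D 44 24\n", "\n", "08 50 8B F1\n"]

# Before: loading and tokenizing are separate steps.
raw = load_raw(sample)
two_step = [split_bytes(line) for line in raw]

# After: one call does both.
one_step = load_data(sample)
```

Passing the tokenizer as a parameter is one possible way to keep the one-liner while still allowing a different splitting function to be swapped in.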

zachdj commented 6 years ago

The only concern I have is that we may want to do the splitting in different ways for different models. For example, I could see us wanting to do n-gram splitting for Naive Bayes, which might be less convenient if tokenization were wrapped in with data loading.

cbarrick commented 6 years ago

Spark provides an n-gram transformer that uses pre-tokenized text as input. https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.NGram
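To illustrate why pre-tokenized input is enough for n-grams, here is a rough plain-Python sketch of the idea. It does not call Spark. The `ngrams` helper is hypothetical and only imitates the documented behaviour of the transformer, which joins each run of n consecutive tokens with a space and yields nothing when there are fewer than n tokens.

```python
# Hypothetical helper, not Spark's implementation: it imitates the
# documented behaviour of an n-gram transformer on pre-tokenized input.

def ngrams(tokens, n=2):
    """Return space-joined n-grams built from consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up tokens, as they might come out of a tokenizing load step.
tokens = ["56", "8D", "44", "24"]
bigrams = ngrams(tokens, n=2)
```

Because the n-grams are derived from the token list, tokenizing during loading would not get in the way of a model that wants n-gram features.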

zachdj commented 6 years ago

Oh yay! In that case, no objections.