dsp-uga / elizabeth

Scalable malware detection
MIT License

Switch to DataFrames #11

Closed cbarrick closed 6 years ago

cbarrick commented 6 years ago

load_data and load_labels now return DataFrames. Each row contains the full text instead of a single line.

The split_bytes and split_asm functions are now Spark User Defined Functions (UDFs), meaning we can use them in DataFrame expressions.

elizabeth.context is gone in favor of a new elizabeth.session.

Introduces a new @elizabeth.udf(dtype) decorator for defining UDFs.

Here's how I use them:

>>> import elizabeth
>>> data = elizabeth.preprocess.load_data('./data_tiny/X_tiny_train.txt', base='./data_tiny', kind='bytes')
>>> data
DataFrame[id: bigint, url: string, text: string]
>>> data = data.withColumn('bytes', elizabeth.preprocess.split_bytes(data.text))
>>> data
DataFrame[id: bigint, url: string, text: string, bytes: array<bigint>]
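
For readers who want a feel for what a byte-splitting function does before it is wrapped as a UDF, here is a minimal plain-Python sketch. It is not the project's split_bytes: the name split_bytes_sketch is hypothetical, and it assumes the text is whitespace-separated tokens where a byte is exactly two hex digits and anything else is skipped.

```python
import re

# Hypothetical helper, NOT elizabeth.preprocess.split_bytes.
# Assumption: a byte is a token of exactly two hex digits; every other
# token (placeholders, longer hex words, etc.) is ignored.
_HEX_BYTE = re.compile(r'[0-9A-Fa-f]{2}')


def split_bytes_sketch(text):
    """Return the two-digit hex tokens in `text` as a list of ints."""
    return [
        int(token, 16)
        for token in text.split()
        if _HEX_BYTE.fullmatch(token)
    ]
```

A function like this could presumably be turned into a Spark UDF with a return type of ArrayType(LongType()), which Spark displays as array<bigint>, matching the column type shown in the session above.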

cbarrick commented 6 years ago

I just noticed the docs for load_data and load_labels are out of date.