Switch to DataFrames

load_data and load_labels now return DataFrames. Each row contains the full text, rather than one row per line.

The split_bytes and split_asm functions are now Spark User Defined Functions (UDFs), meaning we can use them in DataFrame expressions.

elizabeth.context is gone in favor of a new elizabeth.session.

Introduces a new @elizabeth.udf(dtype) decorator for defining UDFs.

Here's how I use them:

>>> import elizabeth
>>> data = elizabeth.preprocess.load_data('./data_tiny/X_tiny_train.txt', base='./data_tiny', kind='bytes')
>>> data
DataFrame[id: bigint, url: string, text: string]
>>> data = data.withColumn('bytes', elizabeth.preprocess.split_bytes(data.text))
>>> data
DataFrame[id: bigint, url: string, text: string, bytes: array<bigint>]
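
Because each row now carries the full text, a UDF like split_bytes has to turn one multi-line string into one array of integers. The sketch below shows what such a function body might look like in plain Python. It is not the code from this change: the name parse_bytes_text is hypothetical, and the input format (an address token followed by two-digit hex bytes on each line, with '??' marking unknown bytes) is an assumption that this description does not confirm.

```python
def parse_bytes_text(text):
    """Convert the full text of a bytes file into a flat list of ints.

    Assumed format (not confirmed by the source): each line starts with
    an address token, followed by two-digit hex byte tokens. A '??'
    token stands for an unknown byte and is skipped.
    """
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the leading address token; blank lines yield no tokens.
        for token in tokens[1:]:
            if token == '??':
                continue
            values.append(int(token, 16))
    return values


# Hypothetical usage on a two-line sample in the assumed format.
sample = "00401000 56 8D 44\n00401010 ?? FF"
print(parse_bytes_text(sample))
```

A function of this shape is what the new @elizabeth.udf(dtype) decorator would presumably wrap, with dtype describing an array of integers to match the array<bigint> column above. The exact dtype argument is not shown here, so it is left out of the sketch.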

dsp-uga/elizabeth #11