Closed cbarrick closed 6 years ago
The outputs of load_data and load_labels are now DataFrames. Each row holds the full text rather than one line of it.
The split_bytes and split_asm functions are now Spark User Defined Functions (UDFs), meaning we can use them in DataFrame expressions.
elizabeth.context is gone in favor of a new elizabeth.session.
Introduces a new @elizabeth.udf(dtype) decorator for defining UDFs.
Here's how I use them:
>>> import elizabeth
>>> data = elizabeth.preprocess.load_data('./data_tiny/X_tiny_train.txt', base='./data_tiny', kind='bytes')
>>> data
DataFrame[id: bigint, url: string, text: string]
>>> data = data.withColumn('bytes', elizabeth.preprocess.split_bytes(data.text))
>>> data
DataFrame[id: bigint, url: string, text: string, bytes: array<bigint>]
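For reviewers unfamiliar with Spark UDFs, a decorator like @elizabeth.udf(dtype) could plausibly be built on pyspark.sql.functions.udf. The following is a minimal sketch, not the implementation in this PR: the names udf, split_hex, and make_split_hex_udf are hypothetical, and the whitespace-separated hex input format is an assumption made only for illustration.

```python
def split_hex(text):
    """Parse whitespace-separated hex tokens into a list of ints.

    Hypothetical stand-in for split_bytes; the real input format may differ.
    """
    values = []
    for token in text.split():
        try:
            values.append(int(token, 16))
        except ValueError:
            # Assumed behavior: silently skip tokens that are not valid hex.
            continue
    return values


def udf(dtype):
    """Hypothetical equivalent of @elizabeth.udf(dtype)."""
    def decorator(fn):
        # Imported lazily so the plain function stays usable without Spark.
        from pyspark.sql.functions import udf as spark_udf
        return spark_udf(fn, dtype)
    return decorator


def make_split_hex_udf():
    """Wrap split_hex as a Spark UDF whose return type is array<bigint>."""
    from pyspark.sql.types import ArrayType, LongType
    return udf(ArrayType(LongType()))(split_hex)
```

With PySpark installed, the wrapped function could be used in a column expression in the same way as above, for example data.withColumn('bytes', make_split_hex_udf()(data.text)). Keeping the plain Python function separate from the wrapped UDF makes the parsing logic easy to unit test without a Spark session.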
I just noticed the docs for load_data and load_labels are out of date.