Closed: cbarrick closed this issue 6 years ago
The only concern I have is that we may want to do the splitting in different ways for different models. For example, I could see us wanting to do n-gram splitting for naive Bayes, which might be less convenient if tokenization were wrapped in with data loading.
Spark provides an n-gram transformer that uses pre-tokenized text as input. https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.NGram
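For reference, a minimal sketch of how that transformer is used; the DataFrame contents and column names here are hypothetical, but `NGram` does expect an array-of-strings column, i.e. pre-tokenized text:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import NGram

spark = SparkSession.builder.getOrCreate()

# Hypothetical pre-tokenized input; NGram reads an array-of-strings column.
df = spark.createDataFrame(
    [(0, ["push", "eax", "mov", "ebp", "esp"])],
    ["id", "tokens"],
)

# Produce bigrams from the token column.
ngram = NGram(n=2, inputCol="tokens", outputCol="ngrams")
ngram.transform(df).select("ngrams").show(truncate=False)
# -> [push eax, eax mov, mov ebp, ebp esp]
```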
Oh yay! In that case, no objections.
I think we'll always do tokenization, so it makes sense to fold the tokenization functions (`split_bytes` and `split_asm`) into the `load_data` function. This change would turn data loading from a two-line operation into a one-liner.
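Roughly what I have in mind, as a sketch only; I'm assuming `split_bytes` and `split_asm` take and return a DataFrame, and the real signatures in this repo may differ:

```python
def load_data(spark, path, kind="bytes"):
    """Load the dataset and tokenize it in one step (hypothetical)."""
    df = spark.read.text(path)
    # Pick the tokenizer based on which view of the data we're loading.
    tokenize = split_bytes if kind == "bytes" else split_asm
    return tokenize(df)

# Before (two lines):
#   df = load_data(spark, path)
#   df = split_bytes(df)
# After (one line):
#   df = load_data(spark, path, kind="bytes")
```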
I'll work on this tomorrow unless y'all object.