Minimal but functional text preprocessing pipeline
Runs as its own script for now
Stemmers/lemmatizers are now transformer classes for ease of reuse
Lots of other refactoring/simplification
Unused text-cleaning functions removed for the moment
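For reference, the "stemmers/lemmatizers as transformer classes" idea can be sketched roughly as below. This is a minimal illustration, not the code from this change: the class name StemTransformer and its stem_fn parameter are hypothetical, and it assumes the scikit-learn-style fit/transform convention with input that is already tokenised.

```python
class StemTransformer:
    """Hypothetical wrapper: applies a stemming callable to tokenised documents."""

    def __init__(self, stem_fn):
        # stem_fn: any callable mapping one token (str) to its stem (str)
        self.stem_fn = stem_fn

    def fit(self, X, y=None):
        # Stateless: nothing to learn, so just return self for chaining
        return self

    def transform(self, X):
        # X is an iterable of token lists; returns a new list of stemmed token lists
        return [[self.stem_fn(tok) for tok in doc] for doc in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

Because the stemmer is injected as a plain callable, the same class could wrap different backends, for example StemTransformer(nltk.stem.PorterStemmer().stem), without any change to the pipeline code.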
To do:
Proper docstrings
Incorporate the coding exceptions functions somehow
Finish rewriting and simplifying the word split functions (in progress)
Incorporate the string cleaning functions into a "cleaner" class, unless that is already handled by the nltk tokenising functions we use everywhere
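One possible shape for the "cleaner" class, as a rough sketch only. The name TextCleaner, the steps parameter, and the default steps are all assumptions made for illustration; the actual string cleaning functions from this project are not shown here and would be passed in as steps.

```python
import re


class TextCleaner:
    """Hypothetical cleaner: runs a sequence of str -> str functions over each document."""

    def __init__(self, steps=None):
        # steps: ordered list of callables, each taking and returning a string.
        # The defaults below are placeholders, not the project's real cleaning functions.
        if steps is None:
            steps = [str.strip, self.collapse_whitespace]
        self.steps = list(steps)

    @staticmethod
    def collapse_whitespace(text):
        # Replace any run of whitespace (spaces, tabs, newlines) with one space
        return re.sub(r"\s+", " ", text)

    def fit(self, X, y=None):
        # Stateless, kept only for compatibility with fit/transform pipelines
        return self

    def transform(self, X):
        cleaned = []
        for doc in X:
            for step in self.steps:
                doc = step(doc)
            cleaned.append(doc)
        return cleaned
```

Keeping the steps as a plain list makes it easy to drop any step that turns out to duplicate what the tokeniser already does.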
To do: