DRY NN Tokenizers

BILSTMC3GClassifier and VocabularyDumper use Lucene tokenizers (via DataUtilities), while LSTMClassifier uses dl4j tokenizers.

Lucene uses the Unicode Text Segmentation algorithm (http://unicode.org/reports/tr29/)
dl4j: StringCleaning.stripPunct(token).toLowerCase();
Stanford's CoreNLP tokenizer seems the most similar to nltk's, which is the tokenizer used when training BioSentVec.

DRY: all classifiers should share a single tokenizer.

Probably choose the one with the highest coverage rate in BioSentVec (this has to be checked against the .vec file).
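
A minimal sketch of how that coverage check could look, using only the JDK. It assumes the .vec file is in the usual word2vec text layout (a header line, then one `word v1 v2 ...` line per entry), which has not been verified against the actual BioSentVec file. The class name `TokenizerCoverage` and the two candidate tokenizers are hypothetical stand-ins, not the Lucene, dl4j or CoreNLP implementations; the real candidates would be plugged in as a `Function<String, List<String>>`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

class TokenizerCoverage {

    /** Collects the first column of every line after the (assumed) header line. */
    static Set<String> readVocabulary(BufferedReader reader) {
        return reader.lines()
                .skip(1) // assumption: first line is "<word count> <dimensions>"
                .filter(line -> line.indexOf(' ') > 0)
                .map(line -> line.substring(0, line.indexOf(' ')))
                .collect(Collectors.toSet());
    }

    /** Fraction of the tokens produced by the tokenizer that occur in the vocabulary. */
    static double coverage(Function<String, List<String>> tokenizer,
                           List<String> texts, Set<String> vocabulary) {
        long total = 0;
        long covered = 0;
        for (String text : texts) {
            for (String token : tokenizer.apply(text)) {
                total++;
                if (vocabulary.contains(token)) {
                    covered++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) covered / total;
    }

    /** Hypothetical baseline candidate: split on whitespace only. */
    static List<String> whitespaceOnly(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    /** Hypothetical candidate: whitespace split, ASCII punctuation removed, lowercased. */
    static List<String> lowercasedNoPunct(String text) {
        return whitespaceOnly(text).stream()
                .map(token -> token.replaceAll("\\p{Punct}", "").toLowerCase())
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: TokenizerCoverage <file.vec> <corpus.txt>");
            return;
        }
        Set<String> vocabulary;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            vocabulary = readVocabulary(reader);
        }
        List<String> texts = Files.readAllLines(Paths.get(args[1]));
        System.out.println("whitespace only:     "
                + coverage(TokenizerCoverage::whitespaceOnly, texts, vocabulary));
        System.out.println("lowercased, no punct: "
                + coverage(TokenizerCoverage::lowercasedNoPunct, texts, vocabulary));
    }
}
```

This counts coverage per token occurrence; counting per distinct token type instead would weight rare words more heavily and may rank the tokenizers differently.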

bst-mug / n2c2