This change fixes some problems with Unicode tokenization, in particular the regex used for UNICODE_TOKENIZATION and a cast to str() when writing the file.
It also fixes small typos in the CLI options.
Finally, it makes UNICODE_TOKENIZATION the default tokenizer.
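
For reviewers who want a concrete picture of the two problem areas, here is a minimal sketch of a Unicode-aware regex tokenizer and of writing tokens to a file with an explicit encoding. It is an illustration only, not the code in this change: the pattern and the names TOKEN_RE, tokenize, and write_tokens are hypothetical, and it assumes Python 3.

```python
import io
import re

# Hypothetical pattern (not the one used by UNICODE_TOKENIZATION):
# a token is either a run of word characters or a single symbol that is
# neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def tokenize(text):
    """Split text into word and punctuation tokens, Unicode-aware."""
    return TOKEN_RE.findall(text)


def write_tokens(path, tokens):
    """Write tokens on one line, encoding explicitly as UTF-8.

    Opening the file with an explicit encoding avoids depending on an
    implicit text conversion when non-ASCII tokens are written.
    """
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(" ".join(tokens) + "\n")
```

With this sketch, accented and non-Latin words stay intact as single tokens, while punctuation is split off on its own.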