StanfordHCI / termite

(development moved to new repos)

BSD 3-Clause "New" or "Revised" License

Small fixes to handle unicode characters #27

Closed elmer-garduno closed 10 years ago

elmer-garduno commented 10 years ago

This change fixes some problems with Unicode tokenization, in particular the regex used for UNICODE_TOKENIZATION and a cast to str() when writing the file.
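To illustrate the two kinds of problem described above, here is a minimal Python 3 sketch. It is not the code from this change: the pattern `TOKEN_PATTERN` and the helpers `tokenize` and `write_tokens` are hypothetical names chosen for the example, and the actual regex used by UNICODE_TOKENIZATION may differ. The sketch assumes tokens are runs of word characters and that output files should be UTF-8.

```python
import io
import re

# Hypothetical pattern (an assumption, not the project's regex): runs of word
# characters matched with Unicode semantics, so accented letters stay in tokens.
TOKEN_PATTERN = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Lowercase the text and return its Unicode word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def write_tokens(path, tokens):
    """Write one token per line, encoding explicitly as UTF-8.

    Opening the file with an explicit encoding avoids relying on an implicit
    byte-string cast, which is where non-ASCII tokens typically break.
    """
    with io.open(path, "w", encoding="utf-8") as handle:
        for token in tokens:
            handle.write(token + "\n")
```

For example, `tokenize("Café señor, niño!")` keeps the accented characters inside each token instead of splitting on them, and `write_tokens` round-trips those tokens through a file without an encoding error.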

It also fixes small typos in the CLI options.

This change also makes UNICODE_TOKENIZATION the default tokenizer.