Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Improve handling whitespace in Jieba and MeCab tokenization #50

Closed svirpioj closed 2 years ago

svirpioj commented 2 years ago

Add a new option to map the original space characters before tokenization (and back after detokenization) in order to keep track of them.
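For illustration only, the idea can be sketched roughly as below. This is not the OpusFilter implementation: the placeholder character, the function names, and the `toy_tokenize` stand-in (which drops whitespace, as a simplified model of what a segmenter may do) are all hypothetical and chosen only to keep the example self-contained.

```python
# Hypothetical placeholder; any character that does not occur in the data would do.
PLACEHOLDER = "\u2581"


def toy_tokenize(text):
    """Stand-in for a word segmenter: discards whitespace, emits one token per character."""
    return [ch for ch in text if not ch.isspace()]


def tokenize_keep_spaces(text, tokenizer=toy_tokenize, placeholder=PLACEHOLDER):
    """Map original spaces to the placeholder so they survive tokenization."""
    return tokenizer(text.replace(" ", placeholder))


def detokenize_restore_spaces(tokens, placeholder=PLACEHOLDER):
    """Join tokens and map the placeholder back to the original spaces."""
    return "".join(tokens).replace(placeholder, " ")


original = "你好 world"
# Without the mapping, the original space is lost in the round trip.
lossy = "".join(toy_tokenize(original))
# With the mapping, the space is carried through as a placeholder token.
restored = detokenize_restore_spaces(tokenize_keep_spaces(original))
```

In this sketch `lossy` no longer contains the space, while `restored` equals the input string. A real option would also need to handle inputs that already contain the placeholder character.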