eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

revamp default space tokenization - review the ((newline)) thing #33

Open vince62s opened 3 months ago

vince62s commented 3 months ago

We inherited space tokenization from onmt-py, which splits all text on the single space character " " (and only that character, whereas before onmt-py 3.4 it split on all Python whitespace).
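To illustrate the difference between the two splitting behaviours, here is a minimal sketch using only Python's built-in `str.split`. The sample string is made up for illustration and the snippet is not taken from the eole or onmt-py code.

```python
text = "hello  world\tfoo\nbar"

# Split on the single space character only: the double space yields an
# empty token, and the tab and newline stay inside a token.
space_only = text.split(" ")

# Split on any whitespace: runs of whitespace collapse, and tabs and
# newlines act as separators.
any_whitespace = text.split()

print(space_only)      # ['hello', '', 'world\tfoo\nbar']
print(any_whitespace)  # ['hello', 'world', 'foo', 'bar']
```

In neither case is the original whitespace recoverable from the token list alone once the tokens are joined back with a single space.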

However, it would be better to rely on the tokenizer to split the text into tokens.

This would make it easier to handle multiple spaces, multiple tabs, and line breaks (\r, \n, etc.).

It would require reviewing all transforms, because at the moment they receive a list of tokens (they should instead receive strings or streams).
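As a rough sketch of what that change could mean for a transform, the snippet below contrasts a token-list interface with a string interface. All names here (`lowercase_tokens`, `lowercase_text`, `toy_tokenize`) are hypothetical and are not part of the eole API; the regex tokenizer is only a stand-in for a real one.

```python
import re
from typing import List


def lowercase_tokens(tokens: List[str]) -> List[str]:
    """Token-list style: the text was already split on ' ' upstream."""
    return [tok.lower() for tok in tokens]


def lowercase_text(text: str) -> str:
    """String style: whitespace is left untouched for the tokenizer."""
    return text.lower()


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): keeps whitespace runs as tokens."""
    return re.findall(r"\s+|\S+", text)


raw = "Hello  World\nFoo"

# Current flow: split on " " first, then apply the transform.
old_style = lowercase_tokens(raw.split(" "))

# Proposed flow: transform the raw string, then let the tokenizer split.
new_style = toy_tokenize(lowercase_text(raw))

print(old_style)  # ['hello', '', 'world\nfoo']
print(new_style)  # ['hello', '  ', 'world', '\n', 'foo']
```

With the string interface, the double space and the newline reach the tokenizer intact, so how to treat them becomes the tokenizer's decision rather than a side effect of the initial split.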