We inherited the onmt-py space tokenization, which splits all text on the space character " " (and only that one; before onmt-py 3.4 it split on all Python whitespace).
However, it would be better to rely on the tokenizer to split the text into tokens:
it would be easier to handle multiple spaces, multiple tabs, and linebreaks (\r, \n, etc.).
This would require reviewing all transforms, because at the moment they receive lists of tokens (they should instead receive strings or streams).
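To illustrate the difference (a minimal sketch with a hypothetical input string, not actual onmt-py code): splitting on the single space character produces empty tokens on double spaces and leaves tabs and linebreaks embedded inside tokens, whereas splitting on any whitespace run, as a tokenizer could, does not.

```python
raw = "Hello  world\tfoo\r\nbar"

# Legacy behaviour: split on the space character only.
legacy = raw.split(" ")
# -> ['Hello', '', 'world\tfoo\r\nbar']
#    empty token on the double space; tab and linebreak stay inside a token

# Splitting on any run of whitespace instead:
any_ws = raw.split()
# -> ['Hello', 'world', 'foo', 'bar']
```

This is why passing raw strings (or streams) through to the tokenizer, rather than pre-split token lists, handles multispaces, tabs, and linebreaks more robustly.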