I don't think this is a good change to the code. You should instead use the non-tokenized version of OPUS with the -p raw flag.
One note, however: the raw version of JW300 is also tokenized (which is mentioned in the paper as well). I assume that "\<whitespace>\<zero-width-whitespace>\<whitespace>" should be replaced by an empty string, while a plain "\<whitespace>" between tokens should stay a whitespace character.
Edit: That doesn't really seem to do the trick, all full-width punctuation is still surrounded by spaces.
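For reference, the replacement described above can be sketched roughly as follows. This is only an illustration: the function name `strip_zwsp_joins` is hypothetical, and it assumes the zero-width whitespace is U+200B and the surrounding whitespace is a plain ASCII space, which may not hold for every line of the corpus. It also shows the limitation mentioned in the edit, since a space before full-width punctuation is left untouched.

```python
# Hypothetical helper, not part of OPUS or its tools.
# Assumption: the zero-width whitespace is U+200B and the separators
# around it are plain ASCII spaces.
ZWSP = "\u200b"


def strip_zwsp_joins(line: str) -> str:
    """Remove every '<space><ZWSP><space>' sequence; keep other spaces."""
    return line.replace(f" {ZWSP} ", "")


if __name__ == "__main__":
    sample = f"これ {ZWSP} は {ZWSP} 文 。"
    # The tokens joined by ZWSP are merged, but the space before the
    # full-width full stop remains.
    print(strip_zwsp_joins(sample))
```

Plain spaces that are not adjacent to a zero-width whitespace are deliberately left alone, so this does not by itself detokenize punctuation.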
True. That is a problem with that dataset. It was delivered in a tokenized version, and detokenization didn't really work. We should fix that somehow in future releases. It is, indeed, a bit annoying that I don't have raw, untokenized data for that corpus.
Noticed that whitespace was introduced on the Japanese side when running the following command: