Automatically switch token delimiter for languages not using whitespace

Helsinki-NLP / OpusTools


Automatically switch token delimiter for languages not using whitespace #10

Status: Closed (pks closed this issue 4 years ago)

pks commented 4 years ago

Noticed that whitespace was introduced on the Japanese side when running the following command:

opus_read -d JW300 -v -s en -t ja -wm moses -w jw300.en jw300.ja

jorgtied commented 4 years ago

I don't think this would be a good change to the code. It would be better to use the non-tokenized version of OPUS via the -p raw flag.

pks commented 4 years ago

One note, however: the raw version of JW300 is also tokenized (as is mentioned in the paper). I assume that "<whitespace><zero-width-whitespace><whitespace>" should be replaced by an empty string, while a plain "<whitespace>" between tokens should remain a whitespace character.

Edit: That doesn't really seem to do the trick; all full-width punctuation is still surrounded by spaces.
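
For reference, the replacement rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it takes the zero-width whitespace to be U+200B, the function name is made up for this example, and the sample line is hypothetical rather than taken from JW300.

```python
# Assumption: the "zero-width whitespace" in the data is U+200B (ZERO WIDTH SPACE).
ZWSP = "\u200b"


def collapse_zwsp_boundaries(line: str) -> str:
    """Delete every '<space><ZWSP><space>' sequence; leave all other spaces as-is."""
    return line.replace(f" {ZWSP} ", "")


# Hypothetical sample line, not real JW300 data.
sample = f"これ {ZWSP} は {ZWSP} 文 。 Hello world"
result = collapse_zwsp_boundaries(sample)
print(repr(result))
```

On this sample the rule joins the Japanese tokens and keeps the space in "Hello world", but the space in front of the full-width "。" survives, since only a plain space (no zero-width character) precedes it. That matches the limitation noted in the edit above.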

jorgtied commented 4 years ago

True. That is a problem of that dataset. It was delivered in a tokenized version and detokenization didn't really work. We should fix that in future releases somehow. It is, indeed, a bit annoying that I don't have raw untokenized data for that corpus.