sweta20 closed this issue 2 years ago
It looks like this is related to this bug in sacremoses. The bug appears to have been fixed, but we may need to test from the main branch, since no new release of sacremoses has been cut since the fix landed.
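Until a new release is cut, one way to pick up the fix is to install sacremoses directly from its main branch (the repository URL below is an assumption; adjust if the project is hosted elsewhere):

```shell
# Install the unreleased main branch of sacremoses from git
pip install --upgrade "git+https://github.com/alvations/sacremoses.git"
```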
Using main:

```python
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='hi')
>>> md = MosesDetokenizer(lang='hi')
>>> s = "क्या आपको लगता है कि उनकी गोलीबारी जायज थी क्योंकि उन्होंने पहले माफी मांग ली थी?"
>>> # Test roundtrip
>>> md.detokenize(mt.tokenize(s)) == s
True
```
It seems there is some compensation we'll need to apply to detokenize Hindi correctly, so hopefully the task organizers can help us craft a Perl script (or similar) to do this, unless they can train a new baseline using the main branch of sacremoses 😅
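A minimal sketch of such a compensation step, using only the standard library. The exact rules are assumptions; a real fix would need to mirror whatever the older sacremoses release actually got wrong:

```python
import re

def fix_hindi_detok(text):
    """Hypothetical post-processing for text detokenized with an older
    sacremoses release: drop stray spaces before punctuation and the
    Devanagari danda (।), which Hindi attaches to the preceding word."""
    return re.sub(r"\s+([?!.,;:।])", r"\1", text)

print(fix_hindi_detok("क्या उनकी गोलीबारी जायज थी ?"))
```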
Thanks for flagging! We plan to train a new En-Hi baseline model using the updated main branch of sacremoses and will release it once it's done.
For En-Hi, the provided baseline model generates:
instead of
Is there a correct way to detokenize (1) to generate (2)? Also, what pre-processing was used when training the baseline models?
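For context, Moses-style training recipes typically normalize punctuation before tokenizing (sacremoses exposes this as `MosesPunctNormalizer`); the actual baseline recipe is unknown, but a rough stdlib sketch of that normalization step looks like:

```python
import re

def normalize_punct(text):
    # Sketch of a Moses-style punctuation normalization pass (a small
    # subset of what normalize-punctuation.perl / MosesPunctNormalizer do):
    replacements = [
        ("\u201c", '"'), ("\u201d", '"'),   # curly double quotes -> ASCII
        ("\u2018", "'"), ("\u2019", "'"),   # curly single quotes -> ASCII
        ("\u2013", "-"), ("\u2014", "-"),   # en/em dashes -> hyphen
    ]
    for src, tgt in replacements:
        text = text.replace(src, tgt)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize_punct("\u201cनमस्ते\u201d  \u2014  hello"))
```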