amazon-science / contrastive-controlled-mt

Code and data for the IWSLT 2022 shared task on Formality Control for SLT

English-Hindi Tokenization for the baseline model #3

Closed · sweta20 closed 2 years ago

sweta20 commented 2 years ago

For En-Hi, the provided baseline model generates:

  (1) क ् या आपको लगता है कि उसकी गोलीबारी उचित थी क ् योंकि उसने पहले माफ ़ ी मांगी थी?

instead of

  (2) क्या आपको लगता है कि उनकी गोलीबारी जायज थी क्योंकि उन्होंने पहले माफी मांग ली थी?

Is there a correct way to detokenize (1) to generate (2)? Also, what preprocessing was used when training the baseline models?

erip commented 2 years ago

It looks like this is related to this bug in sacremoses. The bug has been fixed upstream, but a new release of sacremoses hasn't been cut since the fix, so we may need to test against the main branch.
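For anyone who wants to try this before a release is cut, installing straight from GitHub should work, e.g. pip install git+https://github.com/alvations/sacremoses.git (assuming the repository still lives under alvations).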

Using main:

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='hi')
>>> md = MosesDetokenizer(lang='hi')
>>> s = "क्या आपको लगता है कि उनकी गोलीबारी जायज थी क्योंकि उन्होंने पहले माफी मांग ली थी?"
>>> # Test roundtrip
>>> md.detokenize(mt.tokenize(s)) == s
True

It seems there's some compensation we'll need to apply to detokenize Hindi correctly, so hopefully the task organizers can either help us craft a Perl script (or similar) to do this, or train a new baseline using the main branch of sacremoses 😅
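
In the meantime, here's a minimal sketch of such a compensation step in Python rather than Perl. It assumes the only artifacts are the spaces the old tokenizer inserted around the virama (्, U+094D) and nukta (़, U+093C) combining marks visible in the example above; other combining marks might need the same treatment:

>>> import re
>>> def fix_hindi_detok(text):
...     # Rejoin Devanagari combining marks (virama U+094D, nukta U+093C)
...     # that the old tokenizer split off as separate tokens.
...     return re.sub(r" ?([\u093C\u094D]) ?", r"\1", text)
...
>>> fix_hindi_detok("क ् या आपको लगता है कि उसकी गोलीबारी उचित थी क ् योंकि उसने पहले माफ ़ ी मांगी थी?")
'क्या आपको लगता है कि उसकी गोलीबारी उचित थी क्योंकि उसने पहले माफ़ी मांगी थी?'

This only undoes the mark splitting; any other preprocessing differences (e.g. normalization) would still have to match whatever was used to train the baseline.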

bhsu22 commented 2 years ago

Thanks for flagging! We plan on training a new En-Hi baseline model using the updated main branch of sacremoses and will release it once it's done.

bhsu22 commented 2 years ago

We've released an updated En-Hi baseline model trained with the fixed sacremoses tokenizers. You can find the model in our releases here.