Tokenization doesn't preserve diacritics

AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For the latest Indic-BERT v2, see: https://github.com/AI4Bharat/IndicBERT

https://indicnlp.ai4bharat.org

MIT License


Tokenization doesn't preserve diacritics #40

Closed: caffeine96 closed this issue 2 years ago

caffeine96 commented 2 years ago

I was recently working with the IndicBERT SentencePiece tokenizer and noticed something I was curious about. When sentences are encoded, many diacritics are not encoded. For example, in Hindi, the sentences "मेंने उसकी गेंद दी।" and "मैने उसको गेंद दी।" have the same encodings, even though one has the genitive marker and the other the dative. I have seen this for both Hindi and Gujarati. I think the diacritics are being ignored because some of them are missing when the encodings are decoded back to text.

I was curious to know why this happens and if there is a work-around.

anoopkunchukuttan commented 2 years ago

Can you share the segmentation outputs for this example, as well as for the Gujarati example you shared over mail? Please share them as text, not images.
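
One way to produce such text output is sketched below. `dump_segmentation` is a hypothetical helper (not part of any library), and it assumes only that the tokenizer exposes a callable mapping a string to a list of tokens, such as `tokenizer.tokenize`.

```python
from typing import Callable, Iterable, List


def dump_segmentation(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> str:
    """Return one 'sentence<TAB>tok1 tok2 ...' line per input sentence."""
    lines = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        lines.append(f"{sentence}\t{' '.join(tokens)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # str.split is only a stand-in so the sketch runs without downloading a model;
    # pass tokenizer.tokenize from the loaded IndicBERT tokenizer instead.
    print(dump_segmentation(["मैने उसको गेंद दी।"], str.split))
```

The resulting lines can be pasted directly into a comment, which makes missing diacritics easy to spot next to the original sentence.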

gowtham1997 commented 2 years ago

import transformers

# Loading the tokenizer with default settings strips accents/diacritics:
# tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))  # returns True with the line above

# Pass keep_accents=True to preserve them instead:
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))  # returns False

Use the snippet above to initialize the tokenizer so that it preserves accents and diacritics.

This is explained in issue https://github.com/AI4Bharat/indic-bert/issues/26. (There is also a note about this in our README, in case you missed it.)

Please let us know if this works.
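
As for why the two words collapse to the same tokens, a likely explanation (an assumption, not something confirmed in this thread) is that the default accent stripping drops Unicode combining marks, and Devanagari vowel signs and the candrabindu belong to that class. The sketch below uses only the Python standard library to illustrate the effect; `strip_marks` is a hypothetical helper, not the tokenizer's actual code.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every character whose Unicode general category is a mark (Mn, Mc, Me).

    Assumption: this approximates what accent stripping does; it is not the
    tokenizer's own implementation.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


for word in ("यहाँ", "यह"):
    # Show each code point with its category so the removed marks are visible.
    print([(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in word])

# Compare the two words once the marks are removed.
print(strip_marks("यहाँ") == strip_marks("यह"))
```

Under that assumption, both words reduce to the same base letters, which would be consistent with the identical tokenizations seen without `keep_accents=True`.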

caffeine96 commented 2 years ago

Thanks for pointing that out. That solves the issues with both Hindi and Gujarati.