caffeine96 closed this issue 2 years ago
Can you share the segmentation outputs for this example (as well as the Gujarati example) that you shared over mail? Please share the text (not the images).
import transformers

# Instead of this:
# tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))  # returns True with the line above

# use this:
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))  # returns False
^ Use this snippet to initialize the tokenizer so that accents/diacritics are preserved.
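As an additional sanity check (just a sketch, reusing the same model id as above), the diacritics should also survive an encode/decode round trip once keep_accents=True is set:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)

# Round-trip check: encode, decode, then look for the candrabindu.
ids = tokenizer.encode("यहाँ")
decoded = tokenizer.decode(ids, skip_special_tokens=True)
print("ँ" in decoded)  # expected to be True when accents are preserved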
This is explained in this issue: https://github.com/AI4Bharat/indic-bert/issues/26 (there is also a note about this in our README, in case you missed it).
Please let us know if this works.
Thanks for pointing that out. That solves the issues with both Hindi and Gujarati.
I was recently working with the IndicBERT SentencePiece tokenizer and noticed something I was curious about. It turns out that when sentences are encoded, a good number of diacritics are not encoded. For example, in Hindi, the sentences "मेंने उसकी गेंद दी।" and "मैने उसको गेंद दी।" have the same encodings, even though one has the genitive marker and the other the dative marker. I have seen this for both Gujarati and Hindi. The reason I think the diacritics are ignored is that when the encodings are decoded, some diacritics are missing.
I was curious to know why this happens and whether there is a workaround.
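A minimal sketch to reproduce the observation above (default tokenizer initialization, i.e. without keep_accents; exact outputs may vary by transformers version):

import transformers

# Default initialization, which is where the missing diacritics were observed.
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

sent_a = "मेंने उसकी गेंद दी।"  # genitive marker (उसकी)
sent_b = "मैने उसको गेंद दी।"  # dative marker (उसको)

# Reported to compare equal with the default settings.
print(tokenizer.encode(sent_a) == tokenizer.encode(sent_b))

# Decoding the ids shows which diacritics were dropped.
print(tokenizer.decode(tokenizer.encode(sent_a), skip_special_tokens=True))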