BengaliAI / taklaPreWork

"Takla" dataset Generation and Collection

Conversion between word and grapheme components #1

Open Reasat opened 3 years ago

Reasat commented 3 years ago

Write code that handles this. Also, add word-cleaning functions to remove implausible Unicode combinations.

mnansary commented 3 years ago

Where should these cleaning functions live, bhai? Should they be separate from the parser pipeline, or included in it?
I am not sure where to add these cleanup cases:

  • Words with a stray vowel sign: ক্রেডিেন্স --> ['ক্রে', 'ডি', 'ে', 'ন্স']. The standalone 'ে' is unwanted.
  • Words that have the same symbol repeated twice for no reason. I had seen a word "রুপের" which breaks as ["র","ূ","ূ","প","ে","র"]. This repeat "ূ","ূ" is unwanted.
  • Words that have 'ৃ' in between vowel/consonant diacritic (vds/cds) symbols. Imagine something like [,"ূ", 'ৃ', "ে", "র"].

@Reasat @sushmit0109
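
All three cases above share one pattern: two dependent vowel signs sitting next to each other. A minimal sketch of a detector for that pattern follows. The set `VOWEL_SIGNS` and the function name `find_adjacent_vowel_signs` are assumptions made for illustration, not part of the existing parser. The function only reports the pairs and does not decide which sign to keep.

```python
import unicodedata

# Assumption: the Bengali dependent vowel signs (kars) in U+09BE..U+09CC.
VOWEL_SIGNS = set(
    "\u09be\u09bf\u09c0\u09c1\u09c2\u09c3\u09c4\u09c7\u09c8\u09cb\u09cc"
)


def find_adjacent_vowel_signs(word):
    """Return (index, first_sign, second_sign) for each adjacent pair of vowel signs.

    Indices refer to the NFC-normalised form of the word.
    """
    # Normalise first so a decomposed o-kar (e-kar + aa-kar) becomes a single
    # code point and is not reported as a suspicious pair.
    word = unicodedata.normalize("NFC", word)
    hits = []
    for i in range(len(word) - 1):
        if word[i] in VOWEL_SIGNS and word[i + 1] in VOWEL_SIGNS:
            hits.append((i, word[i], word[i + 1]))
    return hits


# The first example from the list, written with escapes to avoid ambiguity.
credience = "\u0995\u09cd\u09b0\u09c7\u09a1\u09bf\u09c7\u09a8\u09cd\u09b8"
print(find_adjacent_vowel_signs(credience))
```

A word that yields a non-empty list could be routed to manual review or to a cleaning step, whichever the pipeline ends up using.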

Reasat commented 3 years ago

The first one is tough: should the i-kar stay, or the e-kar?

The second one is obvious: remove the duplicate diacritic. But did it fail and break at "প","ে" too? That's weird.

The third: what word is that, and does removing the 'ৃ' fix it?

The clean function could be a separate one in the utils folder.
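
A minimal sketch of what such a standalone cleaner might look like for the duplicate-diacritic case. The name `collapse_repeated_vowel_signs` and the `VOWEL_SIGNS` set are hypothetical, and the function only handles an identical vowel sign repeated back to back. It leaves the other two cases alone.

```python
# Assumption: the Bengali dependent vowel signs (kars) in U+09BE..U+09CC.
VOWEL_SIGNS = set(
    "\u09be\u09bf\u09c0\u09c1\u09c2\u09c3\u09c4\u09c7\u09c8\u09cb\u09cc"
)


def collapse_repeated_vowel_signs(word):
    """Drop a vowel sign when it immediately repeats the previous character."""
    cleaned = []
    for ch in word:
        if ch in VOWEL_SIGNS and cleaned and cleaned[-1] == ch:
            continue  # same vowel sign twice in a row: keep only the first
        cleaned.append(ch)
    return "".join(cleaned)


# ["র","ূ","ূ","প","ে","র"] from the example above, written with escapes.
broken = "\u09b0\u09c2\u09c2\u09aa\u09c7\u09b0"
print(list(collapse_repeated_vowel_signs(broken)))
```

Repeated consonants are deliberately left untouched, since doubled base letters can be legitimate.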

mnansary commented 3 years ago

Bhai, for the second one the graphemeparser did not fail; it's just that the word itself breaks like that, i.e. for ch in "ruper": print(ch) prints the "U-kar" twice.

Bhai, I don't exactly remember that word, but yes, removing the 'ৃ' is a possible solution.

Also, bhai, I have encountered the following symbols: "৵", "৶", "৺ডাঃ" etc. (trust me, the first one is not '৯').
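
Since "৵" and "৯" are easy to confuse by eye, printing code points and Unicode names removes the guesswork. The sketch below assumes that the Bengali currency and sign characters in U+09F4..U+09FA should be flagged. That range and the names `SUSPECT_SYMBOLS`, `describe` and `find_suspect_symbols` are illustrative choices, not something decided in this thread.

```python
import unicodedata

# Assumption: treat the currency numerators/denominator and the isshar sign
# (U+09F4..U+09FA) as symbols that should not appear inside ordinary words.
SUSPECT_SYMBOLS = {chr(cp) for cp in range(0x09F4, 0x09FB)}


def describe(word):
    """List (character, code point, Unicode name) for every character in word."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in word
    ]


def find_suspect_symbols(word):
    """Return the described characters that fall in SUSPECT_SYMBOLS."""
    return [row for row in describe(word) if row[0] in SUSPECT_SYMBOLS]


# "৺ডাঃ" from the comment above, written with escapes.
for row in describe("\u09fa\u09a1\u09be\u0983"):
    print(row)

# The look-alike pair: currency numerator two vs. digit nine.
print(describe("\u09f5\u09ef"))
```

Whether a flagged symbol should be stripped or the whole word discarded is left open here.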