Code normalization error for Malayalam

anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing

http://anoopkunchukuttan.github.io/indic_nlp_library/

MIT License


Code normalization error for Malayalam #7

Closed patelrajnath closed 4 years ago

patelrajnath commented 8 years ago

The following Malayalam text is being removed when normalized.

ദക്ഷിണാഫ്രിക്കയിലെ സെന്‍റര്‍ മൗണ്‍റ്റേന്‍സിലെ ബുഷ്മ്യാന്‍സ് ക്ല്യൂഫിനെ ഏറ്റവും നല്ല ഹോട്ടല്‍ , സിംഗപൂര്‍ എയര്‍ലൈന്‍സിനെ ഏറ്റവും നല്ല അന്താരാഷ്ട്റ വിമാനം , വെര്ജിന്‍ അമേരിക്കയെ ഏറ്റവും ശ്രേഷ്ഠമായ സ്വകാര്യ വിമാനം, ക്രിസ്റ്റല്‍ ക്രൂസിനെ ഏറ്റവും നല്ല ക്രൂസ് ലൈന്‍ ( വലിയ കപ്പല്‍ ) യആട്ട് ഓഫ് സീബോണിനെ ഏറ്റവും ശ്രേഷ്ഠമായ ക്രൂന്‍ ലൈന്‍ ( ചെറിയ കപ്പല്‍ ) എന്നിവയായി പ്രഖ്യാപിച്ചു .

Kindly check. Thank you.

anoopkunchukuttan commented 8 years ago

Do you mean Malayalam text in Telugu text is removed when the Telugu text is normalized?

patelrajnath commented 8 years ago

No, it is in Malayalam text only. My mistake: the heading should be "Code Normalization error for Malayalam". I will change it.

stultus commented 4 years ago

Randomly stumbled upon this repo. I was skimming through the Normalizer code and I see that you are stripping the ZWJ and ZWNJ characters. This is a bug. ZWJ and ZWNJ are an inherent part of the text and shouldn't be removed. Removing these characters alters the text and has serious implications in areas like search and sort.

anoopkunchukuttan commented 4 years ago

For Malayalam, ZWJ and ZWNJ are needed for the chillu characters only. However, the chillu characters also have their own codepoints. So, I first convert chillu representations to these codepoints before deleting the remaining ZWJ and ZWNJ. This does not semantically alter the text. I don't know the impact on search/sort, but I guess it is better to have a consistent representation than multiple representations. I hope that addresses your concern. Do point out if there is anything I am missing.
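The two-step approach described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's actual normalizer code: the function name `normalize_malayalam_sketch` and the `CHILLU_MAP` table are hypothetical, and the table is built from the atomic chillu code points U+0D7A to U+0D7F in the Unicode Malayalam block.

```python
# Hypothetical sketch; NOT the actual indic_nlp_library implementation.
ZWJ = "\u200d"     # ZERO WIDTH JOINER
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA

# Legacy chillu sequence (consonant + virama + ZWJ) -> atomic chillu.
CHILLU_MAP = {
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
    "\u0d15" + VIRAMA + ZWJ: "\u0d7f",  # KA  -> CHILLU K
}


def normalize_malayalam_sketch(text: str) -> str:
    """Convert legacy chillu sequences, then drop remaining ZWJ/ZWNJ."""
    # Step 1: rewrite chillu sequences while the ZWJ is still present.
    for sequence, atomic in CHILLU_MAP.items():
        text = text.replace(sequence, atomic)
    # Step 2: delete every joiner that was not part of a chillu sequence.
    return text.replace(ZWJ, "").replace(ZWNJ, "")
```

The order matters in this sketch: if the joiners were deleted first, the chillu sequences could no longer be told apart from a plain consonant followed by a virama.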

stultus commented 4 years ago

> For Malayalam, ZWJ and ZWNJ are needed for the chillu characters only

This is not correct. ZWJ is used to form chillus and to force C2-conjoining forms (this applies to all Indian languages); ZWNJ is used to indicate an explicit halant. Removing ZWJ and ZWNJ can produce erroneous text (e.g., താഴ്വാരം instead of താഴ്‌‌വാരം).

> I don't know the impact on search/sort

Okay, you might get the results if you strip the characters from both the target text and the query string, but the result might not be accurate because of the above-mentioned behaviour. (I don't have any examples ready with me to cite here though)

> but I guess it is better to have a consistent representation than multiple representations.

I totally agree with this. So just converting the Chillus to one form should do the trick.
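A minimal sketch of that alternative, which unifies the chillu representation but leaves every other ZWJ and ZWNJ in place, could look like this. The function name `canonicalize_chillus_only` and the mapping table are hypothetical and are not part of indic_nlp_library. The mapping assumes the atomic chillu code points U+0D7A to U+0D7F.

```python
# Hypothetical sketch of "convert chillus only"; not library code.
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Base consonant -> atomic chillu code point.
CHILLU_FOR_BASE = {
    "\u0d23": "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28": "\u0d7b",  # NA  -> CHILLU N
    "\u0d30": "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32": "\u0d7d",  # LA  -> CHILLU L
    "\u0d33": "\u0d7e",  # LLA -> CHILLU LL
    "\u0d15": "\u0d7f",  # KA  -> CHILLU K
}


def canonicalize_chillus_only(text: str) -> str:
    """Replace consonant + virama + ZWJ with the atomic chillu.

    All other ZWJ and ZWNJ characters are preserved unchanged.
    """
    for base, atomic in CHILLU_FOR_BASE.items():
        text = text.replace(base + VIRAMA + ZWJ, atomic)
    return text
```

With this variant, a word containing an explicit-halant ZWNJ comes back unchanged, so two words that differ only by that character remain distinct after normalization.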

anoopkunchukuttan commented 4 years ago

OK, thanks. So, താഴ്വാരം and താഴ്‌‌വാരം are semantically the same. The ZWNJ only controls formatting in this case. The chillu is the only case, as far as I know, where ZWJ actually alters the meaning. That has been handled as mentioned above. The goal of this normalization is to ensure similar representation for similar words for NLP applications. We don't seek to retain formatting characters.

stultus commented 4 years ago

താഴ്വാരം is non-existent in the dictionary. To cite a clearer example, consider the following case.

സദ്‌വാരം - (good week) - 0D38 0D26 0D4D 200C 0D35 0D3E 0D30 0D02
സദ്വാരം - (with hole) - 0D38 0D26 0D4D 0D35 0D3E 0D30 0D02

The two words in this pair differ in meaning, and the only difference in their encoding is the ZWNJ.
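The collision can be reproduced with a few lines of Python, using the code point sequences listed above. The helper names `strip_joiners` and `codepoints` are made up for this illustration and are not part of any library.

```python
# Illustrative only: helper names are hypothetical.
ZWJ = "\u200d"
ZWNJ = "\u200c"

# The two words from the example, written as escapes so the
# invisible ZWNJ is explicit.
good_week = "\u0d38\u0d26\u0d4d\u200c\u0d35\u0d3e\u0d30\u0d02"
with_hole = "\u0d38\u0d26\u0d4d\u0d35\u0d3e\u0d30\u0d02"


def strip_joiners(text: str) -> str:
    """Remove every ZWJ and ZWNJ from the text."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


def codepoints(text: str) -> str:
    """Render the text as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(codepoints(good_week))                  # 0D38 0D26 0D4D 200C 0D35 0D3E 0D30 0D02
print(good_week == with_hole)                 # False
print(strip_joiners(good_week) == with_hole)  # True
```

Once the joiners are stripped, the two strings compare equal, so any downstream lookup keyed on the normalized form can no longer distinguish them.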

anoopkunchukuttan commented 4 years ago

For each of these words, the only results Google gives are discussions on this issue :) Looks like a case of schwa deletion to me. I don't know how prevalent the use of ZWNJ is to address schwa deletion issues in Malayalam, although it should not have been done in the first place. Other languages have accepted schwa deletion problems in the script; I wonder why these hacky solutions are being proposed for Malayalam.

anoopkunchukuttan commented 4 years ago

Until I understand this issue further, I will stick to the current approach in indic_nlp_library: for chillus, convert to atomic codepoints, then remove the remaining ZWJ/ZWNJ.

anoopkunchukuttan commented 4 years ago

Closing, since the issue pointed out by Rajnath is not clear.