Closed patelrajnath closed 4 years ago
Do you mean Malayalam text in Telugu text is removed when the Telugu text is normalized?
No, it's in the Malayalam text only. My mistake. The heading should be "Code Normalization error for Malayalam"; I will change it.
Randomly stumbled upon this repo. Was skimming through the Normalizer code.
I see that you are stripping the ZWJ and ZWNJ characters. This is a bug. The ZWJ and ZWNJ are an inherent part of the text and shouldn't be removed. Removing these characters will alter the text entirely and will have serious implications in areas like search and sort.
For Malayalam, ZWJ and ZWNJ are needed for the chillu characters only. However, the chillu characters also have their own codepoints, so I first convert the chillu representations to these codepoints and then delete the remaining ZWJ and ZWNJ. This does not semantically alter the text. I don't know the impact on search/sort, but I guess it is better to have a consistent representation than having multiple representations. I hope that addresses your concern. Do point out anything I am missing.
"For Malayalam, ZWJ and ZWNJ are needed for the chillu characters only"
This is not correct. ZWJ is used to form chillus and to force C2-conjoining forms (this applies to all Indian languages), and ZWNJ is used to indicate the explicit halant. Removing ZWJ and ZWNJ can produce erroneous text (for example, താഴ്വാരം without the ZWNJ instead of താഴ്‌വാരം with it).
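To make the point about invisible characters concrete, here is a minimal sketch that prints the code points of two sequences that render almost identically. The helper names `describe` and `strip_joiners` are hypothetical and are not part of indic_nlp_library. The sample sequence (LLLA + VIRAMA + ZWNJ + VA) is an assumption modeled on the cluster in the example word above, not text taken from the thread.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point so joiners become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in text]


def strip_joiners(text):
    """Delete every ZWJ and ZWNJ (the behaviour being debated in this thread)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


# Assumed sample: LLLA + VIRAMA + ZWNJ + VA versus LLLA + VIRAMA + VA.
with_zwnj = "\u0d34\u0d4d\u200c\u0d35"
without_zwnj = "\u0d34\u0d4d\u0d35"

if __name__ == "__main__":
    for line in describe(with_zwnj):
        print(line)
    # The two strings are distinct until the joiners are stripped.
    print(with_zwnj == without_zwnj)
    print(strip_joiners(with_zwnj) == without_zwnj)
```

Whether collapsing the two sequences is acceptable is exactly the question under discussion; the sketch only shows that the collapse happens.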
"I don't know the impact on search/sort"
Okay, you might get the results if you strip the characters from both the target text and the query string, but the result might not be accurate because of the above-mentioned behaviour. (I don't have any examples ready with me to cite here though)
"but I guess it is better to have a consistent representation than having multiple representations."
I totally agree with this. So just converting the Chillus to one form should do the trick.
OK, thanks. So the two forms of താഴ്വാരം, with and without the ZWNJ, are semantically the same. The ZWNJ only controls formatting in this case. The chillu is the only case, as far as I know, where ZWJ actually alters the meaning, and that has been handled as mentioned above. The goal of this normalization is to ensure a similar representation for similar words in NLP applications. We don't seek to retain formatting characters.
താഴ്വാരം is non-existent in the dictionary. To cite a clearer example, consider the following case.
These pairs differ in meaning only by the presence of ZWNJ.
For each of these words, the only results Google gives are discussions of this issue :) This looks like a case of schwa deletion to me. I don't know how prevalent the use of ZWNJ to address schwa deletion is in Malayalam, although it should not have been done in the first place. Other languages have accepted schwa deletion problems in the script; I wonder why these hacky solutions are being proposed for Malayalam.
Till I understand this issue further, I will stick to the current approach in indic_nlp_library: for chillus convert to atomic codepoints and remove remaining ZWJ/ZWNJ.
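The two positions in this thread can be sketched side by side. This is only an illustration under stated assumptions, not the library's actual code: the function name `normalize_malayalam_sketch`, the `strip_joiners` flag, and the mapping table are hypothetical. The table assumes the older chillu encoding is base consonant + virama + ZWJ and maps it to the atomic chillu code points U+0D7A to U+0D7F.

```python
VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER

# Assumed mapping: legacy sequence (consonant + virama + ZWJ) -> atomic chillu.
LEGACY_CHILLU_TO_ATOMIC = {
    "\u0d23" + VIRAMA + ZWJ: "\u0d7a",  # NNA -> CHILLU NN
    "\u0d28" + VIRAMA + ZWJ: "\u0d7b",  # NA  -> CHILLU N
    "\u0d30" + VIRAMA + ZWJ: "\u0d7c",  # RA  -> CHILLU RR
    "\u0d32" + VIRAMA + ZWJ: "\u0d7d",  # LA  -> CHILLU L
    "\u0d33" + VIRAMA + ZWJ: "\u0d7e",  # LLA -> CHILLU LL
    "\u0d15" + VIRAMA + ZWJ: "\u0d7f",  # KA  -> CHILLU K
}


def normalize_malayalam_sketch(text, strip_joiners=True):
    """Convert legacy chillu sequences first, then optionally drop leftover joiners."""
    for legacy, atomic in LEGACY_CHILLU_TO_ATOMIC.items():
        text = text.replace(legacy, atomic)
    if strip_joiners:
        # Approach described above: remove whatever ZWJ/ZWNJ remain.
        text = text.replace(ZWJ, "").replace(ZWNJ, "")
    return text
```

With `strip_joiners=True` the sketch follows the approach described above; with `strip_joiners=False` it follows the alternative suggested earlier in the thread, where only the chillus are unified and the remaining ZWJ/ZWNJ are kept. The order matters in both cases: the chillu conversion has to run before any joiner is deleted, otherwise the legacy sequences can no longer be recognized.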
Closing, since the issue pointed out by Rajnath is not clear.
The following Malayalam text is being removed when normalized.
ദക്ഷിണാഫ്രിക്കയിലെ സെന്റര് മൗണ്റ്റേന്സിലെ ബുഷ്മ്യാന്സ് ക്ല്യൂഫിനെ ഏറ്റവും നല്ല ഹോട്ടല് , സിംഗപൂര് എയര്ലൈന്സിനെ ഏറ്റവും നല്ല അന്താരാഷ്ട്റ വിമാനം , വെര്ജിന് അമേരിക്കയെ ഏറ്റവും ശ്രേഷ്ഠമായ സ്വകാര്യ വിമാനം, ക്രിസ്റ്റല് ക്രൂസിനെ ഏറ്റവും നല്ല ക്രൂസ് ലൈന് ( വലിയ കപ്പല് ) യആട്ട് ഓഫ് സീബോണിനെ ഏറ്റവും ശ്രേഷ്ഠമായ ക്രൂന് ലൈന് ( ചെറിയ കപ്പല് ) എന്നിവയായി പ്രഖ്യാപിച്ചു .
Kindly check the same. Thank you.