Ezhil-Language-Foundation / open-tamil

Open Source Tamil NLP Tools - தமிழ் இயற்கை மொழி பகுப்பாய்வு நிரல்தொகுப்பு (Tamil natural language analysis toolkit)
http://tamilpesu.us
MIT License
266 stars 82 forks

Issue with get_letters method #132

Closed tshrinivasan closed 6 years ago

tshrinivasan commented 6 years ago

>>> import tamil
>>> a = "ரிஷ’"
>>> tamil.utf8.get_letters(a)
['ரி', 'ஷ’']

The single quote ’ is not returned as a separate letter; it stays attached to the preceding letter. Please fix this.

tshrinivasan commented 6 years ago

A few more examples: ரஸ“மா, ரஹ“மான்
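One way to get the behaviour requested here without touching the library is to post-process the returned list. The following is only a sketch: `split_punctuation` is a hypothetical helper, not part of open-tamil, and it assumes the input is a list of strings shaped like the get_letters output shown above.

```python
import unicodedata


def split_punctuation(letters):
    """Return a new list in which every punctuation mark is its own item.

    Hypothetical helper (not part of open-tamil). Unicode general
    categories starting with "P" cover ASCII quotes as well as the
    smart quotes U+2018, U+2019, U+201C and U+201D.
    """
    result = []
    for letter in letters:
        current = ""
        for ch in letter:
            if unicodedata.category(ch).startswith("P"):
                # Flush the letter collected so far, then emit the mark alone.
                if current:
                    result.append(current)
                    current = ""
                result.append(ch)
            else:
                current += ch
        if current:
            result.append(current)
    return result


# The list reported in this issue, with the quote stuck to the last letter.
print(split_punctuation(["ரி", "ஷ’"]))  # → ['ரி', 'ஷ', '’']
```

This leaves Tamil vowel signs and the pulli attached to their base letters, because those are mark characters rather than punctuation.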

arulalant commented 6 years ago

Is it necessary to treat the single quote as one of the characters? You can fix it yourself before passing the string to get_letters by replacing single/double quotes with an empty string:

a = "ரிஷ’"
a = a.replace("’", "")

arcturusannamalai commented 6 years ago

Thanks for reporting this case, @tshrinivasan. I think it should be possible to fix it. Thanks for the workaround, @arulalant!

tshrinivasan commented 6 years ago

https://github.com/nithyadurai87/tamil-sandhi-checker/issues/3

We have to replace the smart quotes with regular quotes.

Check the above issue.
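A minimal sketch of that replacement, using only the Python standard library. `normalize_quotes` and `SMART_QUOTES` are hypothetical names, not part of open-tamil, and this thread does not show what get_letters returns for the normalized strings, so treat it as a pre-processing step to try rather than a confirmed fix.

```python
# Map the four common "smart" quotes to their ASCII counterparts.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return text with smart quotes replaced by regular ASCII quotes."""
    return text.translate(SMART_QUOTES)


for word in ["ரிஷ’", "ரஸ“மா", "ரஹ“மான்"]:
    print(normalize_quotes(word))
```

Unlike deleting the quotes outright, this keeps the punctuation in the text, which matters if the result is written back to a document.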

2018-03-10 12:48 GMT+05:30 Muthiah Annamalai notifications@github.com:

Thanks for reporting case @tshrinivasan https://github.com/tshrinivasan



arcturusannamalai commented 6 years ago

@tshrinivasan I checked in the debugger and have difficulty reproducing the issue. It seems Python is not able to represent the quote character. Can you send the Unicode code-point version of the strings?

>>> import pprint
>>> for a in [u"ரிஷ ’", u"ரஸ “மா", u"ரஹ “மான்"]:
...     pprint.pprint(a)
...
u'\u0bb0\u0bbf\u0bb7 \u2019'
u'\u0bb0\u0bb8 \u201c\u0bae\u0bbe'
u'\u0bb0\u0bb9 \u201c\u0bae\u0bbe\u0ba9\u0bcd'

>>> for a in [u"ரிஷ’", u"ரஸ“மா", u"ரஹ“மான்"]:
...     pprint.pprint(a)
...
u'\u0bb0\u0bbf\u0bb7\u2019'
u'\u0bb0\u0bb8\u201c\u0bae\u0bbe'
u'\u0bb0\u0bb9\u201c\u0bae\u0bbe\u0ba9\u0bcd'
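The u'...' output above looks like Python 2 repr output. On Python 3, where printing a string no longer shows the escapes, a code-point listing can be produced along these lines. This is a sketch: `describe_code_points` is a hypothetical helper name, and only the standard `unicodedata` module is assumed.

```python
import unicodedata


def describe_code_points(text):
    """Return a (U+XXXX, Unicode name) pair for every character in text."""
    return [
        ("U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# First string from the report; the last entry is the smart quote.
for code_point, name in describe_code_points("ரிஷ’"):
    print(code_point, name)
```

Listing the names makes it obvious that the trailing character is U+2019 RIGHT SINGLE QUOTATION MARK rather than the ASCII apostrophe U+0027.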

arcturusannamalai commented 6 years ago

@tshrinivasan Can you try the fix bf4a29b40531ed688a5d7a9d06331b59995c9188 and, if it resolves the issue, add a unit test and close the issue?

arcturusannamalai commented 6 years ago

I will be closing this issue, as the fix seems sufficient to me.