Open johnsamuelwrites opened 5 years ago
@johnsamuelwrites It is possible to support other languages. If there are any languages you specifically want, I can add them; otherwise I will need to do a bit more work to support languages generally.
See also #9
Please check PR https://github.com/fnielsen/ordia/pull/22. I have added support for three languages.
It now runs at https://tools.wmflabs.org/ordia/text-to-lexemes
@johnsamuelwrites Could you check whether it works? I am not sure the word tokenization works for Hindi and Malayalam. Or maybe the problem is that there are few words in the respective languages. The tokenization is fairly simple at the moment.
For instance, https://tools.wmflabs.org/ordia/text-to-lexemes?text-language=ml&text=%E0%B4%85%E0%B5%BB%E0%B4%AA%E0%B4%A4%E0%B5%8D%20%20%E0%B4%A8%E0%B4%BE%E0%B4%B2%E0%B5%8D%E0%B4%AA%E0%B4%A4%E0%B5%8D%20%20%E0%B4%AE%E0%B5%81%E0%B4%AA%E0%B5%8D%E0%B4%AA%E0%B4%A4%E0%B5%8D . It looks wrong to me.
Thanks @fnielsen. You're right. I will also take a look.
The word tokenization is just a regular expression: https://github.com/fnielsen/ordia/blob/master/ordia/text.py#L8 Might Devanagari diacritics be an issue?
Apparently some Devanagari characters used in Hindi are not recognized as word characters by default:
>>> import unicodedata
>>> [unicodedata.category(c) for c in "काशीपुर भारत"]
['Lo', 'Mc', 'Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Zs', 'Lo', 'Mc', 'Lo', 'Lo']
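To illustrate the categories above: in Python's built-in `re` module, `\w` covers alphanumeric characters but not combining marks (`Mc`, `Mn`), so a `\w+`-style pattern can break a Devanagari word at each vowel sign. The following is only a sketch of one possible workaround, not the code used in Ordia; `is_word_char` and `tokenize` are hypothetical helper names, and the choice to treat every letter, mark, and number category as part of a word is an assumption.

```python
import re
import unicodedata


def is_word_char(ch):
    """Return True for letters (L*), combining marks (M*) and numbers (N*)."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text):
    """Split text into runs of word characters (hypothetical helper)."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-word character closes the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


text = "काशीपुर भारत"
# Built-in re: \w does not cover combining marks, so words may be split.
naive_tokens = re.findall(r"\w+", text)
# Mark-aware sketch: vowel signs stay attached to their base letters.
mark_aware_tokens = tokenize(text)
print(naive_tokens)
print(mark_aware_tokens)
```

This sketch does not handle format characters such as the zero-width joiner (category `Cf`), which can occur inside words in some Indic text; whether to keep them would need a separate decision.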
I wonder why you didn't enter the form രണ്ട് at https://www.wikidata.org/wiki/Lexeme:L2389? text-to-lexemes matches the text against the forms.
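As a rough illustration of why a missing form means no match, here is a minimal sketch of form-based lookup. It is not how Ordia implements the matching: `FORM_INDEX` is a hypothetical in-memory index standing in for the forms stored on Wikidata, `match_forms` is a made-up helper, and the NFC normalization step is an assumption.

```python
import unicodedata

# Hypothetical index from form text to lexeme ID; real forms live on Wikidata.
FORM_INDEX = {unicodedata.normalize("NFC", "രണ്ട്"): "L2389"}


def match_forms(tokens, form_index):
    """Pair each token with the lexeme ID of a matching form, or None."""
    matches = []
    for token in tokens:
        # Normalize so that equivalent Unicode sequences compare equal.
        key = unicodedata.normalize("NFC", token)
        matches.append((token, form_index.get(key)))
    return matches


# The second token has no entry in the index, so it stays unmatched.
matches = match_forms(["രണ്ട്", "xyz"], FORM_INDEX)
for token, lexeme_id in matches:
    print(token, lexeme_id)
```

A token is matched only if some lexeme lists exactly that string as a form, so a correctly tokenized word can still come back without a lexeme.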
I checked, it's tokenizing correctly. Thanks. I will add the forms.
For Bengali, see #48.
Currently https://tools.wmflabs.org/ordia/text-to-lexemes supports seven languages. Is it possible to include other languages? Or are you choosing the supported languages based on the number of lexemes in each language, so that the seven already chosen are the ones with the most lexemes?