fnielsen / ordia

Wikidata lexemes presentations

https://ordia.toolforge.org

Apache License 2.0

Support for other languages in Text-to-lexemes #20

Open johnsamuelwrites opened 5 years ago

johnsamuelwrites commented 5 years ago

Currently https://tools.wmflabs.org/ordia/text-to-lexemes supports seven languages. Is it possible to include other languages? Or are the supported languages chosen by the number of lexemes per language, so that the seven already included are the ones with the most lexemes?

fnielsen commented 5 years ago

@johnsamuelwrites It is possible to support other languages. If there are any languages you specifically want, I can add them. Otherwise I will need to do a bit more work to support languages in general.

fnielsen commented 5 years ago

See also #9

johnsamuelwrites commented 5 years ago

Please check PR https://github.com/fnielsen/ordia/pull/22. I have added support for three languages.

fnielsen commented 5 years ago

It is now running at https://tools.wmflabs.org/ordia/text-to-lexemes

@johnsamuelwrites Could you check whether it works? I am not sure the word tokenization works for Hindi and Malayalam. Or maybe it is just because there are few words in the respective languages. The tokenization is fairly simple at the moment.

For instance, https://tools.wmflabs.org/ordia/text-to-lexemes?text-language=ml&text=%E0%B4%85%E0%B5%BB%E0%B4%AA%E0%B4%A4%E0%B5%8D%20%20%E0%B4%A8%E0%B4%BE%E0%B4%B2%E0%B5%8D%E0%B4%AA%E0%B4%A4%E0%B5%8D%20%20%E0%B4%AE%E0%B5%81%E0%B4%AA%E0%B5%8D%E0%B4%AA%E0%B4%A4%E0%B5%8D . It looks wrong to me.

johnsamuelwrites commented 5 years ago

Thanks @fnielsen. You're right. I will also take a look.

fnielsen commented 5 years ago

The word tokenization is just a regular expression: https://github.com/fnielsen/ordia/blob/master/ordia/text.py#L8 Might Devanagari diacritics be an issue?

fnielsen commented 5 years ago

Apparently some of the Unicode characters used in Hindi are not recognized as word characters by default, namely the combining marks (categories Mc and Mn below):

>>> import unicodedata
>>> [unicodedata.category(c) for c in "काशीपुर भारत"]
['Lo', 'Mc', 'Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Zs', 'Lo', 'Mc', 'Lo', 'Lo']
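
The category listing above hints at why a plain `\w+` pattern breaks up Devanagari words: Python's `re` module does not treat combining marks (Mn, Mc) as word characters, so a word is cut at every vowel sign. The following is only a minimal sketch of the problem and of one possible workaround. The function names `naive_tokens` and `mark_aware_tokens` are hypothetical, and this is not necessarily how Ordia's `text.py` handles it.

```python
import re
import unicodedata


def naive_tokens(text):
    """Tokenize with a plain \\w+ pattern (assumed to resemble the simple regex approach)."""
    return re.findall(r"\w+", text)


def mark_aware_tokens(text):
    """Tokenize by treating letters (L*), marks (M*) and numbers (N*) as word characters.

    Hypothetical helper: it keeps combining vowel signs and viramas attached
    to the preceding consonant instead of splitting on them.
    """
    tokens = []
    current = []
    for char in text:
        # The first letter of the Unicode general category gives the major class.
        if unicodedata.category(char)[0] in "LMN":
            current.append(char)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    sample = "काशीपुर भारत"
    print(naive_tokens(sample))       # fragments split at the combining marks
    print(mark_aware_tokens(sample))  # whole words
```

For the sample string, `mark_aware_tokens` returns the two whole words, while the plain `\w+` pattern yields shorter fragments. The third-party `regex` package, which supports Unicode property classes such as `\p{M}`, might be another option.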

fnielsen commented 5 years ago

I wonder why you didn't type in the form of രണ്ട് at https://www.wikidata.org/wiki/Lexeme:L2389? The matching of text-to-lexemes happens with the forms.
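
As a rough illustration of what matching on forms means: each token is compared with the form representations recorded on lexemes, so a word that exists only as a lemma, without a matching form, would not be found. The sketch below uses a hypothetical in-memory index (`FORM_INDEX`) and a hypothetical function `match_forms`. Ordia itself presumably looks forms up in Wikidata rather than in a local dictionary, and the NFC normalization step is an assumption added here.

```python
import unicodedata

# Hypothetical index from form representation to lexeme ID.
# The single entry is the form mentioned in this thread.
FORM_INDEX = {
    unicodedata.normalize("NFC", "രണ്ട്"): "L2389",
}


def match_forms(tokens, form_index=FORM_INDEX):
    """Return (token, lexeme ID or None) pairs by exact lookup on forms."""
    matches = []
    for token in tokens:
        # Normalize so that equivalent Unicode sequences compare equal (assumption).
        key = unicodedata.normalize("NFC", token)
        matches.append((token, form_index.get(key)))
    return matches


if __name__ == "__main__":
    for token, lexeme in match_forms(["രണ്ട്", "xyz"]):
        print(token, lexeme)
```

A token with no entry in the index comes back paired with `None`, which corresponds to a word that text-to-lexemes cannot link until the form is added.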

fnielsen commented 5 years ago

It might work now: https://tools.wmflabs.org/ordia/text-to-lexemes?text-language=ml&text=%E0%B4%AA%E0%B5%82%E0%B4%A4%E0%B4%95%E0%B5%8D%E0%B4%95%E0%B5%81%E0%B4%9F%E0%B5%8D%E0%B4%9F%E0%B5%BB%20%20%E0%B4%AA%E0%B5%82%E0%B4%A4%E0%B4%95%E0%B5%8D%E0%B4%95%E0%B5%81%E0%B4%9F%E0%B5%8D%E0%B4%9F%E0%B5%BB

johnsamuelwrites commented 5 years ago

I checked, it's tokenizing correctly. Thanks. I will add the forms.

bodhisattwawiki commented 5 years ago

For Bengali, see #48.