fnielsen / ordia

Wikidata lexemes presentations
https://ordia.toolforge.org
Apache License 2.0

Punjabi Gurmukhi: diacritical characters interpreted in text-to-lexemes input as word break #144

Closed bgo-eiu closed 1 year ago

bgo-eiu commented 1 year ago

Diacritical characters in the Gurmukhi script appear to be treated as word breaks when Punjabi words are entered into the text-to-lexemes input, while words without these characters work fine. The text-to-lexemes tool also drops diacritical characters at the end of a word when the word is passed to Wikidata to create a new lexeme. Some examples:

There are some character combinations where this does not happen. For example, ਕ੍ਰੋਧੀ works fine.

fnielsen commented 1 year ago

I have applied the patch to the server. I think it works now: https://ordia.toolforge.org/text-to-lexemes?text-language=pa&text=%E0%A8%B8%E0%A9%8B%E0%A8%82%E0%A8%A3%E0%A8%BE
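
For readers following the thread, here is a minimal Python sketch of the kind of behavior reported above. It is not Ordia's code and not the applied patch; `query_text` and `tokenize` are hypothetical helpers written only for illustration. The sketch decodes the test word from the link above and shows that a tokenizer which accepts only letters and digits splits at combining marks and drops a trailing one, while a tokenizer that also accepts Unicode mark categories keeps the word whole.

```python
import unicodedata
from urllib.parse import urlsplit, parse_qs

# Link from the comment above; the `text` parameter is percent-encoded UTF-8.
URL = ("https://ordia.toolforge.org/text-to-lexemes"
       "?text-language=pa&text=%E0%A8%B8%E0%A9%8B%E0%A8%82%E0%A8%A3%E0%A8%BE")


def query_text(url):
    """Hypothetical helper: return the decoded `text` query parameter."""
    return parse_qs(urlsplit(url).query)["text"][0]


def tokenize(text, keep="LMN"):
    """Hypothetical helper: split text into runs of characters whose Unicode
    general category starts with a letter in `keep`
    (L = letters, M = combining marks, N = numbers)."""
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in keep:
            current.append(ch)
        elif current:
            # A character outside `keep` ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


word = query_text(URL)

# Letters and numbers only: the vowel signs and the bindi (categories Mn/Mc)
# act as separators, and the trailing vowel sign is lost.
letters_only = tokenize(word, keep="LN")

# Letters, marks, and numbers: the word stays in one piece.
mark_aware = tokenize(word)

print(len(letters_only), len(mark_aware))  # 2 1
```

Whether the original bug came from this kind of category mismatch is an assumption; the thread does not say what the patch changed. A letters-only rule would also split ਕ੍ਰੋਧੀ, which was reported to work, so the real cause was probably narrower than this sketch.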