Eurotermbank / Federated-Network-Toolkit-deployment


Terms with accent (stress) marks [Lithuanian] #55

Open AstaMit opened 3 years ago

AstaMit commented 3 years ago

1) If I add Lithuanian terms with accent (stress) marks, search within the collection does not return the terms. But search across all collections (main page) shows them as "similar terms".


2) Not all terms with accent marks are displayed correctly. Of course, we can import terms without accent marks, but a large part of Lithuanian resources (especially of top quality) contains accented terms.

andish commented 2 years ago

@AstaMit, can you share a sample collection with terms containing stress marks?

AstaMit commented 2 years ago

https://otk.lki.lt/collections/10 (original file: https://e-seimas.lrs.lt/portal/legalAct/lt/TAD/TAIS.341647)

Important: terms with stress marks should be found using the search function.

andish commented 2 years ago

Can you attach a TBX file? otk.lki.lt is not accessible for me, and the original file needs to be converted. You can export that collection as TBX and attach it here.

AstaMit commented 2 years ago

Here it is.


Marghis commented 2 years ago

GitHub does not allow adding files with the TBX extension. Here is a zipped version: lithuanian-accentuated-terms-collection-10.zip

The collection (Medicinal plants) is publicly available on otk.lki.lt. Before opening otk.lki.lt, you need to open auth.otk.lki.lt and strapi.otk.lki.lt and, on both URLs, choose to continue despite the invalid certificate. After that, otk.lki.lt can also be opened.

andish commented 2 years ago

This terminology resource was built using the Lithuanian font Palemonas, which contains hundreds of custom-made characters that are not part of the Unicode standard. When publishing data to an environment that sticks to the standards, custom-made solutions and fonts cannot be used.

Is it right that these special diacritics are needed for the purpose of annotating pronunciation?

Is it possible to reduce some of those to simpler characters from the Unicode character set?

E.g., here is a short excerpt from the document showing two characters from this particular resource.

image

Very likely, these characters could be built using combining diacritics available in standard fonts. The decision is yours: what is the best way to make these entries usable when processing Lithuanian texts and terms?
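To illustrate the idea of building such characters from standard Unicode, here is a minimal Python sketch using only the standard library. The helper name `compose` is made up for this example and is not part of the toolkit. It shows that some accented letters have a single precomposed code point, while others (such as ė with an acute stress mark) exist only as a base letter followed by a combining mark.

```python
import unicodedata

# Combining marks from the Unicode "Combining Diacritical Marks" block.
GRAVE = "\u0300"
ACUTE = "\u0301"
TILDE = "\u0303"
DOT_ABOVE = "\u0307"


def compose(base: str, *marks: str) -> str:
    """Attach combining marks to a base letter and normalize to NFC.

    NFC merges the sequence into a precomposed character where Unicode
    defines one and leaves the remaining marks as combining characters.
    """
    return unicodedata.normalize("NFC", base + "".join(marks))


i_grave = compose("i", GRAVE)                 # one code point: U+00EC
a_tilde = compose("a", TILDE)                 # one code point: U+00E3
e_dot_acute = compose("e", DOT_ABOVE, ACUTE)  # two code points: U+0117 U+0301

for text in (i_grave, a_tilde, e_dot_acute):
    print(text, [f"U+{ord(ch):04X}" for ch in text])
```

Whether a given font renders the two-code-point sequences well is a separate question and depends on the font's support for combining marks.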

In real-life usage I find examples like https://www.medeinos.lt/augalu-katalogas/a/amurinis-adonis-adonis-amurensis/

Can you check whether decoders from Palemonas-encoded text to plain text exist in Lithuanian linguists' circles?

If not, someone might have to create such a translator in order to make those texts usable and more interoperable.
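If such a translator has to be written, it could be little more than a lookup table. The Python sketch below assumes that the custom characters are stored as Private Use Area code points; that is an assumption, and the two table entries are placeholders invented for illustration, not the real Palemonas assignments. The real table would have to come from the font's documentation. The function name `decode_pua` is also hypothetical.

```python
import unicodedata

# HYPOTHETICAL table: these code points are placeholders, NOT the actual
# Palemonas assignments. Each custom character maps to a base letter
# followed by standard combining marks.
PUA_TO_UNICODE = {
    "\ue000": "a\u0303",        # placeholder -> a + combining tilde
    "\ue001": "e\u0307\u0301",  # placeholder -> e + dot above + acute
}


def decode_pua(text: str, table: dict[str, str] = PUA_TO_UNICODE):
    """Replace custom characters via the table and normalize to NFC.

    Returns the converted text plus the set of private-use characters
    that had no table entry, so they can be reviewed by hand.
    """
    out = []
    unknown = set()
    for ch in text:
        if ch in table:
            out.append(table[ch])
        else:
            # General category "Co" marks private-use code points.
            if unicodedata.category(ch) == "Co":
                unknown.add(ch)
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out)), unknown


converted, leftovers = decode_pua("\ue000bras")
print(converted, leftovers)
```

Reporting unmapped private-use characters instead of silently dropping them makes it easier to complete the table incrementally.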

Marghis commented 2 years ago

Hi, we've converted the custom Palemonas accented characters to Unicode-compliant combinations of base symbols and diacritical marks. I've uploaded this version to our node (here is the TBX export from the node: Medical Plants LT-LA collection accents-as-components.zip). Accented symbols are now displayed correctly, but terms cannot be found when the query is entered without stress marks. For example, the collection contains the term kìninis ãbras; if I enter kininis (without the accent) in the search box, the term is not found. For it to be found, I need to enter kìninis, with the combining stress mark.
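One common way to get the expected behaviour is to index and query on a stress-free form of each term. The Python sketch below is an illustration only, not how the toolkit's search is implemented; the function names are made up. It assumes the stress marks are the combining grave, acute and tilde, and it removes only those, so letters of the Lithuanian alphabet such as ė, ų or š keep their own diacritics.

```python
import unicodedata

# Combining grave, acute and tilde: the marks assumed to carry stress here.
STRESS_MARKS = {"\u0300", "\u0301", "\u0303"}


def strip_stress(text: str) -> str:
    """Remove stress marks but keep orthographic diacritics (ogonek, caron...)."""
    # NFD splits precomposed letters so every mark becomes its own character.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    # NFC puts the remaining marks back onto their base letters.
    return unicodedata.normalize("NFC", kept)


def search_key(text: str) -> str:
    """Form used for comparison: stress removed, case folded."""
    return strip_stress(text).casefold()


def find_terms(query: str, terms: list[str]) -> list[str]:
    """Return the terms whose search key contains the query's search key."""
    key = search_key(query)
    return [term for term in terms if key in search_key(term)]


terms = ["ki\u0300ninis a\u0303bras"]  # kìninis ãbras, stored with combining marks
print(find_terms("kininis", terms))    # query typed without stress
print(find_terms("ki\u0300ninis", terms))  # query typed with stress
```

Because both the stored term and the query pass through the same `search_key`, the term is matched whether or not the user types the stress mark, while the original accented form is still what gets displayed.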