alpheios-project / alignment-editor-new

The Alpheios Translation Alignment editor allows you to create word-by-word alignments between multiple texts.

test language support in tokenization #145

Closed monzug closed 3 years ago

monzug commented 3 years ago

Since we had issues with Chinese (and Irina said also with Russian, Ukrainian, Thai, and Vietnamese), let's test all languages in Target or Origin texts. So far, I have tested all languages starting with A, B, and C.

irina060981 commented 3 years ago

Related to the issue https://github.com/alpheios-project/tokenizer/issues/33 and PR - https://github.com/alpheios-project/tokenizer/pull/34

monzug commented 3 years ago

@irina060981 I got a 500 error when using a file saved with Japanese as its language.

irina060981 commented 3 years ago

Yes, that is right - we have two unsupported languages, Japanese and Korean. It is described in https://github.com/alpheios-project/tokenizer/issues/33 and I also sent this information by email on 2 February.

monzug commented 3 years ago

Sorry, I didn't see Japanese in the email, which is why I added it here. I will wait until the fix is completed before I finish testing all languages.


monzug commented 3 years ago

I confirm that it is working for Russian, Ukrainian, Thai, and Vietnamese, but as noted in alpheios-project/tokenizer#33, I am still getting a 500 error for Japanese and Korean.

I also tested the languages beginning with the letters D and E.

monzug commented 3 years ago

Telugu and Sanskrit also give a 500 error. See the attachment.

Screen Shot 2021-03-15 at 2 43 03 PM

monzug commented 3 years ago

In the drop-down we have the language "Ukainian", which I have never heard of. @irina060981, could it be a misspelling of Ukrainian? If so, you could use this issue to fix the misspelled language name. Thanks. I will add the two languages with the 500 error to alpheios-project/tokenizer#33

irina060981 commented 3 years ago

@monzug, I have a suggestion: maybe it is worth creating a new issue for each failing language, with text samples? It would be useful for the developer, who could then investigate and test with a ready-made text sample.
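A per-language check with text samples could be scripted roughly as follows. This is only a sketch under stated assumptions: `fake_tokenize` is a hypothetical stand-in that imitates the 500 errors reported in this thread and is not the real tokenizer API, and the language codes `eng`, `tel`, and `san` are assumed for illustration. The sample sentences are the ones quoted in this issue.

```python
from typing import Callable, Dict, List, NamedTuple


class Failure(NamedTuple):
    lang: str    # language code the sample was submitted with
    sample: str  # text that triggered the failure, ready to paste into an issue
    error: str   # error message returned by the tokenizer call


def find_failing_languages(
    samples: Dict[str, str],
    tokenize: Callable[[str, str], List[str]],
) -> List[Failure]:
    """Call tokenize(text, lang) for every sample and collect the ones that raise."""
    failures = []
    for lang, text in samples.items():
        try:
            tokenize(text, lang)
        except Exception as exc:  # keep going so one failure does not hide the others
            failures.append(Failure(lang, text, str(exc)))
    return failures


def fake_tokenize(text: str, lang: str) -> List[str]:
    """Hypothetical stand-in for the real tokenizer; fails for the codes below."""
    if lang in {"tel", "san"}:  # assumed codes for Telugu and Sanskrit
        raise RuntimeError("500 Internal Server Error")
    return text.split()


samples = {
    "eng": "All human beings are born free and equal in dignity and rights.",
    "tel": "Pratipattisvatvamula visyamuna mānavulellarunu janmataḥ "
           "svataṁtrulunu samānulunu naguduru.",
    "san": "Sarvē mānavāḥ svatantrāḥ samutpannāḥ vartantē api ca, "
           "gauravadr̥śā adhikāradr̥śā ca samānāḥ ēva vartantē.",
}

for failure in find_failing_languages(samples, fake_tokenize):
    print(f"{failure.lang}: {failure.error}")
```

Each `Failure` keeps the language, the exact sample, and the error together, so one issue per failing language can be filed with a reproducible text. Swapping `fake_tokenize` for a function that calls the actual service would be the only change needed.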

monzug commented 3 years ago

I used this text for Telugu:

Pratipattisvatvamula visyamuna mānavulellarunu janmataḥ svataṁtrulunu samānulunu naguduru.

Vāru vivēdanāṁtaḥkaraṇa saṁpannulaguṭacaē parasparamu bhrātṛbhāvamutō vartiṁpavalayunu

In English:

All human beings are born free and equal in dignity and rights.

They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

monzug commented 3 years ago

I used the same text for Sanskrit, from https://omniglot.com/writing/sanskrit.htm

Sarvē mānavāḥ svatantrāḥ samutpannāḥ vartantē api ca, gauravadr̥śā adhikāradr̥śā ca samānāḥ ēva vartantē.

Ētē sarvē cētanā-tarka-śaktibhyāṁ susampannāḥ santi. Api ca, sarvē´pi bandhutva-bhāvanayā parasparaṁ vyavaharantu.

monzug commented 3 years ago

I created two new issues to report the problem with the Telugu and Sanskrit languages and the misspelled language name.