StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.
https://akkerman.ai/FluentAI/

Convert non-Latin languages to Latin script before calculating orthographic similarity #42

Closed StephanAkkerman closed 3 weeks ago

StephanAkkerman commented 3 weeks ago
  1. Description:

    • Problem: For non-Latin languages the orthographic similarity cannot be calculated meaningfully; it always comes out as 0

    • Solution: Converting the text to a Latin script (romaji for Japanese) would be the best approach

    • Prerequisites: Look into methods that support the most languages

  2. Tasks:

    • Convert tokens to Latin (Roman) script before calculating orthographic similarity (see the sketch after the table below)
  3. Additional context

           token_ort  token_ipa  distance  imageability  orthographic_similarity  semantic_similarity
    75789      mauer       maʊɝ  0.923380      0.483503                      0.0             0.000000
    75778        mau        maʊ  0.923380      0.564250                      0.0             0.329258
    74445        mao        maʊ  0.923380      0.452691                      0.0             0.361128
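
For illustration, a minimal sketch of the intended flow (romanize first, then compare). The transliterate helper below is only a placeholder for whichever library ends up being used, and the normalized Levenshtein similarity is an assumption about how orthographic_similarity is computed, not the project's actual implementation.

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """1.0 for identical spellings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def transliterate(token: str, lang: str) -> str:
    """Identity stub: swap in polyglot, PyICU, or the googletrans pronunciation."""
    return token


# With the identity stub, a non-Latin token still scores 0 against "mauer";
# replacing the stub with a real transliterator is the whole point of this issue.
print(orthographic_similarity(transliterate("마우스", "ko"), "mauer"))  # 0.0 until romanized
print(orthographic_similarity("mauer", "mau"))                         # 0.6 on Latin script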
StephanAkkerman commented 3 weeks ago

The method for doing this is called "transliteration". Candidate libraries:

• https://github.com/aboSamoor/polyglot: 69 languages (nice), but does not install with pip
• https://github.com/3aransia/3aransia: 70 languages; seems like a fork of polyglot, and its output differs from Google Translate
• https://pypi.org/project/PyICU/: popular but hard to install
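
For reference, a minimal PyICU sketch (assuming the package can actually be installed, which is the sticking point noted above and in the next comment); ICU's Any-Latin transform covers most scripts in a single call:

import icu

# "Any-Latin" romanizes most scripts; "Latin-ASCII" then strips diacritics.
to_latin = icu.Transliterator.createInstance("Any-Latin; Latin-ASCII")
print(to_latin.transliterate("안녕하세요"))  # prints a Latin-script romanization of the Korean greeting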

StephanAkkerman commented 3 weeks ago

Maybe we can install PyICU easily using this: https://github.com/cgohlke/pyicu-build (does not work)

StephanAkkerman commented 3 weeks ago

Best solution without any extra dependencies: googletrans. Also return the transliteration during the semantic process.

Use googletrans to translate the text with dest set to the source language itself and take the pronunciation from the result.

from googletrans import Translator

translator = Translator()
# Translating Korean text with dest="ko" (same as the source language) is a
# trick to get the Latin-script pronunciation back from Google Translate.
result = translator.translate("안녕하세요.", dest="ko")
print(result.pronunciation)