StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.
https://akkerman.ai/FluentAI/

Convert non-Latin languages to Latin script before calculating orthographic similarity #42

Closed StephanAkkerman closed 3 weeks ago

StephanAkkerman commented 3 weeks ago
  1. Description:

    • Problem: For non-Latin languages the orthographic similarity cannot be calculated meaningfully; it always comes out as 0

    • Solution: Converting the text to a Latin script (romaji for Japanese) would be the best approach

    • Prerequisites: Look into methods that support the most languages

  2. Tasks:

    • Convert tokens to Latin (Roman) script before calculating orthographic similarity (see the sketch after the table below)
  3. Additional context

           token_ort  token_ipa  distance  imageability  orthographic_similarity  semantic_similarity
    75789      mauer       maʊɝ  0.923380      0.483503                      0.0             0.000000
    75778        mau        maʊ  0.923380      0.564250                      0.0             0.329258
    74445        mao        maʊ  0.923380      0.452691                      0.0             0.361128
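
For illustration, a minimal sketch of the intended flow (romanize first, then compare). The transliterate helper below is only a placeholder for whichever library ends up being used, and the normalized Levenshtein similarity is an assumption about how orthographic_similarity is computed, not the project's actual implementation.

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """1.0 for identical spellings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def transliterate(token: str, lang: str) -> str:
    """Identity stub: swap in polyglot, PyICU, or the googletrans pronunciation."""
    return token


# With the identity stub, a non-Latin token still scores 0 against "mauer";
# replacing the stub with a real transliterator is the whole point of this issue.
print(orthographic_similarity(transliterate("마우스", "ko"), "mauer"))  # 0.0 until romanized
print(orthographic_similarity("mauer", "mau"))                         # 0.6 on Latin script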
StephanAkkerman commented 3 weeks ago

The method for doing this is called "transliteration". Candidate libraries:

• https://github.com/aboSamoor/polyglot: 69 languages (nice), but does not install with pip
• https://github.com/3aransia/3aransia: 70 languages; seems like a fork of polyglot, and its output differs from Google Translate
• https://pypi.org/project/PyICU/: popular but hard to install
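
For reference, a minimal PyICU sketch (assuming the package can actually be installed, which is the sticking point noted above and in the next comment); ICU's Any-Latin transform covers most scripts in a single call:

import icu

# "Any-Latin" romanizes most scripts; "Latin-ASCII" then strips diacritics.
to_latin = icu.Transliterator.createInstance("Any-Latin; Latin-ASCII")
print(to_latin.transliterate("안녕하세요"))  # prints a Latin-script romanization of the Korean greeting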

StephanAkkerman commented 3 weeks ago

Maybe we can install PyICU easily using this: https://github.com/cgohlke/pyicu-build (does not work)

StephanAkkerman commented 3 weeks ago

Best solution without any extra dependencies: googletrans. Also return the transliteration during the semantic process.

Use googletrans to translate the text with dest set to the source language itself and take the pronunciation from the result.

from googletrans import Translator

translator = Translator()
# Translating Korean text with dest="ko" (same as the source language) is a
# trick to get the Latin-script pronunciation back from Google Translate.
result = translator.translate("안녕하세요.", dest="ko")
print(result.pronunciation)