dmort27 / epitran

A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
MIT License

Many Chinese words are not transcribed by cmn-Hans/cmn-Hant #132

Open stefantaubert opened 1 year ago

stefantaubert commented 1 year ago

Epitran didn't transcribe the vocabulary in failed.txt (4274 entries): with both cmn-Hans and cmn-Hant, transliterate() returned each entry unchanged.

Would it be possible to support transcription of these entries?

Script I ran:

import epitran
from pathlib import Path

# Download the CC-CEDICT dictionary file.
epitran.download.cedict()
cedict = "/home/mi/epitran_data/cedict.txt"
epi = epitran.Epitran('cmn-Hans', tones=False, ligatures=True, cedict_file=cedict)
epi2 = epitran.Epitran('cmn-Hant', tones=False, ligatures=True, cedict_file=cedict)
voc = Path("/home/mi/playground/chn/vocabulary.txt").read_text("UTF-8").splitlines()
failed = []

for v in voc:
    # Try Simplified first, then Traditional. Output identical to the
    # input is treated as a failed transcription.
    if epi.transliterate(v) == v and epi2.transliterate(v) == v:
        failed.append(v)
Path("/tmp/failed.txt").write_text("\n".join(failed), "UTF-8")

Epitran version: 1.22
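
The script above only flags entries that come back completely unchanged. As a rough sketch, the check could also catch partially transcribed words by testing whether any Han character remains in the output. This assumes untranscribed characters are passed through as-is, which the script above already relies on. The helper name `find_untranscribed` and the chosen Unicode ranges are illustrative assumptions, not part of Epitran.

```python
import re
from typing import Callable, Iterable, List

# Assumption: Basic CJK Unified Ideographs plus Extension A covers the
# vocabulary in question; rarer extension blocks are not matched.
HAN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def find_untranscribed(
    words: Iterable[str], transliterate: Callable[[str], str]
) -> List[str]:
    """Return the words whose output still contains a Han character.

    `transliterate` is any callable mapping a string to a string,
    for example the bound method `epi.transliterate`.
    """
    failed = []
    for word in words:
        result = transliterate(word)
        # A leftover Han character suggests that at least part of the
        # word was not transcribed.
        if HAN_RE.search(result):
            failed.append(word)
    return failed
```

With the objects from the script above, this could be called as `find_untranscribed(voc, epi.transliterate)`. Unlike the `result == v` comparison, it does not flag entries that contain no Han characters at all.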