Vuizur / tatoeba-to-anki

Creates Anki Flash cards from Tatoeba sentences, ordering them by difficulty and downloading audio
GNU General Public License v3.0

Chinese error unicodedata.normalize #2

Closed ghost closed 2 years ago

ghost commented 2 years ago

This error shows up when trying to process Chinese (Mandarin, or any other variant):

❯ poetry run python ./tatoeba_to_anki/main.py
Config loaded
Sentences loaded
Building prefix dict from /home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/wordfreq/data/jieba_zh.txt ...
Loading model from cache /tmp/jieba.u19dca2b4d4fe5e1915d240249cebd313.cache
Loading model cost 0.213 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "/home/yaoberh/tatoeba-to-anki/./tatoeba_to_anki/main.py", line 38, in <module>
    sorted_sentences = order_sentences(df, config["source_language"])
  File "/home/yaoberh/tatoeba-to-anki/tatoeba_to_anki/sort_sentences.py", line 24, in order_sentences
    df["sentence_word_frequency"] = df["target_sentence"].apply(
  File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/pandas/core/series.py", line 4433, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1082, in apply
    return self.apply_standard()
  File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1137, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer
  File "/home/yaoberh/tatoeba-to-anki/tatoeba_to_anki/sort_sentences.py", line 25, in <lambda>
    lambda sentence: get_sentence_word_frequency(sentence, source_lang)
  File "/home/yaoberh/tatoeba-to-anki/tatoeba_to_anki/sort_sentences.py", line 8, in get_sentence_word_frequency
    words = wordfreq.tokenize(sentence, source_lang)
  File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/wordfreq/tokens.py", line 261, in tokenize
    text = preprocess_text(text, language)
  File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/wordfreq/preprocess.py", line 172, in preprocess_text
    text = unicodedata.normalize(info["normal_form"], text)
TypeError: normalize() argument 2 must be str, not float
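A likely explanation (an assumption, not confirmed in this thread): a missing or empty sentence cell in the Tatoeba data is read by pandas as NaN, which is a float, and that float is eventually handed to `unicodedata.normalize`, producing exactly this TypeError. A minimal sketch reproducing the symptom and one possible workaround, using a hypothetical `target_sentence` column:

```python
import unicodedata

import pandas as pd

# Hypothetical reproduction: a missing sentence becomes NaN (a float),
# and unicodedata.normalize rejects any non-str second argument.
df = pd.DataFrame({"target_sentence": ["你好世界", float("nan")]})

try:
    df["target_sentence"].apply(lambda s: unicodedata.normalize("NFC", s))
except TypeError as e:
    # Same class of error as in the traceback above.
    print(e)

# One possible workaround: drop rows with a missing sentence before
# applying any per-sentence processing.
clean = df.dropna(subset=["target_sentence"])
normalized = clean["target_sentence"].apply(
    lambda s: unicodedata.normalize("NFC", s)
)
print(normalized.tolist())
```

Filtering NaN rows (or casting the column with `astype(str)` after filling missing values) keeps only real strings flowing into `wordfreq.tokenize`.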
Vuizur commented 2 years ago

This should be fixed now.