This error shows up when trying to process Chinese (Mandarin or any other variety):
❯ poetry run python ./tatoeba_to_anki/main.py
Config loaded
Sentences loaded
Building prefix dict from /home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/wordfreq/data/jieba_zh.txt ...
Loading model from cache /tmp/jieba.u19dca2b4d4fe5e1915d240249cebd313.cache
Loading model cost 0.213 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
File "/home/yaoberh/tatoeba-to-anki/./tatoeba_to_anki/main.py", line 38, in <module>
sorted_sentences = order_sentences(df, config["source_language"])
File "/home/yaoberh/tatoeba-to-anki/tatoeba_to_anki/sort_sentences.py", line 24, in order_sentences
df["sentence_word_frequency"] = df["target_sentence"].apply(
File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/pandas/core/series.py", line 4433, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1082, in apply
return self.apply_standard()
File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1137, in apply_standard
mapped = lib.map_infer(
File "pandas/_libs/lib.pyx", line 2870, in pandas._libs.lib.map_infer
File "/home/yaoberh/tatoeba-to-anki/tatoeba_to_anki/sort_sentences.py", line 25, in <lambda>
lambda sentence: get_sentence_word_frequency(sentence, source_lang)
File "/home/yaoberh/tatoeba-to-anki/tatoeba_to_anki/sort_sentences.py", line 8, in get_sentence_word_frequency
words = wordfreq.tokenize(sentence, source_lang)
File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/wordfreq/tokens.py", line 261, in tokenize
text = preprocess_text(text, language)
File "/home/yaoberh/.cache/pypoetry/virtualenvs/tatoeba-to-anki-LZuydzh--py3.10/lib/python3.10/site-packages/wordfreq/preprocess.py", line 172, in preprocess_text
text = unicodedata.normalize(info["normal_form"], text)
TypeError: normalize() argument 2 must be str, not float
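
A possible cause, stated as an assumption rather than a confirmed diagnosis: the last frame shows `unicodedata.normalize` receiving a float, which is what happens when a `target_sentence` cell is empty and pandas stores it as NaN. The sketch below uses only the standard library to reproduce the same kind of TypeError and to show one way of filtering such values out before tokenizing. The helper `is_usable_sentence` and the sample data are hypothetical and not part of tatoeba-to-anki or wordfreq.

```python
import unicodedata


def is_usable_sentence(value):
    """Hypothetical helper: accept only non-empty strings."""
    return isinstance(value, str) and value.strip() != ""


# A missing cell read by pandas typically ends up as a float NaN.
missing = float("nan")

# Reproduce the failure: normalize() rejects anything that is not a str.
error_message = None
try:
    unicodedata.normalize("NFKC", missing)
except TypeError as exc:
    error_message = str(exc)

# Hypothetical sample data containing one missing value.
sentences = ["我喜欢学习中文。", missing, "你好"]

# Keep only real strings before passing them to a tokenizer.
clean_sentences = [s for s in sentences if is_usable_sentence(s)]
print(error_message)
print(clean_sentences)
```

If this assumption holds, the equivalent step on the DataFrame would be something like `df.dropna(subset=["target_sentence"])` before the `apply` call in `order_sentences`.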