NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

es TN bug regarding the word 'o' #211

Open seunghunJi opened 1 month ago

seunghunJi commented 1 month ago

Describe the bug

Hi, I recently started using the Spanish text normalizer and found a bug. I don't expect the normalizer to convert the conjunction 'o' ("or") into a different word, 'oeste' ("west"), but it happens more often than not. I'm not proficient in Spanish, but I don't think this is a correct way to normalize a Spanish sentence. Is this expected behavior, or is it a known bug? I would appreciate it if you could take a look. The three sentences below are random examples I took from a Spanish dictionary. The version of nemo_text_processing I am using is 1.0.2.

Steps/Code to reproduce bug

Python code:

from nemo_text_processing.text_normalization import Normalizer

text_normalizer = Normalizer(input_case="lower_cased", lang="es", post_process=True)

text = ["Norte o Sur?", "O te callas, o me marcho.", "Date prisa, o perderás el tren."]

for t in text:
    print(t, "->", text_normalizer.normalize(t, punct_post_process=True, punct_pre_process=True))

Output:

Norte o Sur? -> Norte oeste Sur?
O te callas, o me marcho. -> O te callas, oeste me marcho.
Date prisa, o perderás el tren. -> Date prisa, oeste perderás el tren.

Expected behavior

No normalization on any of the 'o's in the above sentences.
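The expectation above can be turned into a small check that flags any standalone 'o' that comes back as 'oeste'. This is only a sketch for illustration: the helper name `spurious_o_expansions` is hypothetical and not part of nemo_text_processing, and it assumes the normalizer leaves the number of whitespace-separated tokens unchanged, which holds for the three sentences in this report but not for inputs containing numbers or abbreviations that expand into several words.

```python
from typing import List

# Punctuation stripped from token edges before comparing (an assumption
# that covers the sentences in this report, not all Spanish text).
_PUNCT = ".,;:!?¿¡\"'"


def spurious_o_expansions(original: str, normalized: str) -> List[int]:
    """Return token positions where a standalone 'o' became 'oeste'.

    Hypothetical helper: compares the two strings token by token and
    assumes both contain the same number of whitespace-separated tokens.
    """
    src_tokens = original.split()
    out_tokens = normalized.split()
    if len(src_tokens) != len(out_tokens):
        raise ValueError("token counts differ; cannot align tokens one-to-one")

    positions = []
    for i, (src, out) in enumerate(zip(src_tokens, out_tokens)):
        src_word = src.strip(_PUNCT).lower()
        out_word = out.strip(_PUNCT).lower()
        # Flag only the case described in this issue: 'o' rewritten as 'oeste'.
        if src_word == "o" and out_word == "oeste":
            positions.append(i)
    return positions


# Input/output pairs copied from the report above.
pairs = [
    ("Norte o Sur?", "Norte oeste Sur?"),
    ("O te callas, o me marcho.", "O te callas, oeste me marcho."),
    ("Date prisa, o perderás el tren.", "Date prisa, oeste perderás el tren."),
]
for original, normalized in pairs:
    print(original, "->", spurious_o_expansions(original, normalized))
```

To use it as a regression check, the second element of each pair could be replaced with the string returned by `text_normalizer.normalize(...)` from the reproduction code; an empty list for every sentence would match the expected behavior.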

Thanks!

Oktai15 commented 2 weeks ago

@mgrafu @elenaltz @ekmb, version 1.1.0 still doesn't handle this test sentence correctly: "O te callas, o me marcho." Could you check it?