NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

Some bugs in English, German, Spanish, Italian normalizers #166

Closed Oktai15 closed 4 weeks ago

Oktai15 commented 2 months ago

Hi!

I found a bug in English normalization. The following code is applied:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
  input_case="cased",
  lang="en",
  deterministic=True,
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=Here is mail.nasa.gov.
norm_text=Here is mail dot nasa dot gov dot
expected output=Here is mail dot nasa dot gov.

A similar bug occurs in German normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="de",
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=Here is brettspielversand.de.
norm_text=Here is b r e t t s p i e l v e r s a n d punkt de punkt
expected output=Here is brettspielversand punkt de.

There is a similar problem with text=KIM.com-Specials.. I hit the same problem with website names in Spanish and Italian text.
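Until the grammar handles this case, one possible workaround is to hide the sentence-final period from the normalizer and re-attach it afterwards. The sketch below is only an illustration: `normalize_keeping_final_period` and `fake_normalize` are hypothetical names, not part of NeMo-text-processing, and the stand-in normalizer merely mimics the "dot" verbalization so the example runs without NeMo installed. It also assumes that a single trailing period is always sentence punctuation, which will not hold for every input.

```python
from typing import Callable


def normalize_keeping_final_period(text: str, normalize_fn: Callable[[str], str]) -> str:
    """Hypothetical workaround: normalize everything except a sentence-final period."""
    stripped = text.rstrip()
    # Assumption: exactly one trailing "." is sentence punctuation, not part of a URL.
    if stripped.endswith(".") and not stripped.endswith(".."):
        return normalize_fn(stripped[:-1]) + "."
    return normalize_fn(text)


def fake_normalize(text: str) -> str:
    """Stand-in for normalizer.normalize; verbalizes every period as 'dot'."""
    return text.replace(".", " dot ")


print(normalize_keeping_final_period("Here is mail.nasa.gov.", fake_normalize))
# prints: Here is mail dot nasa dot gov.
```

With the real library, `normalize_fn` could be something like `lambda t: normalizer.normalize(t, punct_post_process=True)`, using the `normalizer` object created above.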

I also found a specific bug in Spanish normalization. The following code is applied:

normalizer = Normalizer(
  input_case="cased",
  lang="es",
)
norm_text = normalizer.normalize(text, punct_post_process=True)

text=El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text=El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
I am not sure what the expected output is, but the current norm_text looks wrong: the name "Li" is apparently read as the Roman numeral LI (51) and verbalized as an ordinal.

dmylzenova commented 1 month ago

I also ran into similar behavior:

text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen Sinnesbereichen.in allen Sinnen."
normalized_text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen S i n n e s b e r e i c h e n punkt in allen Sinnen."

zoobereq commented 1 month ago

The above is expected behavior. The normalizer assumes that consecutive sentences are separated by a period followed by at least one whitespace character. The string quoted above consists of two clauses separated by a period with no whitespace after it. Adding a space after the period yields the correct normalization.
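If the input is known to contain no URLs or domain names, the missing space can be added in a preprocessing step. The following is a minimal sketch under that assumption; `add_space_after_period` is a hypothetical helper, not part of the library. Note that it would also split genuine domains such as mail.nasa.gov, so it is not suitable for text like the examples earlier in this thread.

```python
import re

# A period directly between a run of two or more letters and another letter,
# e.g. "Sinnesbereichen.in". Requiring two letters before the period leaves
# abbreviations like "z.B." and numbers like "3.14" untouched.
_MISSING_SPACE = re.compile(r"(?<=[^\W\d_]{2})\.(?=[^\W\d_])")


def add_space_after_period(text: str) -> str:
    """Hypothetical helper: insert a space after a period that is glued to the next word."""
    return _MISSING_SPACE.sub(". ", text)


print(add_space_after_period("in allen Sinnesbereichen.in allen Sinnen."))
# prints: in allen Sinnesbereichen. in allen Sinnen.
```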