NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

Digits Remain Unnormalized in European Languages Output #171

Open dmylzenova opened 1 month ago

dmylzenova commented 1 month ago

Hello,

Digits remain unnormalized in the output text when I use the NeMo text normalization library (nemo_text_processing) with European languages such as German (de), Italian (it), and French (fr). The normalized output is expected to contain no digits at all.

Here is an example:

from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case="cased", lang="it")
text = "il 48% ha risposto che avrebbe dovuto provenire dal proprio budget."
norm_text = normalizer.normalize(text, punct_post_process=True)
print(norm_text)

Expected output: No digits in the normalized text.

Actual output: 'il 48% ha risposto che avrebbe dovuto provenire dal proprio budget.'

Additional Examples:

Other examples with similar behavior, in the format (text, normalized_text):

[('Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.',
  'Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.'),
 ('Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.',
  'Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.'),
 ('Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.',
  'Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.'),
 ('Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.',
  'Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.'),
 ('Il y a 1,08 milliard de vaches dans le monde qui émettent 18% des émissions de carbone.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone'),
 ('Ci sono 1,08 miliardi di mucche nel mondo che emettono il 18% delle emissioni di carbonio.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone.')]

Expected Behavior: The normalized text should not contain any digits.

Actual Behavior: Digits are retained in the normalized output, which contradicts the expected behavior of a text normalization tool. The issue does not occur for every input: in the examples above, some numbers are verbalized while others in the same sentence are left as digits. This is particularly problematic for tasks that require clean, digit-free text, such as grapheme-to-phoneme (g2p) conversion.
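Since the failure is intermittent, a small check on the normalizer output can help flag affected sentences before they reach g2p. The following is only a minimal sketch: `tokens_with_digits` and `residual_digit_report` are hypothetical helper names, not part of nemo_text_processing, and the check works on plain strings, so it makes no assumption about the Normalizer API itself.

```python
def tokens_with_digits(text: str) -> list[str]:
    """Return the whitespace-delimited tokens of `text` that still contain a digit."""
    return [tok for tok in text.split() if any(ch.isdigit() for ch in tok)]


def residual_digit_report(pairs):
    """Given (text, normalized_text) pairs, return (normalized_text, leftover_tokens)
    for every pair whose normalized text still contains digits."""
    report = []
    for _, normalized in pairs:
        leftovers = tokens_with_digits(normalized)
        if leftovers:
            report.append((normalized, leftovers))
    return report


# Two (text, normalized_text) pairs shortened from the examples in this issue.
pairs = [
    ("tous les vendredis à 20h.", "tous les vendredis à 20h."),
    ("Il y a 1,08 milliard de vaches", "Il y a un virgule zéro huit milliard de vaches"),
]

for normalized, leftovers in residual_digit_report(pairs):
    print(f"digits left in {normalized!r}: {leftovers}")
```

With the two shortened pairs above, only the first is reported, with '20h.' as the leftover token. The same check can be applied to the strings returned by normalizer.normalize(...) in the reproduction snippet.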

Environment:

NeMo version: nemo_text_processing==0.3.0rc0
Python version: 3.11.8

zoobereq commented 1 month ago

github-actions[bot] commented 2 days ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.