Digits Remain Unnormalized in European Languages Output

dmylzenova commented 1 month ago

Hello,

I have observed that digits remain unnormalized in the output text of the NeMo text normalization library (nemo_text_processing), specifically for European languages such as German (de), Italian (it), and French (fr). The normalized output is expected to contain no digits, but in these cases it still does.

Here is an example:

from nemo_text_processing.text_normalization.normalize import Normalizer

# Italian normalizer for cased input
normalizer = Normalizer(input_case="cased", lang="it")
text = "il 48% ha risposto che avrebbe dovuto provenire dal proprio budget."
norm_text = normalizer.normalize(text, punct_post_process=True)
# Prints the input unchanged: "48%" is not verbalized
print(norm_text)

Expected output: No digits in the normalized text.

Actual output: 'il 48% ha risposto che avrebbe dovuto provenire dal proprio budget.'

Additional Examples:

Other examples with similar behavior, in the format (text, normalized_text):

[('Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.',
  'Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.'),
 ('Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.',
  'Aber die Tatsache, dass andere Leute bieten nur 800.000 zu diesem Zeitpunkt der Marktpreis ist auch 800.000.'),
 ('Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.',
  'Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.'),
 ('Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.',
  'Ich gebe Ihnen ein anderes Beispiel: Wenn Sie einmal unseren OPP sprechen und ich gebe Ihnen auf der Stelle 1.000 Dollar.'),
 ('Il y a 1,08 milliard de vaches dans le monde qui émettent 18% des émissions de carbone.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone'),
 ('Ci sono 1,08 miliardi di mucche nel mondo che emettono il 18% delle emissioni di carbonio.',
  'Il y a un virgule zéro huit milliard de vaches dans le monde qui émettent 18% des émissions de carbone.')]

Expected Behavior: The normalized text should not contain any digits.

Actual Behavior: Digits are retained in the normalized output, which contradicts the expected behavior of a text normalization tool. The issue is intermittent: some numbers are normalized while others in the same sentence are not. This is particularly problematic for tasks that require clean, digit-free text, such as grapheme-to-phoneme (g2p) conversion.
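To make the check above reproducible, here is a minimal sketch of how leftover digits can be detected in normalized output. The helper name leftover_digits is hypothetical and not part of nemo_text_processing; it only inspects strings, so it can be run on the outputs listed above without the library installed.

```python
import re


def leftover_digits(normalized: str) -> list[str]:
    """Return every run of digits still present in the normalized text."""
    return re.findall(r"\d+", normalized)


# Outputs copied from the examples in this issue.
outputs = [
    "il 48% ha risposto che avrebbe dovuto provenire dal proprio budget.",
    "Il y a un virgule zéro huit milliard de vaches dans le monde "
    "qui émettent 18% des émissions de carbone",
]

for out in outputs:
    digits = leftover_digits(out)
    if digits:
        print(f"unnormalized digits {digits} in: {out}")
```

An empty list means the text is digit-free and safe to pass on to g2p.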

Environment:

nemo_text_processing version: 0.3.0rc0
Python version: 3.11.8

zoobereq commented 1 month ago

We are addressing the issue with the % not normalizing in Italian and French. This fix will be available shortly and will also cause the numbers in these languages to normalize correctly.
We are aware of h (as in 20h) and some other units not normalizing in French and are working to address that.
We are aware of period-separated numbers not normalizing in German (numbers without period-separators normalize correctly). We are working to address that as well.
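Until the German fix is available, one possible stopgap is to strip the period separators before calling the normalizer, since numbers without them are reported above to normalize correctly. The following is only a sketch under that assumption: strip_thousands_periods is a hypothetical helper, not part of nemo_text_processing, and it treats any group of one to three digits followed by period-separated groups of exactly three digits as a thousands-separated number.

```python
import re

# One to three leading digits, then one or more ".ddd" groups.
# Assumption: in German text this pattern is a thousands-separated integer.
_THOUSANDS = re.compile(r"\b\d{1,3}(?:\.\d{3})+\b")


def strip_thousands_periods(text: str) -> str:
    """Remove period thousands separators, e.g. '800.000' -> '800000'."""
    return _THOUSANDS.sub(lambda m: m.group(0).replace(".", ""), text)


text = "Ich gebe Ihnen auf der Stelle 1.000 Dollar."
print(strip_thousands_periods(text))
```

The result can then be passed to normalizer.normalize as usual. Dates such as 01.01.2024 are left untouched because their groups are not exactly three digits long, but other period-separated values that happen to match the pattern would be altered, so this is best limited to text where that is acceptable.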

github-actions[bot] commented 2 days ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

NVIDIA / NeMo-text-processing

Digits Remain Unnormalized in European Languages Output #171