NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

Russian inverse text normalization is broken for numbers less than 10 #146

Closed: vikcost closed this issue 3 months ago

vikcost commented 4 months ago

Example:

from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inv_norm = InverseNormalizer(lang='ru')

inv_norm.normalize('тридцать')
'30' 

inv_norm.normalize('три')
'три' # expected '3'

inv_norm.normalize('два')
'два' # expected '2'
ekmb commented 4 months ago

This is intended behaviour: numbers <10 are kept in their spoken form. You can comment out https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/inverse_text_normalization/ru/taggers/cardinal.py#L44 to avoid this.
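
For anyone hitting this later, here is a toy pynini sketch of the filtering pattern described above (illustrative only, not the actual NeMo Russian grammar): the spoken forms of numbers below ten are subtracted from the input side of the cardinal graph, so they fail to match and are passed through verbatim.

# Toy illustration only -- not the NeMo grammar.
import pynini

# A tiny spoken-form -> digits transducer.
digits = pynini.string_map([("один", "1"), ("два", "2"), ("три", "3")])
tens = pynini.string_map([("тридцать", "30"), ("сорок", "40")])
cardinal = (digits | tens).optimize()

# The filter in question: drop bare sub-10 numerals from the accepted inputs.
filtered = (
    pynini.project(cardinal, "input") - pynini.project(digits, "input").arcsort()
) @ cardinal

def apply(fst, text):
    lattice = pynini.compose(text, fst)
    if lattice.num_states() == 0:
        return text  # no match -> keep the verbatim (spoken) form
    return pynini.project(pynini.shortestpath(lattice), "output").string()

apply(filtered, 'тридцать')  # '30'
apply(filtered, 'три')       # 'три' (kept in spoken form)
apply(cardinal, 'три')       # '3' (what you get with the filter removed)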

vikcost commented 4 months ago

Thanks, I figured this out too. I wonder what the intended use case for such behavior is?

One might expect (inverse) normalization to behave uniformly across languages.

vikcost commented 4 months ago

Nevertheless, even with the suggested change, inverse normalization of ordinals is error-prone.

третий год -> третий год # expected '3-й год'
тридцать третий час -> 33 час # expected '33-й час'
ekmb commented 4 months ago

> Nevertheless, even with the suggested change, inverse normalization of ordinals is error-prone.
>
> третий год -> третий год # expected '3-й год'
> тридцать третий час -> 33 час # expected '33-й час'

https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/inverse_text_normalization/ru/taggers/ordinal.py#L42 should be commented out for ordinals too.
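
For completeness, here is what the reporter expects once both referenced lines are commented out in a local checkout (expected outputs are the ones quoted above; not verified here):

from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inv_norm = InverseNormalizer(lang='ru')

inv_norm.normalize('третий год')
# expected '3-й год'

inv_norm.normalize('тридцать третий час')
# expected '33-й час'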

ekmb commented 4 months ago

> Thanks, I figured this out too. I wonder what the intended use case for such behavior is?
>
> One might expect (inverse) normalization to behave uniformly across languages.

The motivation is to avoid normalization for cases like "one of us".
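
For a quick illustration in English (assuming default settings, where short cardinals are kept in their spoken form):

from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inv_norm_en = InverseNormalizer(lang='en')

inv_norm_en.normalize('one of us')
# 'one of us', not '1 of us'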

Hannan-Komari commented 3 months ago

The same problem exists in English. What should be done to resolve it in English?

vikcost commented 3 months ago

@ekmb thanks for the example.

Intuitively, it should be possible to build a graph that accounts for cases such as "one of us" and returns the identity without any inverse normalization.

On the other hand, it is also perfectly reasonable to expect "one of us" -> "1 of us".

ekmb commented 3 months ago

> The same problem exists in English. What should be done to resolve it in English?

You'd need to replace this line with self.graph = graph
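
For reference, a hypothetical, simplified sketch of the pattern used in the English cardinal tagger (illustrative names, not the verbatim NeMo source); the suggestion above amounts to assigning the unfiltered graph instead of the filtered one:

import pynini

graph = pynini.string_map([("one", "1"), ("two", "2"), ("twenty", "20")]).optimize()
graph_exception = pynini.union("one", "two")  # spoken forms kept as-is by default

# Default behaviour: bare small cardinals are subtracted from the graph's input side.
graph_filtered = (pynini.project(graph, "input") - graph_exception.arcsort()) @ graph

# Replacing that assignment with the plain graph (i.e. self.graph = graph in the
# tagger) makes small numbers convert as well.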