zh TN is very slow and bad accuracy

NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html

Apache License 2.0

242 stars, 77 forks

zh TN is very slow and bad accuracy #115

Closed: lifeiteng closed this issue 2 months ago

lifeiteng commented 8 months ago

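Normalizing one simple zh-CN sentence takes 1.32 sec, and the result is wrong: the input is returned unchanged. For comparison, the English examples below take 0.02 sec each.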

>python normalize.py --text="123" --language=en
INFO:NeMo-text-processing:one hundred and twenty three
WARNING:NeMo-text-processing:Execution time: 0.02 sec

>python normalize.py --text="我出生于1998年7月22日" --language=zh
INFO:NeMo-text-processing:我出生于1998年7月22日
WARNING:NeMo-text-processing:Execution time: 1.32 sec

>python normalize.py --text="I'm born in 22/3/1990" --language=en
INFO:NeMo-text-processing:I'm born in the twenty second of march nineteen ninety
WARNING:NeMo-text-processing:Execution time: 0.02 sec

ekmb commented 8 months ago

@BuyuanCui could you please take a look?

BuyuanCui commented 8 months ago

This seems to be related to the existing TN bug: it is not able to process a whole sentence. It will be fixed by the PR that I'm working on.

ekmb commented 8 months ago

@lifeiteng a few options to speed up:

use --cache_dir
use normalize_list() https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize.py#L75
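
A rough sketch of how the two suggestions might be combined from Python instead of the CLI. It assumes the Normalizer constructor accepts cache_dir and that normalize_list() accepts batch_size and n_jobs, as in the normalize.py linked above; check the version you have installed. The helper names build_normalizer and normalize_corpus and the ./tn_cache path are illustrative and not part of NeMo. The idea is to create the Normalizer once, so grammars are compiled or loaded from the cache a single time, and then reuse it for many sentences instead of paying the start-up cost on every normalize.py call.

```python
import time
from typing import List, Tuple


def build_normalizer(lang: str = "zh", cache_dir: str = "./tn_cache"):
    """Create a Normalizer once; cache_dir lets later runs reuse compiled grammars.

    The import is done lazily so the rest of this sketch can be used
    (for example with a stub object) without NeMo-text-processing installed.
    """
    from nemo_text_processing.text_normalization.normalize import Normalizer

    return Normalizer(input_case="cased", lang=lang, cache_dir=cache_dir)


def normalize_corpus(
    normalizer, texts: List[str], batch_size: int = 32, n_jobs: int = 1
) -> Tuple[List[str], float]:
    """Normalize many sentences with one normalize_list() call and time it.

    `normalizer` is any object exposing normalize_list(texts, batch_size=..., n_jobs=...).
    Returns the normalized sentences and the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    results = normalizer.normalize_list(texts, batch_size=batch_size, n_jobs=n_jobs)
    elapsed = time.perf_counter() - start
    return list(results), elapsed


if __name__ == "__main__":
    # The first run may be slow while grammars are built and written to cache_dir.
    normalizer = build_normalizer(lang="zh")
    sentences = ["我出生于1998年7月22日"] * 100
    outputs, seconds = normalize_corpus(normalizer, sentences)
    print(outputs[0])
    print(f"{seconds / len(sentences):.4f} sec per sentence")
```

Timing only the normalize_list() call separates per-sentence cost from grammar start-up cost, which the single-sentence CLI timings in the report above do not distinguish.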

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

lsrami commented 2 months ago

This seems to be related to the existing TN bug. It was not able to process a whole sentence. It will be fixed with the PR that I'm working.

Has this problem been solved? It still occurs in version 0.3.0.

ekmb commented 2 months ago

@BuyuanCui https://github.com/NVIDIA/NeMo-text-processing/pull/112

riqiang-dp commented 2 months ago

I've found that the TN FST is slow regardless of language (English too). It is not very practical with large data even using multiprocessing (normalize_list()). Any other ways to speed it up?

ekmb commented 2 months ago

@riqiang-dp we recommend Sparrowhawk for deployment https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/text_normalization/wfst/wfst_text_processing_deployment.html