NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0
242 stars 76 forks source link

Armenian itn #136

Closed davidks13 closed 4 months ago

davidks13 commented 5 months ago

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

PR Type:

If you haven't finished some of the above items you can still open "Draft" PR.

davidks13 commented 5 months ago

Hello I have written Armenian Inverse Text Normalization for NeMo Text Processing module. It would've been great to integrate a Text Normalization for Armenian in this package.

I have added Inverse Normalization for cardinal numbers, ordinal numbers, time, measure and etc by using other languages as examples.

Also I have already written and tested the code both with sparrowhawk and pytest, during this step I've faced a problem where I git cloned project twice, the first time all the tests passed, the second time I've got this error where the project tries to import Armenian module from pip library instead of local. After that I've cloned project the third time and this time everything worked like at the first time. What can cause this problem?

I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without, and would like to hear your thoughts about it. Thank you!

ekmb commented 5 months ago

@davidks13 thank you for the PR! Could you please fix CodeQL errors?

I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without

It is great that you have ITN ready and can also contribute TN. The test cases for TN and ITN are located in different subfolders - e.g. TN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_text_normalization and ITN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_inverse_text_normalization

pytests have separate tests for TN and ITN, e.g. https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py (ITN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L34 and same file TN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L60)

ekmb commented 5 months ago

@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!

tbartley94 commented 5 months ago

@davidks13 Ooh, Armenian ITN. Thanks for the contribution!

For documentation purposes, could you clarify if you intend this for Western or Eastern dialects (or just which language knowledge you're working off of)? (I know there's some overlap, but it's been a few years since I've worked on the language.)

davidks13 commented 5 months ago

@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!

I had a look at English test files and saw no context there and did the same for Armenian. I will add context for a few examples. Can you clarify what kind of context is needed? (one word after/before class or two)

davidks13 commented 5 months ago

@davidks13 Ooh, Armenian ITN. Thanks for the contribution!

For documentation purposes, could you clarify if you intend this for Western or Eastern dialects (or just which language knowledge you're working off of)? (I know there's some overlap, but it's been a few years since I've worked on the language.)

It is made for Eastern Armenian.

ekmb commented 5 months ago

@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!

I had a look at English test files and saw no context there and did the same for Armenian. I will add context for a few examples. Can you clarify what kind of context is needed? (one word after/before class or two)

@davidks13 indeed such examples are a small fraction of test cases, just to make sure the grammar handles not only standalone semiotic classes but also cases where they're part of a sentence. Examples - https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/data_inverse_text_normalization/test_cases_time.txt#L28 or https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/data_text_normalization/test_cases_date.txt#L46 [you can add random text before/after semiotic tokens]. Thanks!

davidks13 commented 5 months ago

@davidks13 thank you for the PR! Could you please fix CodeQL errors?

I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without

It is great that you have ITN ready and can also contribute TN. The test cases for TN and ITN are located in different subfolders - e.g. TN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_text_normalization and ITN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_inverse_text_normalization

pytests have separate tests for TN and ITN, e.g. https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py (ITN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L34 and same file TN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L60)

I made a PR for Armenian TN. Thank you! https://github.com/NVIDIA/NeMo-text-processing/pull/137

tbartley94 commented 4 months ago

@ekmb Pytests and sparrowhawk tests all pass on my local. Code base looks stable and all requested changes are good on my end.

tbartley94 commented 4 months ago

@davidks13 Merging this into main. Thanks a lot for the contribution! Really good work and love seeing some Armenian support out here. I'll be getting to the TN throughout the day.