davidks13 commented 5 months ago

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

[x] Have you signed your commits? Use git commit -s to sign.
[x] Do all unittests finish successfully before sending PR? 1) pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')). 2) Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
[x] If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
[x] Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
[ ] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
[x] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
[x] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
[x] Remove import guards (try import: ... except: ...) if not already done.
[] If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
[x] Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

[x] New Feature
[ ] Bugfix
[ ] Documentation
[x] Test

If you haven't finished some of the above items you can still open "Draft" PR.

davidks13 commented 5 months ago

Hello I have written Armenian Inverse Text Normalization for NeMo Text Processing module. It would've been great to integrate a Text Normalization for Armenian in this package.

I have added Inverse Normalization for cardinal numbers, ordinal numbers, time, measure and etc by using other languages as examples.

Also I have already written and tested the code both with sparrowhawk and pytest, during this step I've faced a problem where I git cloned project twice, the first time all the tests passed, the second time I've got this error where the project tries to import Armenian module from pip library instead of local. After that I've cloned project the third time and this time everything worked like at the first time. What can cause this problem?

I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without, and would like to hear your thoughts about it. Thank you!

ekmb commented 5 months ago

@davidks13 thank you for the PR! Could you please fix CodeQL errors?

I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without

It is great that you have ITN ready and can also contribute TN. The test cases for TN and ITN are located in different subfolders - e.g. TN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_text_normalization and ITN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_inverse_text_normalization

pytests have separate tests for TN and ITN, e.g. https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py (ITN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L34 and same file TN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L60)

ekmb commented 5 months ago

@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!

tbartley94 commented 5 months ago

@davidks13 Ooh, Armenian ITN. Thanks for the contribution!

For documentation purposes, could you clarify if you intend this for Western or Eastern dialects (or just which language knowledge you're working off of)? (I know there's some overlap, but it's been a few years since I've worked on the language.)

davidks13 commented 5 months ago

@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!

I had a look at English test files and saw no context there and did the same for Armenian. I will add context for a few examples. Can you clarify what kind of context is needed? (one word after/before class or two)

davidks13 commented 5 months ago

@davidks13 Ooh, Armenian ITN. Thanks for the contribution!

For documentation purposes, could you clarify if you intend this for Western or Eastern dialects (or just which language knowledge you're working off of)? (I know there's some overlap, but it's been a few years since I've worked on the language.)

It is made for Eastern Armenian.

ekmb commented 5 months ago

@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!

I had a look at English test files and saw no context there and did the same for Armenian. I will add context for a few examples. Can you clarify what kind of context is needed? (one word after/before class or two)

@davidks13 indeed such examples are a small fraction of test cases, just to make sure the grammar handles not only standalone semiotic classes but also cases where they're part of a sentence. Examples - https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/data_inverse_text_normalization/test_cases_time.txt#L28 or https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/data_text_normalization/test_cases_date.txt#L46 [you can add random text before/after semiotic tokens]. Thanks!

davidks13 commented 5 months ago

@davidks13 thank you for the PR! Could you please fix CodeQL errors?
I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without
It is great that you have ITN ready and can also contribute TN. The test cases for TN and ITN are located in different subfolders - e.g. TN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_text_normalization and ITN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_inverse_text_normalization

pytests have separate tests for TN and ITN, e.g. https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py (ITN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L34 and same file TN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L60)

I made a PR for Armenian TN. Thank you! https://github.com/NVIDIA/NeMo-text-processing/pull/137

tbartley94 commented 4 months ago

@ekmb Pytests and sparrowhawk tests all pass on my local. Code base looks stable and all requested changes are good on my end.

tbartley94 commented 4 months ago

@davidks13 Merging this into main. Thanks a lot for the contribution! Really good work and love seeing some Armenian support out here. I'll be getting to the TN throughout the day.

NVIDIA / NeMo-text-processing

Armenian itn #136

What does this PR do ?

Before your PR is "Ready for review"