Closed davidks13 closed 4 months ago
Hello I have written Armenian Inverse Text Normalization for NeMo Text Processing module. It would've been great to integrate a Text Normalization for Armenian in this package.
I have added Inverse Normalization for cardinal numbers, ordinal numbers, time, measure and etc by using other languages as examples.
Also I have already written and tested the code both with sparrowhawk and pytest, during this step I've faced a problem where I git cloned project twice, the first time all the tests passed, the second time I've got this error where the project tries to import Armenian module from pip library instead of local. After that I've cloned project the third time and this time everything worked like at the first time. What can cause this problem?
I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without, and would like to hear your thoughts about it. Thank you!
@davidks13 thank you for the PR! Could you please fix CodeQL errors?
I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without
It is great that you have ITN ready and can also contribute TN. The test cases for TN and ITN are located in different subfolders - e.g. TN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_text_normalization and ITN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_inverse_text_normalization
pytests have separate tests for TN and ITN, e.g. https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py (ITN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L34 and same file TN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L60)
@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!
@davidks13 Ooh, Armenian ITN. Thanks for the contribution!
For documentation purposes, could you clarify if you intend this for Western or Eastern dialects (or just which language knowledge you're working off of)? (I know there's some overlap, but it's been a few years since I've worked on the language.)
@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!
I had a look at English test files and saw no context there and did the same for Armenian. I will add context for a few examples. Can you clarify what kind of context is needed? (one word after/before class or two)
@davidks13 Ooh, Armenian ITN. Thanks for the contribution!
For documentation purposes, could you clarify if you intend this for Western or Eastern dialects (or just which language knowledge you're working off of)? (I know there's some overlap, but it's been a few years since I've worked on the language.)
It is made for Eastern Armenian.
@davidks13 could you please add a few examples per class with context to make sure the pipeline works correctly as @Borismartirosyan pointed out? Thanks!
I had a look at English test files and saw no context there and did the same for Armenian. I will add context for a few examples. Can you clarify what kind of context is needed? (one word after/before class or two)
@davidks13 indeed such examples are a small fraction of test cases, just to make sure the grammar handles not only standalone semiotic classes but also cases where they're part of a sentence. Examples - https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/data_inverse_text_normalization/test_cases_time.txt#L28 or https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/data_text_normalization/test_cases_date.txt#L46 [you can add random text before/after semiotic tokens]. Thanks!
@davidks13 thank you for the PR! Could you please fix CodeQL errors?
I would like to add that Text Normalization for Armenian is ready too, but the issue is both TN and ITN use same pytest files so I don’t know how to make TN pull request, whether with ITN code or without
It is great that you have ITN ready and can also contribute TN. The test cases for TN and ITN are located in different subfolders - e.g. TN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_text_normalization and ITN cases for Eng: https://github.com/NVIDIA/NeMo-text-processing/tree/main/tests/nemo_text_processing/en/data_inverse_text_normalization
pytests have separate tests for TN and ITN, e.g. https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py (ITN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L34 and same file TN: https://github.com/NVIDIA/NeMo-text-processing/blob/main/tests/nemo_text_processing/en/test_cardinal.py#L60)
I made a PR for Armenian TN. Thank you! https://github.com/NVIDIA/NeMo-text-processing/pull/137
@ekmb Pytests and sparrowhawk tests all pass on my local. Code base looks stable and all requested changes are good on my end.
@davidks13 Merging this into main. Thanks a lot for the contribution! Really good work and love seeing some Armenian support out here. I'll be getting to the TN throughout the day.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Before your PR is "Ready for review"
Pre checks:
git commit -s
to sign.pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
). 2) Sparrowhawk testsbash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest
and Sparrowhawk here.__init__.py
for every folder and subfolder, includingdata
folder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
to all newly added Python files?Copyright 2015 and onwards Google, Inc.
. See an example here.try import: ... except: ...
) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.