gayu-thri commented 1 year ago

What does this PR do ?

This PR adds a new ITN - EN feature for filtering profane words. With this feature, profane words in the input text are redacted with the * symbol.
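To make the intended behavior concrete, here is a minimal plain-Python sketch of redacting listed words with the * symbol. It is not the grammar from this PR: the word list is a hypothetical placeholder, and masking every character of a matched word is an assumption, since the description does not say how many * symbols are emitted.

```python
import re

# Hypothetical placeholder word list; the PR's actual list is not shown here.
PROFANE_WORDS = {"darn", "heck"}

# One case-insensitive pattern that matches any listed word on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in sorted(PROFANE_WORDS)) + r")\b",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace every character of each listed word with '*' (assumed masking style)."""
    return _PATTERN.sub(lambda m: "*" * len(m.group(0)), text)


print(redact("well heck that was a Darn good try"))
# prints: well **** that was a **** good try
```

The word-boundary anchors keep longer words that merely contain a listed word (for example "heckle") unchanged.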

Before your PR is "Ready for review"

Pre checks:

[x] Have you signed your commits? Use git commit -s to sign.
[x] Do all unittests finish successfully before sending PR? 1) pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')). 2) Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
[x] If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
[ ] Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
[ ] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
[x] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
[ ] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
[x] Remove import guards (try import: ... except: ...) if not already done.
[ ] If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
[ ] Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

[x] New Feature
[ ] Bugfix
[ ] Documentation
[ ] Test

If you haven't finished some of the above items you can still open "Draft" PR.

gayu-thri commented 11 months ago

Following up on this: the suggested changes were made a few weeks ago, and the PR has not been merged yet.

If any more changes need to be made before merging, please let me know.

mgrafu commented 11 months ago

After reviewing this PR, we have decided not to merge it for the following reasons:

The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.
Conceptually, this type of filtering is not a TN/ITN task. If a user wanted to filter profanity, chances are that it would already have been filtered in the audio; thus, it would not appear in the text before ITN in the first place. Otherwise, the filtering would most likely be addressed further downstream.
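The whitelist alternative described in the first reason can be sketched roughly as follows. This is a plain-Python illustration of the idea of (keyword, transformation) pairs stored in a tab-separated data file, not the actual whitelist class; the column order (spoken form, then written form), the sample rows, and the helper names `load_pairs` and `apply_whitelist` are all assumptions.

```python
import csv
import io

# Hypothetical whitelist rows, tab-separated: spoken form, then written form.
# Both the column order and the entries are assumptions for illustration.
WHITELIST_TSV = "darn\t****\nheck\t****\ndoctor\tdr.\n"


def load_pairs(tsv_text: str) -> dict[str, str]:
    """Parse tab-separated (keyword, transformation) rows into a lookup table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) == 2}


def apply_whitelist(text: str, pairs: dict[str, str]) -> str:
    """Replace each whitespace-separated token that has an entry in the table."""
    return " ".join(pairs.get(token, token) for token in text.split())


pairs = load_pairs(WHITELIST_TSV)
print(apply_whitelist("doctor smith said heck", pairs))
# prints: dr. smith said ****
```

Under this approach, profanity masking is just another set of rows in the same table as ordinary whitelist entries such as abbreviations.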

Thank you for your effort — we look forward to future contributions.

gayu-thri commented 11 months ago

Thank you for your effort — we look forward to future contributions.

Thanks. Sure.

The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.

I'd like to clarify this. Isn't profanity filtering a different kind of transformation, one that does not apply to all whitelisted words?

Of course, we could add a predefined list of pairs, each with a spoken form and a written (filtered) form, to the whitelist.

But if it has to be handled at the grammar level, wouldn't maintaining a separate classifier be better?
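For reference, the predefined list of pairs mentioned above could be generated from a word list along these lines. This is only a sketch: the word list, the full-length * mask, and the spoken-then-written column order are assumptions, and `make_pairs` and `to_tsv` are hypothetical helper names.

```python
def make_pairs(words, mask="*"):
    """Pair each spoken word with a same-length mask as its written (filtered) form."""
    return [(word, mask * len(word)) for word in words]


def to_tsv(pairs):
    """Serialize pairs as tab-separated rows, one pair per line."""
    return "".join(f"{spoken}\t{written}\n" for spoken, written in pairs)


# Hypothetical placeholder words standing in for a real profanity list.
rows = make_pairs(["darn", "shucks"])
print(to_tsv(rows), end="")
```

Every filtered word needs its own row in this scheme, whereas a dedicated classifier could apply one masking rule to the whole list.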

NVIDIA / NeMo-text-processing

Profanity filtering for ITN - EN #86
