NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0
242 stars, 77 forks

Profanity filtering for ITN - EN #86

Closed. gayu-thri closed this 11 months ago

gayu-thri commented 1 year ago

What does this PR do?

This PR adds a new feature to ITN (EN) that filters profane words. With it, profane words in the input text are redacted with the * symbol.
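
A minimal sketch of the redaction behaviour described above, written in plain Python rather than as the WFST grammar the PR adds. The word list, the function name `redact`, and the choice to mask every character of a match are assumptions made for illustration; the PR's actual grammar and word list are not shown in this thread.

```python
import re

# Hypothetical placeholder list; the PR's real word list is not shown here.
PROFANE_WORDS = ["darn", "heck"]


def redact(text, words=PROFANE_WORDS, mask="*"):
    """Replace each listed word with mask characters of the same length."""
    if not words:
        return text
    # Longest entries first, so a longer entry wins over one of its prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # \b keeps the match on whole words, so "heckle" is left untouched.
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: mask * len(m.group(0)), text)


print(redact("well heck that was a darn good game"))
# prints: well **** that was a **** good game
```

Masking to the original length is only one possible convention; a fixed-length mask or a partial mask would work the same way.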

gayu-thri commented 11 months ago

Following up on this: the suggested changes were made a few weeks ago, but the PR has not been merged yet.

If any more changes are needed before merging, please let me know.

mgrafu commented 11 months ago

After reviewing this PR, we have decided not to merge it for the following reasons:

  1. The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.
  2. Conceptually, this type of filtering is not a TN/ITN task. If a user wanted to filter profanity, chances are that it would already have been filtered in the audio; thus, it would not appear in the text before ITN in the first place. Otherwise, the filtering would most likely be addressed further downstream.
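
To illustrate the first point, here is a rough sketch of how (keyword, transformation) pairs in a tab-separated data file can cover redaction alongside ordinary whitelist entries. It is plain Python, not the library's whitelist class. The column order (spoken form, then written form), the sample entries, and the helper names `load_pairs` and `apply_whitelist` are assumptions made for this example.

```python
import csv
import io

# Hypothetical file content: spoken form, TAB, written form.
# The last entry maps a word to its redacted form, as the first point suggests.
WHITELIST_TSV = "doctor\tDr.\nmisses\tMrs.\nheck\t****\n"


def load_pairs(tsv_text):
    """Parse tab-separated lines into a {spoken: written} dictionary."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Skip blank or malformed rows instead of failing on them.
    return {row[0]: row[1] for row in reader if len(row) == 2}


def apply_whitelist(text, pairs):
    """Replace each whitespace-separated token that has a whitelist entry."""
    return " ".join(pairs.get(token, token) for token in text.split())


pairs = load_pairs(WHITELIST_TSV)
print(apply_whitelist("heck doctor smith is late", pairs))
# prints: **** Dr. smith is late
```

In this sketch a redaction is just another pair in the file, so no separate grammar is needed for it.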

Thank you for your effort — we look forward to future contributions.

gayu-thri commented 11 months ago

Thank you for your effort — we look forward to future contributions.

Thanks. Sure.

  1. The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.

I'd like to clarify this. Isn't profanity filtering a different kind of transformation, one that does not apply to all whitelisted words?

Of course, we could add a predefined list of pairs to the whitelist, each with a spoken form and a written (filtered) form.

But if it has to be handled at the grammar level, wouldn't maintaining a separate classifier be better?
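
As a rough illustration of the predefined-pairs option, the spoken/written lines could be generated from a plain word list instead of being written by hand. The function name `whitelist_lines`, the sample words, and the spoken-then-written column order are assumptions made for this sketch, not part of the PR or the library.

```python
def whitelist_lines(words, mask="*"):
    """Yield 'spoken<TAB>written' lines, masking each word to its own length."""
    # Deduplicate and sort so the generated file is stable between runs.
    for word in sorted(set(words)):
        yield f"{word}\t{mask * len(word)}"


# Hypothetical placeholder words.
print("\n".join(whitelist_lines(["heck", "darn", "heck"])))
```

This keeps the word list as the single thing to maintain, whichever of the two approaches ends up consuming it.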