[Feature] Filter toxic words/sentences from a given text

Bytes-Explorer commented 2 months ago

Search before asking

[X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Implement a new feature to detect and eliminate toxic words or sentences from a given text. It should run on every row of a parquet file, where each row contains one document. For each row, the output should be True or False, added as a new output column, along with the span of the text where toxic words were detected.

This can be added as a new transform for text/NLP data. The code quality module can serve as a reference for how filters have been applied to code data.
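As a rough illustration of the requested output, here is a minimal sketch that uses only the Python standard library. It is not the data-prep-kit transform API. The word list, the function names, and the column names `contents`, `is_toxic`, and `toxic_spans` are all hypothetical. A plain word-list match stands in for whatever detector is eventually chosen, and the documents are passed in as a list of strings rather than read from a parquet file.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical placeholder lexicon; a real transform would load a curated list
# or call a classifier instead.
TOXIC_WORDS = {"idiot", "stupid"}


def build_pattern(words: Iterable[str]) -> "re.Pattern[str]":
    """Compile one case-insensitive, whole-word regex for all lexicon entries."""
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    if not escaped:
        raise ValueError("the word list must not be empty")
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def find_toxic_spans(text: str, pattern: "re.Pattern[str]") -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every lexicon match in text."""
    return [m.span() for m in pattern.finditer(text)]


def annotate_rows(docs: Iterable[str], pattern: "re.Pattern[str]") -> List[Dict]:
    """Build one output row per document with a Boolean flag and the spans."""
    rows = []
    for doc in docs:
        spans = find_toxic_spans(doc, pattern)
        rows.append(
            {
                "contents": doc,          # hypothetical input column name
                "is_toxic": bool(spans),  # hypothetical Boolean output column
                "toxic_spans": spans,     # hypothetical span output column
            }
        )
    return rows


if __name__ == "__main__":
    pattern = build_pattern(TOXIC_WORDS)
    for row in annotate_rows(["You are an idiot.", "Have a nice day."], pattern):
        print(row)
```

In an actual transform, the same per-document function would presumably be applied to the text column of each parquet table, with the two new columns appended to the output table.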

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

daw3rd commented 1 month ago

Is this a duplicate of #389?

Bytes-Explorer commented 1 month ago

No, it is not. There is some overlap between the hate and toxicity labels, but the literature also uses them as separate labels.

IBM / data-prep-kit