bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars, 48 forks

Remove some characters from Chinese bad word list #391

Closed JetRunner closed 2 years ago

JetRunner commented 2 years ago

The character "性" means "sex" or "gender", but it is also a suffix that turns a noun into an adjective, and it appears in many ordinary words. For example: 一次性 (one-time), 性质 (characteristic), 性别 (gender), and 重复性 (repetitive).

The other removals are similar. The list may be too aggressive for filtering, so I removed many characters that can form normal words. Removing every piece of content that contains these characters could be catastrophic for the corpus.
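
To illustrate the concern, here is a minimal sketch that assumes the filter flags any text containing a list entry as a substring. The function name `flagged_by_substring` and the sample lists are hypothetical, chosen only for illustration, and this is not the repository's actual filtering code.

```python
def flagged_by_substring(text, bad_entries):
    """Return the list entries that occur anywhere in `text` (naive substring match)."""
    return [entry for entry in bad_entries if entry in text]


# Ordinary words taken from the examples in this comment.
benign_words = ["一次性", "性质", "性别", "重复性"]

# A hypothetical bad word list that contains the single character "性".
bad_entries = ["性"]

# Every benign word is flagged because each one contains the character.
false_positives = [w for w in benign_words if flagged_by_substring(w, bad_entries)]
print(false_positives)
```

Under this assumption, all four ordinary words would be flagged, which is why single characters that form common words are risky entries in such a list.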

HugoLaurencon commented 2 years ago

Hello @JetRunner and thank you for your help!

Did you remove words according to the guidelines we defined in section 2.6 of this PDF?

The list was indeed really bad to begin with, and I’m surprised that only a couple of words were removed. For example, “13” is still there and shouldn’t be.

Would it be possible to check this, please? Thank you in advance!

JetRunner commented 2 years ago

That's a good reference. I'll remove more entries according to the guidelines.

JetRunner commented 2 years ago

@HugoLaurencon Done

HugoLaurencon commented 2 years ago

Thank you!