Closed JetRunner closed 2 years ago
Hello @JetRunner and thank you for your help!
Have you removed words according to the guidelines we defined in section 2.6 of this PDF?
The list was indeed quite bad to begin with, so I'm surprised only a couple of words were removed. For example, “13” is still there and shouldn't be.
Would it be possible to check this please? Thank you in advance!
That's a good reference. I'll remove more entries according to the guidelines.
@HugoLaurencon Done
Thank you!
The character "性" means sex or gender, but it is also a suffix that turns a noun into an adjective, and it appears in many ordinary words. For example: one-time (一次性), characteristic (性质), gender (性别), repetitive (重复性), etc.
The other removals are similar. The list may be too aggressive for filtering, so I removed many characters that can form ordinary words. Removing any content that contains these characters from the corpus could be catastrophic.
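To illustrate the concern, here is a minimal sketch. It assumes the filter flags a text whenever any list entry occurs as a substring, which is a guess at how such a list would be applied to unsegmented Chinese text, not the project's actual filtering code. The function name and both word lists are hypothetical and chosen only for the demonstration.

```python
def contains_flagged(text, flagged):
    """Return True if any flagged entry occurs as a substring of text."""
    return any(term in text for term in flagged)


# Hypothetical lists for illustration; not the actual list from this PR.
aggressive_list = ["性"]    # a single character that is also a common suffix
trimmed_list = ["色情"]     # a multi-character entry kept after trimming

# Ordinary words from the comment above that contain the character "性".
benign_words = ["一次性", "性质", "性别", "重复性"]

# Every ordinary word is flagged when the single character is on the list.
hits_aggressive = [w for w in benign_words if contains_flagged(w, aggressive_list)]

# None of them is flagged once the single character is removed from the list.
hits_trimmed = [w for w in benign_words if contains_flagged(w, trimmed_list)]

print(len(hits_aggressive), len(hits_trimmed))
```

Under this assumption, a single-character entry flags every ordinary word that happens to contain it, which is why such characters were removed from the list.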