bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars, 48 forks

Update flagged_words.py #374

Status: Closed (majauhar closed this 2 years ago)

majauhar commented 2 years ago

Language: Hindi/Hi

Remark: These words are used in medical literature/advisories.

HugoLaurencon commented 2 years ago

Thank you @majauhar for your help! Have you also checked the other rules described in paragraph 2.6 of this doc? For example, words like "pissing", "pube", and "pussies" should be removed from your list. We should also remove every English word from the Hindi list. Let me know if you need more clarification!

majauhar commented 2 years ago

Hi @HugoLaurencon. Yes, I did read that doc before. If I understand correctly, I should also remove all the English-language words from the repo. However, there are other words that are Hindi words transliterated into English (Latin script). Should I remove those as well?

HugoLaurencon commented 2 years ago

Hi @majauhar, you only need to remove the English words from your own list ("hi"). Yes, I think you can remove the transliterated words too! Thank you.
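
For illustration only, the cleanup discussed above (dropping English and transliterated entries from the "hi" list) could be sketched roughly as follows. This is not code from the repository: the sample list, the function name `keep_devanagari_only`, and the idea of using the Devanagari Unicode block as the filter are all assumptions, and the real list in flagged_words.py may be structured differently.

```python
import re

# Hypothetical sample entries; NOT the real contents of flagged_words.py.
# The Latin-script items stand in for English and transliterated words.
sample_hi_words = ["pissing", "pube", "उदाहरण", "udaharan", "शब्द"]

# Devanagari Unicode block (U+0900 to U+097F), used here as an assumed
# heuristic for "this entry is written in Hindi script".
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def keep_devanagari_only(words):
    """Keep only entries that contain at least one Devanagari character."""
    return [w for w in words if DEVANAGARI.search(w)]


cleaned = keep_devanagari_only(sample_hi_words)
print(cleaned)
```

A script-based filter like this removes every Latin-script entry at once, so the result would still need a manual review against the other rules in the doc.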

majauhar commented 2 years ago

@HugoLaurencon I have removed those words.

HugoLaurencon commented 2 years ago

Thank you!