Create a dictionary of obscene words

datafaker-net / datafaker

Generating fake data for the JVM (Java, Kotlin, Groovy) has never been easier!

https://www.datafaker.net

Apache License 2.0


Create a dictionary of obscene words #1272

Closed asolntsev closed 2 months ago

asolntsev commented 3 months ago

Is your feature request related to a problem? Please describe.

I suggest creating a dictionary of obscene words. It may be useful for testing spam filters, blog prettifiers, etc.

I can provide the list in English, Estonian and Russian.

Describe the solution you'd like

It could look like this:

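  import java.util.Locale;
  import net.datafaker.Faker;

  // Proposed API: dictionary().obsent() is the method suggested in this request.
  Faker enFaker = new Faker(new Locale("en", "US"));
  String enWord = enFaker.dictionary().obsent(); // fuck | shit | ass

  Faker ruFaker = new Faker(new Locale("ru", "RU"));
  String ruWord = ruFaker.dictionary().obsent(); // жопа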

Additional context

It would also be a very motivating feature for people to submit pull requests. Imagine they need to add "fuck" in their own languages. The easiest and funniest PR ever possible! :)

kingthorin commented 3 months ago

Did you truncate the method name on purpose? Shouldn't it be obscenity?

Anyway yes I support this idea.

Though I (we) would have zero ability to review other languages.

Also, we should keep in mind (perhaps even document) that while some words/phrases are just "bad", some are truly hateful/hurtful, and that distinction is hard to nail down (even in languages that you are familiar with).

bodiam commented 3 months ago

Just a word of caution: we recently had a production issue with another faker library that generated some offensive language, which ended up in a customer demo. That was a bit of an unfortunate experience.

Also, I'd be hesitant to include words that are too offensive or likely to cause offense, such as racial references. I would prefer to keep this library as positive as possible; there's nothing stopping people from writing their own faker for cases like this.

kingthorin commented 3 months ago

Good point, it is also a perfect case for a custom faker with their own yaml or whatever.

snuyanzin commented 2 months ago

how about not only obscene words but also obscene expressions?

however yes, keeping it on a more positive side also makes sense

bodiam commented 2 months ago

> how about not only obscene words but also obscene expressions?

What could possibly go wrong here....

I'm not sure what domains you work in, but in the domains where I work, showing these kinds of results could be very damaging to the business. There's nothing stopping you from building your own custom faker if you really need it, but let's keep Datafaker G or PG rated please.

bodiam commented 2 months ago

@snuyanzin @asolntsev if you're realllllllllly keen, you can always use Fallout quotes for the spam filtering:

https://github.com/datafaker-net/datafaker/blob/5a4aa0f8db734ded1b7a8869c6a2502623e69efc/src/main/resources/en/fallout.yml#L109

asolntsev commented 2 months ago

@bodiam Sorry, I don't understand how. To test a spam filter, I need a provider that reliably generates obscene words. Fallout quotes don't suit because only some of them contain obscene words. How can such a test work reliably?

kingthorin commented 2 months ago

Just use a custom provider with your own yaml, then there are no concerns for the project.
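
For anyone landing here later, the suggestion above could be sketched roughly as follows. This is a plain-Java stand-in and deliberately does not use Datafaker's provider API: the class name BlocklistWords, its methods, and the placeholder word lists are all hypothetical. A real setup would presumably load the team's own list from a YAML file that lives outside the project. The fixed seed is what makes every run return the same words, which addresses the stability concern raised earlier.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

/** Hypothetical stand-in for a custom provider; not part of Datafaker. */
public class BlocklistWords {
    // Placeholder entries (assumption): replace with your own privately maintained list.
    private static final Map<String, List<String>> WORDS = Map.of(
            "en", List.of("badword1", "badword2", "badword3"),
            "ru", List.of("плохоеслово1", "плохоеслово2"));

    private final List<String> words;
    private final Random random;

    public BlocklistWords(Locale locale, long seed) {
        List<String> found = WORDS.get(locale.getLanguage());
        if (found == null) {
            throw new IllegalArgumentException("No word list for locale " + locale);
        }
        this.words = found;
        // A fixed seed makes the generated sequence reproducible across test runs.
        this.random = new Random(seed);
    }

    /** Returns a word that is always taken from the list, never anything else. */
    public String next() {
        return words.get(random.nextInt(words.size()));
    }

    /** Lets a test check that a generated word really belongs to the list. */
    public boolean contains(String word) {
        return words.contains(word);
    }

    public static void main(String[] args) {
        BlocklistWords first = new BlocklistWords(Locale.forLanguageTag("en-US"), 42L);
        BlocklistWords second = new BlocklistWords(Locale.forLanguageTag("en-US"), 42L);
        for (int i = 0; i < 5; i++) {
            String word = first.next();
            // Both instances share a seed, so they produce the same word at each step.
            System.out.println(word + " matches second instance: " + word.equals(second.next()));
        }
    }
}
```

Because every generated value comes from the list, a spam-filter test can assert that each sample is flagged, and the word list itself stays out of the shared library.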