fzaninotto / Faker

Faker is a PHP library that generates fake data for you
MIT License

[Feature request] Add profanity and rude words generator #2070

Closed ebihimself closed 3 years ago

ebihimself commented 3 years ago

I would like to request a rude-words generator. The use case is testing profanity detectors.

The interface I have in mind would look something like this:

$faker = Faker\Factory::create('en_US');

$faker->randomDirtyWords($numberOfWords = 2) ;  // ["Fuck", "Whore"]

$faker->dirtyText($length = 9, $numberOfRudeWords = 2);  // Roses are red violets are blue, fuck you asshole.

$faker->slugedDirtyWords($numberOfWords = 2);  // ["f_u_c_k", "f*ck"]

$faker->dirtySafeText($textLength = 11);  // Roses are red violets are blue, and we love each other.

If this feature is accepted for development, I can implement it.

Nyholm commented 3 years ago

Could you elaborate why this is needed?

I mean, if you want to test profanity detectors, should you really use randomized data? You don’t want your tests to randomly fail.

ebihimself commented 3 years ago

> Could you elaborate why this is needed?

As I mentioned in the description, I am developing a codebase that detects dirty words in text and performs some action on them (filtering them out or slugging them). For that purpose, I need to write tests against the detectors to ensure that the validator does its job correctly. After coming across a few approaches, I arrived at Faker and wanted to know whether it makes sense to have such a generator in this library.

Nyholm commented 3 years ago

Sorry, I posted too quickly. I updated my post. I think you missed it:

I mean, if you want to test profanity detectors, should you really use randomized data? You don’t want your tests to randomly fail.

I am just a happy user of this project and I don’t decide anything, but I have a hard time understanding the scenario in which this is needed.

If you want to test your profanity detector, then you should use a unit test with non-random data.
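A minimal sketch of what a test with non-random data could look like. The `containsFlaggedWord()` function and the block list are hypothetical placeholders for the detector under test; they are not part of Faker.

```php
<?php
declare(strict_types=1);

// Hypothetical detector: flags a text when any token is in a block list.
// The function name and the word list are assumptions for illustration only.
function containsFlaggedWord(string $text, array $blockList): bool
{
    // Lowercase the text and split on anything that is not a letter.
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return count(array_intersect($tokens, $blockList)) > 0;
}

$blockList = ['darn', 'heck'];

// Fixed inputs with known expected results, so any failure is reproducible.
$cases = [
    'Roses are red, violets are blue.' => false,
    'Well, darn it.'                   => true,
    'What the HECK is this?'           => true,
];

foreach ($cases as $text => $expected) {
    if (containsFlaggedWord($text, $blockList) !== $expected) {
        throw new RuntimeException("Unexpected result for: $text");
    }
}
```

Because every input is written out, a failing case points at exactly one sentence, which is the property random data would take away.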

ebihimself commented 3 years ago

> Could you elaborate why this is needed?
>
> I mean, if you want to test profanity detectors, should you use randomized data? You don’t want your tests to randomly fail.

Yep, you are right. But I'm not going to write unit tests that run a random dirty word against the detector and expect the result to be true or false. Instead, I'm going to write a test that expects the detector to guess the Euclidean distance between a normal text and a text that contains dirty words. I want to generate some random words using Faker and expect the detector to return a number, let's say greater than or equal to 0.41. If it does, I can assume that it's working; otherwise it needs more data to be trained enough.
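One way to keep a threshold test like this from failing at random is to seed the generator, so the "random" samples are the same on every run. The sketch below uses plain PHP (`mt_srand()`, `mt_rand()`, `shuffle()`) rather than Faker; Faker's generator also exposes a `seed()` method for the same purpose. The `flaggedScore()` function is a hypothetical stand-in for the real model, and the word lists are placeholders; only the 0.41 threshold comes from the description above.

```php
<?php
declare(strict_types=1);

// Hypothetical score standing in for the real model:
// the share of tokens that appear in the flagged list (0.0 to 1.0).
function flaggedScore(string $text, array $flagged): float
{
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($tokens) === 0) {
        return 0.0;
    }
    $hits = count(array_filter($tokens, fn (string $t): bool => in_array($t, $flagged, true)));

    return $hits / count($tokens);
}

// Build a sample text from randomly drawn flagged and neutral words.
function randomSample(array $flagged, array $clean, int $flaggedCount, int $cleanCount): string
{
    $words = [];
    for ($i = 0; $i < $flaggedCount; $i++) {
        $words[] = $flagged[mt_rand(0, count($flagged) - 1)];
    }
    for ($i = 0; $i < $cleanCount; $i++) {
        $words[] = $clean[mt_rand(0, count($clean) - 1)];
    }
    shuffle($words);

    return implode(' ', $words);
}

// Placeholder word lists (assumptions, not a real profanity corpus).
$flagged = ['darn', 'heck', 'drat'];
$clean   = ['roses', 'are', 'red', 'violets', 'blue'];

mt_srand(1234); // fixed seed: the same samples are generated on every run

for ($run = 0; $run < 20; $run++) {
    $sample = randomSample($flagged, $clean, 3, 3);
    $score  = flaggedScore($sample, $flagged);
    if ($score < 0.41) {
        throw new RuntimeException("Score $score below threshold for: $sample");
    }
}
```

The assertion here checks a property that should hold for any draw (half of the words are flagged, so the score is above the threshold), and the seed makes any failure repeatable.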

Nyholm commented 3 years ago

Thank you for the context.

Isn’t the Euclidean distance something you measure rather than guess? If you are training a model, you should not use randomized data. But I’m sure you have your reasons for doing this.

But I still fail to understand why this is a benefit for the library.

With no voting power, I’m 👎

ebihimself commented 3 years ago

> Thank you for the context.
>
> Isn’t the Euclidean distance something you measure rather than guess? If you are training a model, you should not use randomized data. But I’m sure you have your reasons for doing this.
>
> But I still fail to understand why this is a benefit for the library.
>
> With no voting power,

By "guess" I meant "measure". Also, I wanted to use randomized data to test the model, not to train it.

Anyway, thanks for your wisdom. I will close the issue, as it seems this is not a good approach to the challenge. I may need to look at it from another direction.