SFDO-Community-Sprints / Snowfakery-Recipe-Templates

Repo for all contributed Snowfakery recipes, maintained by Data Gen Toolkit team.
BSD 3-Clause "New" or "Revised" License

Add diversity to person names #29

Open allisonletts opened 2 years ago

allisonletts commented 2 years ago

The names provided by the default en_US faker provider are the top 200 names from the 1960s-1990s in the US, per the Social Security Administration. That data is going to skew verrrrry white. I'd like to populate Salesforce with a dataset that sounds more diverse than that.

allisonletts commented 2 years ago

Here's an example dataset: https://datahub.io/JohnSnowLabs/baby-names-by-sex-and-mother-ethnic-group

acrosman commented 2 years ago

This one is a little challenging to address at this layer of the processing. I have a couple of ideas, but really this is a faker issue that might be addressed more gracefully in Snowfakery.

Ideas that come to mind:

  1. We could add a recipe, for people who want a workaround, that uses an approach like the international recipe's: pull in names from a wider set of faker locales to get more diversity that way. It would mean that names would not get translated into their common English forms, so it might create as many issues as it solves.
  2. We could move this issue to Snowfakery to encourage the creation of a name processor that does a version of the first idea, but more formally and more stably than a workaround in a recipe would. @prescod is that a reasonable conversation to have there?
  3. We could create a plugin for Snowfakery and/or Faker that replaces the existing name provider for people who want a different set of names. This would be similar to the nonprofit name generator I created a few months ago or your higher ed faker service. That's a little distracting for a project that's still figuring out how to get a test runner set up at all, but possible. It might also be the kind of thing that happens independently of this project, with this project consuming the result.

In the end it really should be something that gets handled in Faker, but I have no idea how that project's maintainer(s) respond to these kinds of suggestions.

allisonletts commented 2 years ago

Well, I just logged it in faker proper, so... we'll find out. My inclination is that if faker doesn't take it up, putting it in snowfakery would probably make sense.

I might mess with it for a bit and see if I can make any headway. I feel like anything would be better than what we have now. @prescod let me know if you have ideas or suggestions for where this all should live.

acrosman commented 2 years ago

The proposed solution from the Faker maintainer would overlap with the need for a good way to bring in other libraries being discussed here.

allisonletts commented 2 years ago

I think at least the faker provider would need to be a community provider because I can't find a useful dataset that's based on government statistics.

allisonletts commented 2 years ago

Here's our current plan: we'll create our own name provider that starts with a broader range of names. The goal is to keep the original name datasets themselves outside of Snowfakery, although we might play with name distributions.
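
A minimal sketch of that plan, using only the Python standard library. It is not the provider itself. It only illustrates keeping the name data in an external file and adjusting the distribution through weights. The CSV layout (`name,weight`), the helper names `load_weighted_names` and `pick_names`, and the sample rows are all assumptions made up for illustration; the weights are placeholders, not real statistics.

```python
import csv
import io
import random

# Hypothetical stand-in for an external name dataset. In practice this would
# be a file kept outside Snowfakery. Rows and weights are illustrative
# placeholders, not real statistics.
SAMPLE_CSV = """name,weight
Aaliyah,3
Diego,3
Mei,2
Priya,2
Kwame,1
"""


def load_weighted_names(csv_file):
    """Read name/weight rows from an open CSV file object."""
    names, weights = [], []
    for row in csv.DictReader(csv_file):
        names.append(row["name"])
        weights.append(float(row["weight"]))
    return names, weights


def pick_names(names, weights, count, seed=None):
    """Draw `count` names with replacement, proportionally to their weights."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.choices(names, weights=weights, k=count)


# io.StringIO stands in for open("names.csv") so the sketch is self-contained.
names, weights = load_weighted_names(io.StringIO(SAMPLE_CSV))
sample = pick_names(names, weights, count=10, seed=42)
print(sample)
```

Changing the name distribution then means editing the weights in the data file rather than the code, and a function like `pick_names` could later be wrapped in a Faker or Snowfakery provider.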