Open allisonletts opened 2 years ago
Here's an example dataset: https://datahub.io/JohnSnowLabs/baby-names-by-sex-and-mother-ethnic-group
This one is a little challenging to address at this layer of the processing. I have a couple of ideas, but really this is a Faker issue, though it might be addressable more gracefully in Snowfakery.
Ideas that come to mind:
In the end it really should be something that gets handled in Faker, but I have no idea how that project's maintainer(s) respond to these kinds of suggestions.
Well, I just logged it in faker proper, so... we'll find out. My inclination is that if faker doesn't take it up, putting it in snowfakery would probably make sense.
I might mess with it for a bit and see if I can make any headway. I feel like anything would be better than what we have now. @prescod let me know if you have ideas or suggestions for where this all should live.
The proposed solution from the Faker maintainer would overlap with the need for a good way to bring in other libraries being discussed here.
I think at least the faker provider would need to be a community provider because I can't find a useful dataset that's based on government statistics.
Here's our current plan: we'll create our own name provider that starts with a broader range of names. The goal is to keep the original name datasets themselves outside of Snowfakery, although we might play with name distributions.
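To make the "play with name distributions" part concrete, here's a rough sketch of loading names from an external file and sampling them by weight. It uses only the standard library and is not Snowfakery or Faker code. The `name,count` column layout, the sample rows, the function names, and the `flatten` knob are all assumptions made up for illustration.

```python
import csv
import io
import random

# Hypothetical sample rows; a real dataset would live in a file outside Snowfakery.
SAMPLE_CSV = """name,count
Aaliyah,120
Mateo,95
Mei,60
Oluwaseun,25
"""


def load_name_weights(csv_file):
    """Read rows with assumed 'name' and 'count' columns into two parallel lists."""
    names, weights = [], []
    for row in csv.DictReader(csv_file):
        names.append(row["name"])
        weights.append(int(row["count"]))
    return names, weights


def pick_names(names, weights, k, flatten=0.0, rng=None):
    """Draw k names with replacement.

    flatten=0.0 samples in proportion to the raw counts;
    flatten=1.0 makes every name equally likely (each weight becomes count ** 0).
    """
    rng = rng or random.Random()
    adjusted = [w ** (1.0 - flatten) for w in weights]
    return rng.choices(names, weights=adjusted, k=k)


names, weights = load_name_weights(io.StringIO(SAMPLE_CSV))
sample = pick_names(names, weights, k=5, flatten=0.5, rng=random.Random(0))
```

Keeping the loader separate from the sampler means the dataset can be swapped without touching the distribution logic, which fits the goal of keeping the name data itself out of Snowfakery.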
The names provided by the default en_US Faker provider are the top 200 names from the 1960s-1990s in the US, per the Social Security Administration. That data is going to skew verrrrry white. I'd like to be populating Salesforce with a dataset that sounds more diverse than that.