acrosman commented 2 years ago

To create a more diverse naming provider, we need to think carefully through how to find, curate, and use lists of names. There are several known pitfalls to avoid (and certainly more than are listed here):

Lists of names pulled from government records reflect the social biases of their time. This includes factors like:
- Reflecting out-of-date demographics: as a minority population grows, names common in that group become more common overall, but they lag behind in a database that reflects name use over several decades.
- Reflecting technology limitations: US name data used to drop non-Latin alphabetic characters and force upper casing, changing the names of millions of people in the database and then in practice.
- Reflecting the bias of government employees: US immigration data is known to have changed people's names during the immigration process, sometimes reverting later, sometimes not.
Frequency-based selection approaches will minimize the inclusion of various minority groups by definition.
Some communities emphasize consistent spelling more than others, often further biasing frequency-based approaches.
The common gender use pattern of names can change from one generation to the next for some names but not others (this may make a name a good candidate for nonbinary names).
Some communities, particularly refugees, are often both forced to switch their names and choose to make changes to assimilate.
Having unusually long or short names in a testing data set is very useful for testing, even though those names may be less common in the general population.
Some communities are more attuned to issues of inclusion, and there may be benefits to users from those communities in seeing diversity represented, even if that means oversampling names from minority communities.
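
To make the frequency point concrete, here is a minimal sketch comparing frequency-weighted and uniform selection. The names and counts are placeholders invented for illustration (not real data), and both helper functions are hypothetical.

```python
def weighted_probabilities(counts):
    """Chance of picking each name when selection is proportional to its count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def uniform_probabilities(counts):
    """Chance of picking each name when every distinct name is equally likely."""
    return {name: 1 / len(counts) for name in counts}


# Invented counts: one dominant name, one mid-frequency name, one rare name.
example_counts = {"Name A": 9000, "Name B": 900, "Name C": 100}

weighted = weighted_probabilities(example_counts)
uniform = uniform_probabilities(example_counts)

for name in example_counts:
    print(f"{name}: weighted={weighted[name]:.2%} uniform={uniform[name]:.2%}")
```

Under weighting, the rare placeholder name is picked about 1% of the time, while under uniform selection it is picked a third of the time. Neither is "correct"; they simply encode different biases.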

There will be no perfect solution and any solution will reflect the biases of the creators. However, we should be able to make progress and understand the biases our data set reflects and why. That will hopefully make it easier to improve further in the future.

prescod commented 2 years ago

Great thoughts!

Perhaps one way, maybe the only way, to deal with the intrinsic bias is to be persona/scenario based.

For example: "June" is an SFDO Partner in Chicago demoing an NPSP extension package to a potential customer, "Luíza."

"June" wants her database to consist of names that would be familiar to New York City residents.

By stipulating all of this, we are introducing several biases, but at least they are explicit and give us a North Star to aim for. A problematic alternative is that everyone has a different persona in mind (probably themselves) and the whole thing is entirely subjective: "I recognize that name, I don't recognize THAT name," etc.

I propose that we aren't trying to be fair in the sense that everyone has an equally likely chance of seeing their name in the dataset. We are trying to be "realistic" in a subjective sense so that a minimum number of people are distracted by the biases in the dataset.

acrosman commented 2 years ago

Notes from May '22 Sprint:

Goals:

Increase overall representation of communities in the US and CA via the EN US provider.
Be as clear and honest about biases reflected as we know how to be.
- Minimize bias when possible
Have as large a list of names as possible.

What fields are we trying to keep diverse?

First name
Middle name
Last Name
Title

Ideas for new name lists:

Universities: particularly recent graduate lists from large public institutions
Community Organizations: particularly participant lists from recent years.
Determine the largest list of names that still allows reasonable performance, then fill it by randomly selecting from an even larger list instead of taking the most common names. Rebuild it with every release of the fakers.
- This would offset many kinds of bias, but it levels the playing field completely. For example, it risks the data looking odd if too many of the selected names are ones that are nearly unique in the US.
- Perhaps a combination of approaches: filter out names appearing fewer than something like 10 times and then select evenly.
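
The combined approach could be sketched roughly as below. This is only an illustration: the function names, the `min_count` default of 10 (taken from the "something like 10 times" suggestion), and the sample data are assumptions, not part of any existing provider.

```python
import random
from collections import Counter


def build_name_pool(observed_names, min_count=10):
    """Keep each distinct name that appears at least min_count times."""
    counts = Counter(observed_names)
    # Sorting keeps the pool stable across runs for the same input.
    return sorted(name for name, n in counts.items() if n >= min_count)


def pick_names(pool, k, seed=None):
    """Select k names evenly (uniformly, with replacement) from the pool."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(k)]


# Invented sample: two names pass the threshold, one does not.
observed = ["Ana"] * 12 + ["Bo"] * 10 + ["Zed"] * 3
pool = build_name_pool(observed)
picks = pick_names(pool, k=5, seed=1)
```

Once a name clears the threshold its original frequency no longer matters, which is the trade-off described above: rare-but-real names are kept, while one-off entries are dropped.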

Processes to ensure anonymity from those sources:

With name data separated into segments, group by common names and test for matches by spelling or Soundex.
Request lists of names to be delivered separately (first names in one file, middle names in a second, last names in a third, etc.)
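
As one possible reading of the Soundex idea, the sketch below groups names that share a phonetic code so that spelling variants can be reviewed together. It implements the standard American Soundex rules; the grouping helper is hypothetical. Note that Soundex is English-centric and this version silently drops non-ASCII letters, which is itself a bias to be aware of.

```python
from collections import defaultdict

# Standard American Soundex consonant codes.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name):
    """Return the four-character Soundex code for name ('' if no ASCII letters)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the run, so a repeated code counts again
    return (encoded + "000")[:4]


def group_by_soundex(names):
    """Map each Soundex code to the set of spellings that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[soundex(name)].add(name)
    return dict(groups)


groups = group_by_soundex(["Smith", "Smyth", "Jones"])
```

Here "Smith" and "Smyth" land in the same group while "Jones" stays separate, so a reviewer could check whether a spelling is a one-off variant of a more common name.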

SFDO-Community-Sprints / faker_person_diverse

Discussion: Options for good naming schemes #2
