Predict Race and Ethnicity Based on the Sequence of Characters in a Name
http://ethnicolr.readthedocs.io
MIT License

ethnicolr: Predict Race and Ethnicity From Name

.. image:: https://github.com/appeler/ethnicolr/workflows/test/badge.svg
    :target: https://github.com/appeler/ethnicolr/actions?query=workflow%3Atest
.. image:: https://img.shields.io/pypi/v/ethnicolr.svg
    :target: https://pypi.python.org/pypi/ethnicolr
.. image:: https://anaconda.org/soodoku/ethnicolr/badges/version.svg
    :target: https://anaconda.org/soodoku/ethnicolr/
.. image:: https://static.pepy.tech/badge/ethnicolr
    :target: https://www.pepy.tech/projects/ethnicolr

We exploit the US census data, the Florida voter registration data, and the Wikipedia data collected by Skiena and colleagues to predict race and ethnicity based on first and last name or just the last name. The granularity at which we predict race depends on the dataset. For instance, Skiena et al.'s Wikipedia data is at the ethnic group level, while the census data we use in the model distinguishes only between Non-Hispanic Whites, Non-Hispanic Blacks, Asians, and Hispanics (the raw data has the additional categories of Native Americans and Bi-racial).

New Package With New Models in PyTorch

https://github.com/appeler/ethnicolr2

Streamlit App

https://ethnicolr.streamlit.app/

Caveats and Notes

If you picked a person at random with the last name 'Smith' in the US in 2010 and asked us to guess this person's race (as measured by the census), the best guess would be based on what is available from the aggregated Census file. It is the Bayes optimal solution. So what are last-name-only predictive models good for? A few things: imputing race and ethnicity for last names that are not in the census file, inferring race and ethnicity in years other than those in which the census was conducted (if some assumptions hold), inferring the race of people in other countries (if some assumptions hold), and so on. The biggest benefit comes in cases where both the first name and last name are known.
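
To make the first point concrete, here is a minimal sketch of guessing from an aggregated table. The shares below are made-up illustrative numbers, not actual Census figures, and the category keys and the ``best_guess`` helper are hypothetical; neither is part of ethnicolr.

.. code-block:: python

    # Illustrative shares only; these are NOT real Census figures.
    AGGREGATE = {
        "smith": {"nh_white": 0.70, "nh_black": 0.23, "asian": 0.01, "hispanic": 0.03},
        "garcia": {"nh_white": 0.05, "nh_black": 0.01, "asian": 0.01, "hispanic": 0.92},
    }


    def best_guess(last_name, table=AGGREGATE):
        """Return the category with the largest share, or None if the name is absent."""
        shares = table.get(last_name.strip().lower())
        if shares is None:
            # The name is not in the aggregated file; this is one case
            # where a predictive model can still offer a guess.
            return None
        return max(shares, key=shares.get)


    print(best_guess("Smith"))
    print(best_guess("Zzyzx"))

A lookup like this cannot say anything about names missing from the table, which is the gap a model based on character sequences is meant to fill.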

Install

We strongly recommend installing ethnicolr inside a Python virtual environment (see the `venv documentation <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`__).

::

    pip install ethnicolr

Or

::

    conda install -c soodoku ethnicolr

Notes:

General API

To see the available command-line options for any function, run ``<function-name> --help``. For example:

::

    census_ln --help

    usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input

    Appends Census columns by last name

    positional arguments:
      input                 Input file

    optional arguments:
      -h, --help            show this help message and exit
      -y {2000,2010}, --year {2000,2010}
                            Year of Census data (default=2000)
      -o OUTPUT, --output OUTPUT
                            Output file with Census data columns
      -l LAST, --last LAST  Name of the column containing the last name

Examples

To append census data from 2010 to a `file with column header in the first row <ethnicolr/data/input-with-header.csv>`__, specify the column name carrying last names using the ``-l`` option, keeping the rest the same:

::

    census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv

To predict race/ethnicity using the `Wikipedia full name model <ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb>`__, specify the last-name and first-name columns using the ``-l`` and ``-f`` flags, respectively:

::

    pred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv
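
Both commands above expect a CSV whose first row names the columns. A minimal sketch of producing such a file with Python's standard library follows. Only the column names ``last_name`` and ``first_name`` come from the commands above; the two rows are made-up examples, and ``write_input`` is a hypothetical helper, not part of ethnicolr.

.. code-block:: python

    import csv

    # Made-up example rows, for illustration only.
    ROWS = [
        {"last_name": "smith", "first_name": "john"},
        {"last_name": "zhang", "first_name": "simon"},
    ]


    def write_input(path, rows):
        """Write rows to a CSV file with a header in the first row."""
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=["last_name", "first_name"])
            writer.writeheader()
            writer.writerows(rows)


    if __name__ == "__main__":
        write_input("input-with-header.csv", ROWS)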

Functions

We expose 6 functions, each of which takes either a pandas DataFrame or a CSV file.

Application

To illustrate how the package can be used, we impute the race of the campaign contributors recorded by the FEC for the years 2000 and 2010 and tally campaign contributions by race.

Data on the race of all the people in the `DIME data <https://data.stanford.edu/dime>`__ is posted `here <http://dx.doi.org/10.7910/DVN/M5K7VR>`__. The underlying Python scripts are posted `here <https://github.com/appeler/dime_race>`__.
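
The tallying step can be sketched as follows. The records, the ``race`` and ``amount`` field names, and the ``tally_by_race`` helper are hypothetical and chosen only for illustration; in practice, ``race`` would come from one of the package's prediction functions.

.. code-block:: python

    from collections import defaultdict

    # Made-up contribution records with an already-imputed race label.
    RECORDS = [
        {"race": "nh_white", "amount": 250.0},
        {"race": "hispanic", "amount": 100.0},
        {"race": "nh_white", "amount": 50.0},
    ]


    def tally_by_race(records):
        """Sum contribution amounts within each imputed race category."""
        totals = defaultdict(float)
        for rec in records:
            totals[rec["race"]] += rec["amount"]
        return dict(totals)


    print(tally_by_race(RECORDS))

This sketch assigns each contribution to a single category. An alternative, if predicted probabilities are available, would be to split each amount across categories in proportion to those probabilities.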

Data

In particular, we utilize the last-name--race data from the `2000 census <http://www.census.gov/topics/population/genealogy/data/2000_surnames.html>`__ and `2010 census <http://www.census.gov/topics/population/genealogy/data/2010_surnames.html>`__, the `Wikipedia data <ethnicolr/data/wiki/>`__ collected by Skiena and colleagues, and the Florida voter registration data from early 2017.

Evaluation

  1. SCAN Health Plan, a Medicare Advantage plan that serves over 200,000 members throughout California, used the software to better assess racial disparities in health among the people they serve. They had racial data on only about 47% of their members, so they used the software to impute the race of the remaining 53%. On the data for which they had labels, they found an AUC of 0.9 and 83% accuracy for the last name model.

  2. Evaluation on NC Data: https://github.com/appeler/nc_race_ethnicity
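
The kind of check described in the first item, comparing predictions against known labels on the labeled subset, can be sketched as below. The labels are made up and the ``accuracy`` helper is hypothetical; it is not the code used in either evaluation.

.. code-block:: python

    def accuracy(true_labels, predicted_labels):
        """Share of records whose predicted label matches the known label."""
        if len(true_labels) != len(predicted_labels):
            raise ValueError("label lists must have the same length")
        if not true_labels:
            raise ValueError("need at least one labeled record")
        hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
        return hits / len(true_labels)


    # Made-up labels for four records, for illustration only.
    TRUTH = ["asian", "hispanic", "nh_white", "nh_black"]
    PREDS = ["asian", "hispanic", "nh_white", "nh_white"]

    print(accuracy(TRUTH, PREDS))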

Authors

Suriyan Laohaprapanon, Gaurav Sood and Bashar Naji

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the `Contributor Code of Conduct <http://contributor-covenant.org/version/1/0/0/>`__.

License

The package is released under the `MIT License <https://opensource.org/licenses/MIT>`__.