danparshall / namegraph

namegraph

Building a directed graph of family relationships from name data

The dataset consists of public records from a country in Latin America, scraped from a government website. Technically this isn't PII in the country of origin, but this repo still treats it as such.

In Latin America, one's legal/formal last name (or apellido) consists of two parts: the father's last name (which is passed on to children) and the mother's last name (which is not). Unlike in the USA, women don't change their legal name upon marriage. This pattern gives us a chance to link the records together and build a graph of family relationships. In addition to the citizen's full legal name, we typically also have entries for the parents' names (although these are often in "social" format rather than "legal" format). The goal of this repo is to parse the records, identify each citizen's apellidos, and link each record to the records of that citizen's parents.

For each citizen, we have the following information:

| field | comment |
| --- | --- |
| cedula | Basically a citizen ID number. Hashed for privacy. |
| nombre | Citizen's name, almost always in the standard legal format |
| dt_birth | |
| nombre_padre | |
| nombre_madre | |
| dt_death | |
| marital_status | Either SOLTERO, CASADO, DIVORCIADO, VIUDO (Single, Married, Divorced, Widowed) |
| dt_marriage | |
| nombre_spouse | |
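
As a rough illustration of the record layout above, the fields could be held in a small typed structure. This is only a sketch: `CitizenRecord`, `MaritalStatus`, and `record_from_row` are hypothetical names, not part of this repo, and the choice to keep dates as plain strings and to treat every field except `cedula` and `nombre` as optional is an assumption.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Optional


class MaritalStatus(Enum):
    SOLTERO = "SOLTERO"        # Single
    CASADO = "CASADO"          # Married
    DIVORCIADO = "DIVORCIADO"  # Divorced
    VIUDO = "VIUDO"            # Widowed


@dataclass
class CitizenRecord:
    """One row of the dataset (hypothetical container, field names from the table)."""
    cedula: str                 # hashed citizen ID
    nombre: str                 # usually in legal format
    dt_birth: Optional[str] = None
    nombre_padre: Optional[str] = None
    nombre_madre: Optional[str] = None
    dt_death: Optional[str] = None
    marital_status: Optional[MaritalStatus] = None
    dt_marriage: Optional[str] = None
    nombre_spouse: Optional[str] = None


def record_from_row(row: dict) -> CitizenRecord:
    """Build a record from a dict, mapping missing or blank values to None."""
    values = {}
    for f in fields(CitizenRecord):
        raw = (row.get(f.name) or "").strip()
        values[f.name] = raw or None
    if values["cedula"] is None or values["nombre"] is None:
        raise ValueError("cedula and nombre are required")
    if values["marital_status"] is not None:
        values["marital_status"] = MaritalStatus(values["marital_status"])
    return CitizenRecord(**values)
```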

The standard legal format for names is: patronym matronym firstname middlename. This contrasts with the "social format", which puts the prenames first. The citizen's name is almost always in the legal format, but the other names (parents and spouse) vary widely. Sometimes they are in standard legal form, but it's also common for a name to be firstname patronym (i.e., in social format, with only one surname).
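
To make the legal format concrete, here is a minimal sketch of splitting a legal-format name into its parts. It assumes each apellido is a single token, which real data does not guarantee; `ParsedName`, `parse_legal_name`, and `to_social_format` are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple


class ParsedName(NamedTuple):
    patronym: str
    matronym: str
    prenames: List[str]


def parse_legal_name(nombre: str) -> ParsedName:
    """Split 'patronym matronym firstname [middlename ...]' on whitespace.

    Assumes single-token surnames; multi-part surnames need extra handling.
    """
    tokens = nombre.split()
    if len(tokens) < 3:
        raise ValueError(f"expected at least 3 tokens, got {len(tokens)}: {nombre!r}")
    return ParsedName(tokens[0], tokens[1], tokens[2:])


def to_social_format(parsed: ParsedName) -> str:
    """Reorder to the social format: prenames first, then both surnames."""
    return " ".join(parsed.prenames + [parsed.patronym, parsed.matronym])
```

A name with fewer than three tokens (for example the common firstname patronym form) raises an error here, so such entries would have to be routed to a separate, more tolerant parser.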

Example

Let's consider how the Simpson family would look in this data. (Family tree chart taken from https://simpsons.fandom.com/wiki/Simpson_family?file=Simpsons_possible_family_tree.jpg)

Homer's parents are Abraham Simpson & Mona Olsen, so Homer's apellidos would be "Simpson Olsen", and his full name would be Homer Jay Simpson Olsen (written "Simpson Olsen Homer Jay" in legal format).

Marge's parents are Clancy Bouvier & Jacqueline Gurney. In Spanish-style naming, Marge would retain her maiden name (Bouvier), and pass it on to her children. Socially, she might be known as "Marge de Simpson", but on legal documents she would be Marjorie Jacqueline Bouvier Gurney.

| nombre | dt_birth | nombre_padre | nombre_madre | marital_status | dt_marriage | nombre_spouse |
| --- | --- | --- | --- | --- | --- | --- |
| Simpson Olsen Homer Jay | 1956/05/12 | Abe Simpson | Mona Olsen | CASADO | 1981/09/29 | Marge Bouvier |
| Bouvier Gurney Marjorie Jacqueline | 1956/10/01 | Bouvier Clancy | Gurney Jacqueline | CASADO | 1981/09/29 | Simpson Homer |
| Simpson Bouvier Bart Jojo | 1981/04/01 | Simpson Homer | Bouvier Marge | SOLTERO | | |
| Simpson Bouvier Lisa Marie | 1983/05/09 | Homer Simpson | Marge Bouvier | SOLTERO | | |
| Simpson Bouvier Margaret Evelyn | 1989/01/01 | Simpson Homer Jay | Bouvier Marjorie | SOLTERO | | |
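
The linking idea can be sketched on these example rows. This is not the repo's actual method, just one possible heuristic under stated assumptions: the citizen's `nombre` is in legal format with single-token surnames, the father's record has the child's patronym as its own patronym, the mother's record has the child's matronym as its patronym, and prenames are compared by a crude three-letter prefix (so "Marge" matches "Marjorie"). All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (nombre, nombre_padre, nombre_madre) taken from the example rows above.
ROWS = [
    ("Simpson Olsen Homer Jay", "Abe Simpson", "Mona Olsen"),
    ("Bouvier Gurney Marjorie Jacqueline", "Bouvier Clancy", "Gurney Jacqueline"),
    ("Simpson Bouvier Bart Jojo", "Simpson Homer", "Bouvier Marge"),
    ("Simpson Bouvier Lisa Marie", "Homer Simpson", "Marge Bouvier"),
    ("Simpson Bouvier Margaret Evelyn", "Simpson Homer Jay", "Bouvier Marjorie"),
]


def split_legal(nombre: str) -> Tuple[str, str, List[str]]:
    """Return (patronym, matronym, prenames), lowercased."""
    tokens = nombre.lower().split()
    return tokens[0], tokens[1], tokens[2:]


def prenames_match(prenames: List[str], free_tokens: List[str], n: int = 3) -> bool:
    """True if any prename shares its first n letters with any free token."""
    return any(p[:n] == t[:n] for p in prenames for t in free_tokens)


def find_parent(parent_field: str, surname: str, child: str) -> Optional[str]:
    """Find the unique record whose patronym is `surname` and whose prenames
    resemble the remaining tokens of the parent field, in any order."""
    tokens = parent_field.lower().split()
    if surname not in tokens:
        return None
    free = [t for t in tokens if t != surname]
    hits = []
    for nombre, _, _ in ROWS:
        if nombre == child:
            continue  # a record cannot be its own parent
        patronym, _, prenames = split_legal(nombre)
        if patronym == surname and prenames_match(prenames, free):
            hits.append(nombre)
    return hits[0] if len(hits) == 1 else None  # ambiguous or missing -> None


def build_links() -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each child's nombre to (father's nombre, mother's nombre)."""
    links = {}
    for nombre, padre, madre in ROWS:
        patronym, matronym, _ = split_legal(nombre)
        links[nombre] = (
            find_parent(padre, patronym, nombre),
            find_parent(madre, matronym, nombre),
        )
    return links
```

Each non-None entry corresponds to a directed parent-to-child edge in the family graph. On real data this heuristic would also need birth dates (a parent must be older than the child) to avoid confusing siblings with parents, and a stronger name comparison than a prefix match.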

There are obviously plenty of complications here: the order of apellidos and nombres isn't consistent, nicknames appear in place of formal names, etc. Real data includes these issues, as well as multi-part names (such as "Van der Graff"), spelling substitutions ("Espinoza" vs "Espinosa"), and dedications ("del Nino Jesus").
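
A few of these complications can be approached with simple normalization, sketched below. The Setup section lists `unidecode` and `fuzzywuzzy`; to stay dependency-free this sketch uses the standard library's `unicodedata` and `difflib` as stand-ins, which is a substitution and not necessarily what the repo does. The particle list and all function names are assumptions for illustration.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List

# Assumed (incomplete) set of surname particles that attach to the next token.
PARTICLES = {"de", "del", "la", "las", "los", "van", "der", "von"}


def normalize(name: str) -> str:
    """Strip accents, uppercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.upper().split())


def group_particles(tokens: List[str]) -> List[str]:
    """Merge runs of particles with the token that follows them,
    so 'Van der Graff' becomes a single name unit."""
    grouped, buffer = [], []
    for tok in tokens:
        if tok.lower() in PARTICLES:
            buffer.append(tok)
        else:
            grouped.append(" ".join(buffer + [tok]))
            buffer = []
    if buffer:  # trailing particles with nothing to attach to
        grouped.append(" ".join(buffer))
    return grouped


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With this, "Espinoza" and "Espinosa" score well above unrelated surnames, so a threshold on `similarity` could flag likely spelling variants. Dedications such as "del Nino Jesus" are not handled by particle grouping alone and would need their own rule.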

===============================

Setup

Basic packages:

`conda create -n namegraph python=3 pandas ipython matplotlib unidecode fuzzywuzzy tqdm jupyter`

Project Organization


├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience