Building a directed graph of family relationships from name data
The dataset consists of public records from a country in Latin America, scraped from a government website. Technically this isn't PII in the country of origin, but this repo still treats it as such. In Latin America, one's legal/formal last name (or apellido) consists of two parts: the father's last name (which is passed on to children) and the mother's last name (which is not); unlike in the USA, women don't change their legal name upon marriage. This pattern lets us link records together and build a graph of family relationships. In addition to the citizen's full legal name, we also typically have entries for the parents (although these are often in "social" format rather than "legal" format). The goal of this repo is to parse the records, identify each citizen's apellidos, and link each record to the records of the parents.
For each citizen, we have the following information:

| field | comment |
|---|---|
| cedula | Basically a citizen ID number. Hashed for privacy. |
| nombre | Citizen's name, almost always in the standard legal format |
| dt_birth | |
| nombre_padre | |
| nombre_madre | |
| dt_death | |
| marital_status | Either SOLTERO, CASADO, DIVORCIADO, VIUDO (Single, Married, Divorced, Widowed) |
| dt_marriage | |
| nombre_spouse | |

The standard legal format for names is: `patronym matronym firstname middlename`. This contrasts with the "social format", which has the prenames first. The citizen's name is almost always in the legal format, but the other names vary widely. Sometimes the record will be in standard legal form, but it's also common for the name to be `firstname patronym` (i.e., in social format, and only one surname).
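As a rough illustration of the two layouts, the sketch below splits a name under each assumption. `ParsedName`, `parse_legal`, and `parse_social` are hypothetical names that do not come from this repo, and both functions assume single-token surnames, which real records often violate. The example names are borrowed from the Simpsons illustration that follows.

```python
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    patronym: str
    matronym: Optional[str]  # None when only one surname is recorded
    prenames: str


def parse_legal(nombre: str) -> ParsedName:
    """Assumes 'patronym matronym firstname [middlename ...]'."""
    tokens = nombre.split()
    if len(tokens) < 3:
        raise ValueError(f"expected at least 3 tokens, got {nombre!r}")
    return ParsedName(tokens[0], tokens[1], " ".join(tokens[2:]))


def parse_social(nombre: str) -> ParsedName:
    """Assumes 'firstname [middlename ...] patronym' with a single surname."""
    tokens = nombre.split()
    if len(tokens) < 2:
        raise ValueError(f"expected at least 2 tokens, got {nombre!r}")
    *prenames, patronym = tokens
    return ParsedName(patronym, None, " ".join(prenames))


citizen = parse_legal("Simpson Olsen Homer Jay")
spouse_field = parse_social("Homer Simpson")
```

Nothing in a parent or spouse field says which layout it uses, so a real parser would still need some way to decide between the two before calling either function.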
Let's consider how the Simpsons family would look in this data. The family tree is:
Homer's parents are Abraham Simpson & Mona Olsen, so Homer's apellidos would be "Simpson Olsen", and his full name would be Homer Jay Simpson Olsen.

Marge's parents are Clancy Bouvier & Jacqueline Gurney. In Spanish-style naming, Marge would retain her maiden name (Bouvier), and pass it on to her children. Socially, she might be known as "Marge de Simpson", but on legal documents she would be Marjorie Jacqueline Bouvier Gurney.

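The inheritance rule described above can be written down in a few lines. This is only a sketch: `child_apellidos` is a hypothetical helper, and it assumes each parent's apellidos are already available as a `(patronym, matronym)` pair.

```python
def child_apellidos(father_apellidos, mother_apellidos):
    """Each parent passes on only their own first (paternal) surname."""
    return (father_apellidos[0], mother_apellidos[0])


homer = ("Simpson", "Olsen")
marge = ("Bouvier", "Gurney")
kids = child_apellidos(homer, marge)  # apellidos shared by Bart, Lisa and Maggie
```

Read in reverse, this is what makes linking possible: a child's first apellido should equal the father's first apellido, and the child's second apellido should equal the mother's first apellido.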
| nombre | dt_birth | nombre_padre | nombre_madre | marital_status | dt_marriage | nombre_spouse |
|---|---|---|---|---|---|---|
| Simpson Olsen Homer Jay | 1956/05/12 | Abe Simpson | Mona Olsen | CASADO | 1981/09/29 | Marge Bouvier |
| Bouvier Gurney Marjorie Jacqueline | 1956/10/01 | Bouvier Clancy | Gurney Jacqueline | CASADO | 1981/09/29 | Simpson Homer |
| Simpson Bouvier Bart Jojo | 1981/04/01 | Simpson Homer | Bouvier Marge | SOLTERO | | |
| Simpson Bouvier Lisa Marie | 1983/05/09 | Homer Simpson | Marge Bouvier | SOLTERO | | |
| Simpson Bouvier Margaret Evelyn | 1989/01/01 | Simpson Homer Jay | Bouvier Marjorie | SOLTERO | | |

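To make the linking goal concrete, here is a minimal sketch that builds parent-to-child edges from the rows above. It is not this repo's implementation: the function names are hypothetical, the edge direction (parent to child) is an assumption, and matching is deliberately naive (order-insensitive, exact tokens only).

```python
from collections import defaultdict

# (nombre, nombre_padre, nombre_madre), copied from the table above.
RECORDS = [
    ("Simpson Olsen Homer Jay", "Abe Simpson", "Mona Olsen"),
    ("Bouvier Gurney Marjorie Jacqueline", "Bouvier Clancy", "Gurney Jacqueline"),
    ("Simpson Bouvier Bart Jojo", "Simpson Homer", "Bouvier Marge"),
    ("Simpson Bouvier Lisa Marie", "Homer Simpson", "Marge Bouvier"),
    ("Simpson Bouvier Margaret Evelyn", "Simpson Homer Jay", "Bouvier Marjorie"),
]


def tokens(name):
    """Lowercased set of name tokens, so word order does not matter."""
    return frozenset(name.lower().split())


def find_parent(parent_name, inherited_surname, records):
    """Records whose patronym is the surname the child inherited and
    whose legal name contains every token of the parent field."""
    wanted = tokens(parent_name)
    return [
        nombre
        for nombre, _, _ in records
        if nombre.split()[0].lower() == inherited_surname.lower()
        and wanted <= tokens(nombre)
    ]


def build_graph(records):
    """Return {parent nombre: [child nombre, ...]} (edges parent -> child)."""
    graph = defaultdict(list)
    for nombre, padre, madre in records:
        patronym, matronym = nombre.split()[:2]  # assumes legal format
        for parent_name, surname in ((padre, patronym), (madre, matronym)):
            candidates = find_parent(parent_name, surname, records)
            if len(candidates) == 1:  # link only unambiguous matches
                graph[candidates[0]].append(nombre)
    return dict(graph)


graph = build_graph(RECORDS)
```

On these five rows the sketch links Homer to all three children, but links Marge only to Margaret Evelyn, because "Marge" is not a token of her legal name. The patronym check also keeps "Gurney Jacqueline" from matching Marge's own record, whose tokens happen to include both words.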
There are obviously plenty of complications in this example: the order of apellidos and nombres isn't consistent, nicknames appear in place of legal first names, and so on. Real data includes these issues, as well as multi-part names (such as "Van der Graff"), spelling substitutions ("Espinoza" vs "Espinosa"), and dedications ("del Nino Jesus").
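A few of these complications can be softened with simple preprocessing. The sketch below is an assumption about how one might start, not a description of this repo's code: it uses the standard library's `unicodedata` and `difflib` as stand-ins for the `unidecode` and `fuzzywuzzy` packages in the environment, the `PARTICLES` set is illustrative rather than exhaustive, and the name "Van der Graff Lopez Ana" is made up for the example.

```python
import unicodedata
from difflib import SequenceMatcher

# Assumption: a small, illustrative set of surname particles.
PARTICLES = {"DE", "DEL", "LA", "LAS", "LOS", "VAN", "DER", "VON"}


def normalize(name):
    """Uppercase, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def similarity(a, b):
    """Ratio in [0, 1]; higher means the normalized strings are closer."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_particles(name):
    """Attach particles to the token that follows them, so a multi-part
    surname such as 'Van der Graff' stays together as one unit."""
    groups, pending = [], []
    for tok in normalize(name).split():
        pending.append(tok)
        if tok not in PARTICLES:
            groups.append(" ".join(pending))
            pending = []
    if pending:  # trailing particle with nothing after it
        groups.append(" ".join(pending))
    return groups


score = similarity("Espinoza", "Espinosa")
parts = group_particles("Van der Graff Lopez Ana")
```

Any similarity threshold would need tuning against real records. Dedications such as "del Nino Jesus" are not handled here: `group_particles` would split that phrase after "NINO", so it needs separate treatment.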
===============================
Basic packages:
`conda create -n namegraph python=3 pandas ipython matplotlib unidecode fuzzywuzzy tqdm jupyter`

    ├── LICENSE
    ├── Makefile           <- Makefile with commands like `make data` or `make train`
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
    │                         the creator's initials, and a short `-` delimited description, e.g.
    │                         `1.0-jqp-initial-data-exploration`.
    │
    ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
    ├── src                <- Source code for use in this project.
    │   ├── __init__.py    <- Makes src a Python module
    │   │
    │   ├── data           <- Scripts to download or generate data
    │   │   └── make_dataset.py
    │   │
    │   ├── features       <- Scripts to turn raw data into features for modeling
    │   │   └── build_features.py
    │   │
    │   ├── models         <- Scripts to train models and then use trained models to make
    │   │   │                 predictions
    │   │   ├── predict_model.py
    │   │   └── train_model.py
    │   │
    │   └── visualization  <- Scripts to create exploratory and results oriented visualizations
    │       └── visualize.py
    │
    └── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience