
burgerLinker - Civil Registries Linking Tool

Further details regarding the data standardisation and the data model are available in the burgerLinker Wiki or via the burgerLinker lecture.

Purpose

This tool is being developed to improve and replace the current LINKS software. Points of improvement are:

To download the latest version of the tool, see the Releases page of this repository.

Use case

Historians use archival records to describe persons' lives. Each record (e.g. a marriage record) describes just a single point in time, so historians try to link multiple records on the same person to reconstruct a life course. This tool focuses on "just" the linkage of civil records. By doing so, pedigrees spanning multiple generations can be created for research on social inequality, especially in the part of the health sciences that focuses on gene-social contact interactions.

User profile

The software is designed for so-called "digital historians" (e.g. humanities scholars with basic command line skills) who are interested in using the Dutch civil registries for their studies, or in linking their own data to them.

Data

In its current version, the tool cannot be used to match entities from arbitrary sources. It is solely focused on the linkage of civil records, relying on the consanguineous relations recorded on each certificate, modelled according to our Civil Registries schema. An overview of the Civil Registries schema is available as a PNG file, and you can browse it on Druid.

Previous work

So far, (Dutch) civil records have been linked through bespoke programming by researchers, sometimes supported by engineers. In particular, the IISG-LINKS program has a pipeline to link these records and provide them to the Central Bureau of Genealogy (CBG). Because the number of records has grown over time and IISG-LINKS takes an enormous amount of time (weeks) to link all records currently available, burgerLinker is designed to do this much faster (the full sample takes less than 48 hours).

The Golden Agents project has produced Lenticular Lenses, a tool designed to link persons across sources of various natures. We have engaged with the Lenticular Lenses team on multiple occasions (a demo presentation, two person-vocabulary workshops, and a dedicated between-teams workshop), and from those meetings we adopted the ROAR vocabulary for work in CLARIAH-WP4. For the burgerLinker and Lenticular Lenses tools specifically, however, we found that the Lenticular Lenses prerequisite of allowing heterogeneous sources conflicted with the burgerLinker prerequisite of being fast: one reason burgerLinker is fast is precisely the limited set of sources it allows.

The only other initiatives we are aware of are bespoke programming efforts by domain-specific researchers, with country- and time-specific linking rules implemented in, for example, R. These linkage tools are on the whole slow. What we did do is make our own linking rule set modular, so that country- and time-specific rule sets can be incorporated into burgerLinker in the future.

Update: at ESSHC 2023 we learned of population-linkage and hope to set up talks to discuss the similarities and differences in our approaches. Also at ESSHC 2023, we learned of the Norwegian effort for historical record linking: https://github.com/uit-hdl/rhd-linking (for documentation, see https://munin.uit.no/handle/10037/28399).


Operating Systems

Installation requirements

Input requirements

Output format

Two possible output formats to represent the detected links:

Main dependencies

This tool mainly relies on two open-source libraries:

Tool functionalities

Functionalities supported in the current version (function names are case-insensitive):

Tool parameters

Parameters that can be provided as input to the linking tool:


Examples

Display the tool's help page:

java -jar burgerLinker.jar --help


Convert an N-Quads RDF dataset to HDT:

java -jar burgerLinker.jar --function ConvertToHDT --inputData dataDirectory/myCivilRegistries.nq --outputDir .

This will generate the HDT file 'myCivilRegistries.hdt' and its index 'myCivilRegistries.hdt.index' in the same directory. The index should be kept in the same directory as the HDT file to speed up all queries.

:warning:

This is the most memory-intensive step of the tool. To avoid running out of memory on larger datasets, we recommend (i) running this step on a machine with sufficient memory, and (ii) raising the initial lower and upper bounds of the Java heap size by adding the -Xms and -Xmx flags.

As an example, here are the flags used for generating the HDT file of all Dutch birth and marriage certificates:

java -Xms64g -Xmx96g -jar burgerLinker.jar --function ConvertToHDT --inputData dataDirectory/myCivilRegistries.nq --outputDir .


Merge two existing HDT files into one dataset:

java -jar burgerLinker.jar --function ConvertToHDT --inputData dataDirectory/hdt1.hdt,dataDirectory/hdt2.hdt --outputDir .

This will generate a third HDT file 'merged-dataset.hdt' and its index 'merged-dataset.hdt.index' in the same directory.

:warning:

The two HDT files given as input must be separated by a comma only, without a space.


Link birth certificates to marriage certificates:

java -jar burgerLinker.jar --function Between_B_M --inputData dataDirectory/myCivilRegistries.hdt --outputDir . --format CSV --maxLev 3 --fixedLev

These arguments indicate that the user wants to:

[Between_B_M] link parents of newborns in Birth Certificates to brides and grooms in Marriage Certificates,
[dataDirectory/myCivilRegistries.hdt] in the civil registries dataset myCivilRegistries.hdt modelled according to our civil registries RDF schema,
[.] save the detected links in the current directory,
[CSV] as a CSV file,
[3] allowing a maximum Levenshtein distance of 3 per name (first name or last name),
[fixedLev] independently of the length of the name (see the variant sketched below).
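Without the optional --fixedLev flag, the allowed Levenshtein distance instead depends on the length of the name. A sketch of the same call without it, reusing the same placeholder file names:

java -jar burgerLinker.jar --function Between_B_M --inputData dataDirectory/myCivilRegistries.hdt --outputDir . --format CSV --maxLev 3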

Compute the transitive closure of the detected links:

java -jar burgerLinker.jar --function closure --inputData dataDirectory/myCivilRegistries.hdt --outputDir myResultsDirectory

This command computes the transitive closure of all links present in the directory myResultsDirectory, and generates a new dataset finalDataset.nt.gz in this directory by replacing the identifiers of all matched individuals from the myCivilRegistries.hdt input dataset with a single unique identifier per person.

How?

The directory myResultsDirectory must contain the CSV files produced by the linking functions described above, with their file names unchanged (the tool finds these files using a regular expression search in this directory). It can contain one or more of the following CSV files, with X being any integer from 0 to 4:

The function first transforms the links in these CSV files, which are asserted between certificate identifiers, into links between individuals. Since identity links are transitive and symmetric, it then computes the transitive closure of all these individual links and generates a new identifier for each resulting equivalence class.

Example: suppose one link file asserts that :newborn1 refers to the same person as :bride1, and another asserts that :mother1 refers to the same person as :bride1.

This means that all these identifiers (:newborn1, :bride1, and :mother1) refer to the same individual, appearing in different roles in different civil certificates. This function generates a new dataset, replacing all occurrences of these three identifiers with a single unique identifier (e.g. :i-1). This process allows the reconstruction of historical families, without the need to write complex queries or follow a large number of identity links across the dataset.
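To illustrate the closure computation, here is a minimal sketch (not burgerLinker's actual implementation) that merges the two hypothetical identity links above with a union-find structure and assigns one fresh identifier per equivalence class:

```java
import java.util.*;

// Minimal sketch of the closure step, NOT burgerLinker's actual code:
// identity links are merged with union-find, and every resulting
// equivalence class receives one new identifier (e.g. :i-1).
public class ClosureSketch {
    private final Map<String, String> parent = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (p.equals(x)) return x;
        String root = find(p);
        parent.put(x, root); // path compression
        return root;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        ClosureSketch links = new ClosureSketch();
        // The two hypothetical identity links from the example above.
        links.union(":newborn1", ":bride1");
        links.union(":mother1", ":bride1");

        // Group all identifiers by their representative element.
        Map<String, List<String>> classes = new HashMap<>();
        for (String id : new ArrayList<>(links.parent.keySet()))
            classes.computeIfAbsent(links.find(id), k -> new ArrayList<>()).add(id);

        // Assign a fresh identifier to each equivalence class.
        int i = 1;
        for (List<String> members : classes.values())
            System.out.println(":i-" + (i++) + " replaces " + members);
    }
}
```

Running this prints one line, ":i-1 replaces [:newborn1, :bride1, :mother1]" (member order may vary), mirroring how the generated dataset substitutes a single identifier for the whole class.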

Convert the file hkh-maids.nt to HDT:

java -jar burgerLinker.jar --function convertToHDT --inputData maids/maids-dataset/maids.nt --outputDir maids/maids-dataset/

Merge the resulting HDT dataset of hkh-maids with the HDT file of the marriages:

nohup java -Xms128g -Xmx192g -jar burgerLinker.jar --function convertToHDT --inputData maids/maids-dataset/maids.hdt,civ-reg-2021/HDT/marriages.hdt --outputDir maids/maids-and-marriages-dataset/ &

Run Within_B_M with the singleInd flag on the resulting merged dataset:

nohup java -Xms128g -Xmx192g -jar burgerLinker.jar --function within_B_M --inputData maids/maids-and-marriages-dataset/merged-dataset.hdt --outputDir maids/results/ --maxLev 1 --ignoreDate --singleInd &

Links are saved in the following CSV file (around 100K links detected with the above parameters):

maids/results/within_b_m-maxLev-1-singleInd-ignoreDate/results/within-B-M-maxLev-1-singleInd-ignoreDate.csv

NB: when running burgerLinker with nohup, the progress of the linking is written to the nohup.out file. You can track it with tail -f nohup.out.


Post-processing rules

Date filtering assumptions


Possible direct extensions

It would be possible to add more general matching functionalities that are not dependent on the Civil Registries schema. One possible way would be to provide a JSON Schema as an additional input alongside any given dataset, specifying (i) the classes whose instances the user wishes to match (e.g. sourceClass: iisg:Newborn; targetClass: iisg:Groom), and (ii) the properties that should be considered in the matching (e.g. schema:givenName; schema:familyName).
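No such input exists in the current version; a hypothetical configuration along these lines (all keys invented for illustration, reusing the class and property names from the paragraph above) might look like:

```json
{
  "sourceClass": "iisg:Newborn",
  "targetClass": "iisg:Groom",
  "matchingProperties": ["schema:givenName", "schema:familyName"]
}
```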

Subsequently, the fast matching algorithm could be used for many other linkage purposes in the Digital Humanities, e.g. linking places, occupations, or products.