echr-od / ECHR-OD_process

Process to rebuild the European Court of Human Rights database and datasets from scratch
https://echr-opendata.eu
MIT License

European Court of Human Rights OpenData construction process

This repository contains the scripts to build the database and datasets of the European Court of Human Rights OpenData (ECHR-OD) project. It serves several purposes:

  1. Reproducibility: anyone can rebuild the entire database from scratch.

  2. Extensibility: any new version of the database must be created from an updated version of these scripts.

  3. Revision: all cases are processed automatically, and there are many corner cases. This repository allows anyone to inspect the intermediate files, check whether the results are correct, and locate the root cause of parsing errors.

General information

Following the project and getting help

Citing

If you are using the project, please consider citing:

@article{ECHRDB,
  title        = {On Integrating and Classifying Legal Text Documents},
  author       = {Quemy, A. and Wrembel, R.},
  year         = 2020,
  journal      = {International Conference on Database and Expert Systems Applications (DEXA)}
}

Versioning and deployment

There are two distinct types of versions:

  1. Semantic versioning (e.g. 2.0.1) that indicates the version of the process. It relates only to the code and the type of data available.

    1. a major revision indicates a change in the type of data available
    2. minor and patch revisions concern bug fixes and improvements
  2. Date of release (e.g. 2020-11-01), which indicates when a build was generated.

The database is meant to be updated every month with new cases. New releases are built upon an image created from the latest sources. Therefore, the date of release is technically enough to identify the semantic version. However, semantic versioning helps the maintainers and contributors with development.

Installation & Usage

Recreating the database requires Docker.

To build the environment image:

docker build -f Dockerfile -t echr_build .

As long as dependencies are not changed, there is no need to rebuild the image.

Once the image is built, the container help can be displayed with:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build -h

In particular, to build the database:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build build

Build, Steps & Workflow

The entry point of the extract-transform-load (ETL) process is build.py.
The individual ETL steps can be found in the subfolder echr/steps.

The main build script loads a workflow made of steps and executes each of them in turn. Workflows are YAML files and can be found in the folder workflows.
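To illustrate the idea of a workflow as an ordered list of steps, here is a minimal sketch in Python. It is not the code of build.py: the step functions, the STEPS registry, and the workflow schema (a "steps" list with "name" and "args" keys) are all assumptions made for the example, and the workflow is an in-memory dictionary standing in for a parsed YAML file.

```python
# Hypothetical steps; in the project, the real steps live in echr/steps.
def get_cases(state, max_documents=10):
    """Pretend to fetch cases by generating placeholder identifiers."""
    state["cases"] = list(range(max_documents))
    return state


def filter_cases(state):
    """Keep only the even identifiers, as a stand-in for real filtering."""
    state["cases"] = [case for case in state["cases"] if case % 2 == 0]
    return state


# Hypothetical registry mapping step names to callables.
STEPS = {"get_cases": get_cases, "filter_cases": filter_cases}

# In-memory stand-in for a parsed YAML workflow (the schema is assumed).
workflow = {
    "steps": [
        {"name": "get_cases", "args": {"max_documents": 6}},
        {"name": "filter_cases"},
    ]
}


def run_workflow(workflow, steps=STEPS):
    """Run each step in order, passing a shared state dictionary along."""
    state = {}
    for step in workflow["steps"]:
        func = steps[step["name"]]
        state = func(state, **step.get("args", {}))
    return state


result = run_workflow(workflow)
```

Keeping the step order in a data file rather than in code is what makes it possible to ship several workflows that reuse the same steps.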

The workflows provided with the project are:

We have the following relations:

This separation has been made because generating the NLP model takes up 95% of the run time of the whole Release workflow and requires a large amount of RAM (more than 16 GB).

Workflows may define variables using uppercase names starting with $ (e.g. $MAX_DOCUMENTS). The variables are replaced during the build process, with values taken from the following sources in order of priority:

  1. Environment variable
  2. CLI parameter
  3. Configuration file, under build.env
  4. Global variable defined in build.py
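The priority order above can be sketched as a lookup through the four sources, from highest to lowest priority. This is only an illustration, not the implementation in build.py: the names resolve_variable, substitute, and GLOBAL_DEFAULTS, the regular expression, and the default value are hypothetical.

```python
import os
import re

# Hypothetical stand-in for the global variables defined in build.py.
GLOBAL_DEFAULTS = {"MAX_DOCUMENTS": "100"}

# Matches uppercase names starting with "$", e.g. $MAX_DOCUMENTS.
VARIABLE_PATTERN = re.compile(r"\$([A-Z][A-Z0-9_]*)")


def resolve_variable(name, cli_params, config_env, environ=None):
    """Return the value of `name` from the first source that defines it."""
    environ = os.environ if environ is None else environ
    # Sources are listed from highest to lowest priority.
    for source in (environ, cli_params, config_env, GLOBAL_DEFAULTS):
        if name in source:
            return str(source[name])
    raise KeyError(f"Undefined workflow variable: ${name}")


def substitute(text, cli_params, config_env, environ=None):
    """Replace every $NAME occurrence in `text` with its resolved value."""
    return VARIABLE_PATTERN.sub(
        lambda match: resolve_variable(
            match.group(1), cli_params, config_env, environ
        ),
        text,
    )
```

For example, substitute("limit=$MAX_DOCUMENTS", {"MAX_DOCUMENTS": 5}, {"MAX_DOCUMENTS": 7}, environ={}) picks the CLI value over the configuration file, because no environment variable is set in that call.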

Configuration

The general configuration file is config.yml and contains three parts:

  1. logging: related to logging files

  2. steps: step-specific configuration, applied on top of the workflow

  3. build: specific build configuration, in particular the section env contains the variables available to the whole workflow

Logs

There are two kinds of logs:

  1. The build log, written in two formats: build/<build>/logs/build.html and build/<build>/logs/build.txt
  2. The process log, mostly used for debugging: logs/build.log
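As a small convenience, the log locations listed above can be computed for a given build. The helper below is a hypothetical sketch, not part of the project; it assumes that <build> is the name of the build folder and that the paths are relative to the repository root.

```python
from pathlib import Path


def build_log_paths(build_name, root="."):
    """Return the expected log file locations for one build (hypothetical helper)."""
    root = Path(root)
    logs_dir = root / "build" / build_name / "logs"
    return {
        "build_html": logs_dir / "build.html",  # build log, HTML format
        "build_txt": logs_dir / "build.txt",    # build log, plain text
        "process": root / "logs" / "build.log", # process log, for debugging
    }


# "my_build" is a placeholder build name used only for illustration.
paths = build_log_paths("my_build")
```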

Tests & Coverage

To run the tests:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build test

Versions

Contributors