SCALPEL-Extraction is a library of the SCALPEL3 framework, resulting from a research partnership between École Polytechnique and Caisse Nationale d'Assurance Maladie, started in 2015 by Emmanuel Bacry and Stéphane Gaïffas. Since then, many research engineers and PhD students have developed and used this framework to do research on SNDS data; the full list of contributors is available in CONTRIBUTORS.md.
It provides concept extractors meant to fetch meaningful Medical Events & Patients from Système National des Données de Santé (SNDS) data.
This library is based on Apache Spark. It reads the flat data produced by running SCALPEL-Flattening on raw SNDS data, and then extracts Patients and Events in three steps:

1. Reading the flat data from the files generated by the flattening step;
2. Extracting "raw" events (such as Drug Dispensations, Diagnoses, Medical Acts, etc.) and converting them to Events;
3. Transforming the "raw" events into "processed" events (such as follow-up periods, molecule exposures, outcomes, etc.), also represented as Events.
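The Event abstraction is what ties these steps together: both raw extraction (step 2) and transformation (step 3) produce collections of events sharing a single schema. As a rough illustration only (this is a simplified stand-in, not the library's actual Event class definition):

```scala
import java.sql.Timestamp

// Illustrative sketch of a normalized event; field names are assumptions
// chosen for this example, not the library's exact definition.
case class Event(
  patientID: String,     // pseudonymized patient identifier
  category: String,      // e.g. "drug_purchase", "diagnosis", "exposure"
  value: String,         // e.g. an ATC code or a CIM-10 code
  start: Timestamp,      // event start date
  end: Option[Timestamp] // optional end date (None for point events)
)
```

Keeping a single schema for every extracted fact is what makes the downstream analysis uniform, whatever the source table.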
Extracted data can easily be used to perform interactive data analysis using SCALPEL-Analysis.
Important remark: this software is currently in alpha stage. It should be fairly stable, but the API might still change and the documentation is partial. We are doing our best to improve documentation coverage as quickly as possible.
To build a JAR from this repository, you need SBT (Scala Build Tool) v0.13.15 and the sbt-assembly plugin. Then, run the following commands:
```sh
git clone git@github.com:X-DataInitiative/SCALPEL-Extraction.git
cd SCALPEL-Extraction
sbt assembly
```
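The sbt-assembly plugin is already configured in this repository's build. For reference only, if you ever need to declare it yourself in another project, it goes in `project/plugins.sbt`; the version below is an assumption for SBT 0.13, check the repository's own file for the exact one:

```scala
// project/plugins.sbt -- declares the sbt-assembly plugin used to build fat JARs.
// The version shown is an assumption; use the one pinned in this repository.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
```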
SCALPEL-Extraction reads the flat data produced by SCALPEL-Flattening, which is saved in Parquet or ORC format. The correct format must be selected in the configuration (example): `read_file_format` sets the format used to read the flat data. Once a job is finished, the results are saved to the file system (local file system or HDFS) in Parquet or ORC format; `write_file_format` sets the format used to save the results. If the ORC format is set in the configuration files, an extra setting must be added to the spark-submit command: `spark.sql.orc.impl=native`.
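To make these two settings concrete, the sketch below shows what each choice corresponds to in plain Spark code. This is an illustration, not the library's own code; the paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FormatSketch {
  def main(args: Array[String]): Unit = {
    // Enable the native ORC implementation, as required when ORC is used.
    val spark = SparkSession.builder()
      .appName("format-sketch")
      .config("spark.sql.orc.impl", "native")
      .getOrCreate()

    // read_file_format selects between code paths like these (placeholder path):
    val flatData: DataFrame = spark.read.parquet("/flattening/output/DCIR")
    // val flatData: DataFrame = spark.read.orc("/flattening/output/DCIR")

    // write_file_format selects the output format of the results (placeholder path):
    flatData.write.orc("/extraction/output/events")

    spark.stop()
  }
}
```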
Right now, configurations are tied to "studies". A study can be seen as a sub-package, possibly containing custom extractors, a main class orchestrating the extraction, and a default configuration. For each study, a template configuration file containing the default values is defined. When running the study's main class, if a parameter needs to be changed, one just needs to copy this template file, edit it by uncommenting and modifying the appropriate lines, and pass it to spark-submit using the `conf` argument.
For example, the template configuration file for a study on the association between elderly falls and several drug groups is defined here. So, if one wants to override `min_purchases`, `purchases_window`, and `cancer_definition`, they just need to create a copy of this file on the master server and uncomment these lines, changing the values as appropriate:
```
# Previous lines stay commented...
# exposures.purchases_window: 0 months // 0+ (Usually 0 or 6) Represents the window size in months. Ignored when min_purchases=1.
# exposures.end_threshold_gc: 90 days // If periodStrategy="limited", represents the period without purchases for an exposure to be considered "finished".
# exposures.end_threshold_ngc: 30 days // If periodStrategy="limited", represents the period without purchases for an exposure to be considered "finished".
exposures.end_delay: 30 days // Length of the period added to the exposure end to delay it (lag).
# drugs.level: "Therapeutic" // Options are Therapeutic, Pharmacological, MoleculeCombination
drugs.families: ["Antihypertenseurs", "Antidepresseurs", "Neuroleptiques", "Hypnotiques"]
# Next lines stay commented...
```
This file should then be stored with the results, to keep track of which configuration was used to generate a dataset. The commit number of the code used to extract events is included in the SCALPEL-Extraction results (metadata file). As a result, the configuration file and the metadata should be enough to reproduce a dataset extraction.
The entry points for executing the extraction are study-specific, and therefore live within the study package. To start an extraction job for a given study, run a spark-submit command containing:

- the `--total-executor-cores` and `--executor-memory` arguments;
- the `--class` argument, pointing to the study's main class;
- the fat JAR produced by `sbt assembly`;
- the `conf` argument, to override the default parameters.

One can create an alias or script to make things easier. For example, for the Pioglitazone study, one could run the following shell script:
```sh
#!/bin/sh
spark-submit \
  --driver-memory 40G \
  --executor-memory 110G \
  --total-executor-cores 150 \
  --conf spark.task.maxFailures=20 \
  --class fr.polytechnique.cmap.cnam.study.pioglitazone.PioglitazoneMain \
  SCALPEL-Extraction-assembly-2.0.jar conf=./overrides.conf env=cmap
```
The Bulk Main is a special study that transforms all the SNDS data into our normalized format based on the Event class. It is intended to reduce the complexity of the SNDS and ease statistical analysis. The extractors available in the Bulk are listed here.
The steps to use the Bulk Main are:

1. Create a new configuration file in `src/main/resources/config/bulk/paths`, for example, `my_env.conf`.
2. In `my_env.conf`, add the links to your flattened SNDS data. See `cmap.env` for an example.
3. In `src/main/resources/config/bulk/default.conf`, add the following:

   ```
   my_env = ${root} {
     include "paths/my_env.conf"
   }
   ```

4. Run the extraction job with spark-submit:

   ```sh
   spark-submit \
     --total-executor-cores 160 \
     --executor-memory 18G \
     SCALPEL-Extraction-assembly-2.0.jar env=my_env
   ```

5. Use `metadata.json` to load and analyse your data using SCALPEL-Analysis.

Our package offers ready-to-use Extractors intended to extract events from raw, flat SNDS data and to output them in a simpler, normalized data format. For more details, please read the companion paper.

Extractor | SNDS data source | Description |
---|---|---|
Act | MCO | Extracts CCAM acts available in PMSI-MCO (public hospitalization) |
Act | DCIR | Extracts CCAM acts available in DCIR (one-day & liberal acts) |
Act | MCO-ACE | Extracts CCAM acts available in MCO-ACE |
Main Diagnosis | MCO | Extracts the main diagnoses available in PMSI-MCO, coded in CIM-10 format |
Linked Diagnosis | MCO | Extracts the linked diagnoses available in PMSI-MCO, coded in CIM-10 format |
Associated Diagnosis | MCO | Extracts the associated diagnoses available in PMSI-MCO, coded in CIM-10 format |
Long Term Diseases | IMB | Extracts long-term disease ('ALD' in French) diagnoses available in IMB |
Hospital Stay | MCO | Extracts hospital stays |
Drug Purchases | DCIR | Extracts drug purchases available in DCIR. Can be adjusted to the desired level: CIP13, molecule, pharmacological class & therapeutic class. |
Patients | Mainly IR-BEN-R | Extracts available patients with their gender, birth date, and, when available, death date |
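Once events are in this normalized format, downstream analysis stays uniform regardless of the source table. As a minimal sketch (assuming the extraction results were written as Parquet and expose at least `category` and `patientID` columns, per the Event description above; this is not the library's API):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

object EventStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("event-stats").getOrCreate()

    // Placeholder path; assumes events were saved in Parquet format.
    val events = spark.read.parquet("/extraction/output/events")

    // One line per event category, with the number of distinct patients:
    events.groupBy("category")
      .agg(countDistinct("patientID").as("n_patients"))
      .orderBy("category")
      .show()

    spark.stop()
  }
}
```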
Transformers combine multiple events produced by the extractors to produce higher-level, more complex Events. Although they are often study-specific, the transformers can be configured to some extent. If they are not flexible enough for your use case, they can serve as a good starting point for your custom implementations.
Transformer | Combines | Description |
---|---|---|
FollowUp | Multiple | Combines multiple events to define a follow-up period per patient. |
Exposures | Drug Purchases & FollowUp | Combines drug purchase events and follow-up periods to create Exposures. Offers multiple strategies, such as limited & unlimited exposure duration. |
Outcome | Acts & Diagnoses | Creates complex outcomes based on Acts & Diagnoses data. Our list includes fractures, heart failure, infarction, and bladder cancer. |
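To give an idea of what the Exposures transformer does, here is a self-contained sketch of a simplified "limited" strategy: consecutive purchases are chained into one exposure as long as the gap between them stays below a threshold, and the exposure end is lagged by a delay. This illustrates the idea only; it is not the library's actual implementation, and the parameter names merely echo the configuration keys shown earlier:

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import scala.collection.mutable.ListBuffer

object ExposureSketch {
  case class Exposure(start: LocalDate, end: LocalDate)

  // Simplified "limited" exposure strategy: a purchase extends the current
  // exposure when it occurs within `endThresholdDays` of the previous one;
  // otherwise the current exposure is closed and a new one starts.
  // `endDelayDays` lags the exposure end, mirroring the end_delay setting.
  def exposures(purchases: Seq[LocalDate],
                endThresholdDays: Long = 90,
                endDelayDays: Long = 30): Seq[Exposure] = {
    val sorted = purchases.sortBy(_.toEpochDay)
    if (sorted.isEmpty) return Seq.empty
    val groups = ListBuffer(ListBuffer(sorted.head))
    for (purchase <- sorted.tail) {
      if (ChronoUnit.DAYS.between(groups.last.last, purchase) <= endThresholdDays)
        groups.last += purchase // still the same exposure
      else
        groups += ListBuffer(purchase) // gap too large: start a new exposure
    }
    groups.map(g => Exposure(g.head, g.last.plusDays(endDelayDays))).toSeq
  }

  def main(args: Array[String]): Unit = {
    val purchases = Seq(
      LocalDate.of(2015, 1, 10), LocalDate.of(2015, 2, 20), // within 90 days
      LocalDate.of(2015, 9, 1)                              // gap > 90 days
    )
    exposures(purchases).foreach(println)
    // Exposure(2015-01-10,2015-03-22) then Exposure(2015-09-01,2015-10-01)
  }
}
```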
If you use a library of the SCALPEL3 framework in a scientific publication, we would appreciate citations. You can use the following BibTeX entry:
```bibtex
@article{bacry2020scalpel3,
  title={SCALPEL3: a scalable open-source library for healthcare claims databases},
  author={Bacry, Emmanuel and Gaiffas, St{\'e}phane and Leroy, Fanny and Morel, Maryan and Nguyen, Dinh-Phong and Sebiat, Youcef and Sun, Dian},
  journal={International Journal of Medical Informatics},
  pages={104203},
  year={2020},
  publisher={Elsevier}
}
```
SCALPEL-Extraction is implemented in Scala 2.11.12 with Spark 2.3 and HDFS (Hadoop 2.7.3).
The code should follow the Databricks Scala Style Guide (which relies on the Scala Style Guide). You can use linters in your IDE (for instance Scalastyle, shipped by default with IntelliJ) to help you comply with these style guides.
We should also follow clean code best practices as much as possible (see, for instance, the Clean Code book).
Our imports are based on the style suggested at the following link, with a few modifications; every contributor should configure their IDE accordingly. In IntelliJ (`Settings > Editor > Code Style > Scala > Imports`), we use the following import order:
- `java`
- `javax`
- `scala`
- all other imports
- `org.apache.spark`
- `fr.polytechnique`
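For instance, a source file's import block following this order would look like the example below; the particular imports are placeholders chosen for illustration (the Typesafe Config import stands in for "all other imports"):

```scala
// Example import block respecting the ordering above.
import java.time.LocalDate
import javax.annotation.Generated

import scala.collection.mutable

import com.typesafe.config.Config

import org.apache.spark.sql.{DataFrame, SparkSession}

import fr.polytechnique.cmap.cnam.study.pioglitazone.PioglitazoneMain
```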