facebookresearch / Clinical-Trial-Parser

Library for converting clinical trial eligibility criteria to a machine-readable format.
Apache License 2.0

Clinical Trial Parser

This is a library for parsing clinical trial eligibility criteria. It contains context-free grammar (CFG) and information extraction (IE) parser models, annotated word labeling data, medical word embeddings and vocabulary tools.


Clinical trials face multiple problems. Due to these challenges, research is often slower and more biased than it should be.

Clinical trials use eligibility criteria to specify the participant population and to guarantee patient safety. The difficulty lies in converting these free-text criteria to a machine-readable format. This library aims to reduce the amount of manual work needed to understand clinical trial eligibility by extracting information algorithmically, using domain intelligence from the text itself (embeddings) and from external expert data (vocabularies/ontologies).



Clinical Trial Parser relies on a combination of CFG and classic IE techniques to extract structured nominal, ordinal, and numerical requirements from eligibility criteria text. For details, see the architecture description.

Parser Diagram

Engine Diagram


The purpose of this library is to interpret and convert eligibility criteria to machine-readable relations. This way trials can easily be searched and discovered by their eligibility requirements.

To do this, inclusion (IC) and exclusion (EC) criteria sections are first extracted using regular expressions that identify the section headers. Next, the eligibility sections are split into individual criteria. Because the eligibility criteria must evaluate to one of two values for each person, 'eligible' (yes) or 'not eligible' (no), Boolean algebra can be used to express the eligibility logic. If an inclusion criterion is denoted by $I_i$ and an exclusion criterion by $E_i$, the eligibility is computed as

$$\text{eligibility} = \bigwedge_i \tilde{I}_i ,$$

where the effective inclusion criterion is defined as

$$\tilde{I}_i = \begin{cases} I_i & \text{for an inclusion criterion,} \\ \neg E_i & \text{for an exclusion criterion.} \end{cases}$$

A complete solution needs to consider a corner case where an exclusion criterion is actually an inclusion criterion. In this case, the negation cannot be applied. Fortunately, there are few such cases and they tend to affect only certain types of requirements.
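
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's implementation: the header pattern `SECTION_RE`, the helper names `split_sections` and `is_eligible`, and the one-criterion-per-line layout are all hypothetical, and the sketch ignores the corner case of an exclusion criterion that is really an inclusion criterion.

```python
import re

# Hypothetical header pattern: real trial records use many more header variants.
SECTION_RE = re.compile(
    r"^[ \t]*(inclusion|exclusion)[ \t]+criteria[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

# Leading bullet ("-", "*", "•") or list number ("1.", "2)") to strip.
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def split_sections(text):
    """Split eligibility text into lists of inclusion and exclusion criteria."""
    sections = {"inclusion": [], "exclusion": []}
    headers = list(SECTION_RE.finditer(text))
    for k, header in enumerate(headers):
        # A section runs from its header to the next header (or end of text).
        end = headers[k + 1].start() if k + 1 < len(headers) else len(text)
        body = text[header.end():end]
        # Assumption: one criterion per line.
        for line in body.splitlines():
            criterion = BULLET_RE.sub("", line).strip()
            if criterion:
                sections[header.group(1).lower()].append(criterion)
    return sections


def is_eligible(inclusion_met, exclusion_met):
    """AND together the effective inclusion criteria: I_i as is, E_i negated."""
    effective = list(inclusion_met) + [not e for e in exclusion_met]
    return all(effective)
```

Here `inclusion_met` and `exclusion_met` are the per-criterion truth values for one person, so a person who meets every inclusion criterion and no exclusion criterion is eligible.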

The parser uses IE classifiers and CFG to convert individual eligibility criteria to machine-readable relations. Typically, both the IE classifiers and the CFG parser are applied to a criterion. Every interpretation whose confidence score exceeds a preset threshold is kept, rather than only the best one, because a criterion may be a composite built by joining multiple criteria together.
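
The thresholding step can be illustrated as follows. This is a sketch only: the `Interpretation` record, its field names, and the default threshold of 0.5 are assumptions made for the example and are not taken from the library.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    variable: str   # hypothetical variable name, e.g. "age"
    relation: str   # hypothetical machine-readable relation
    score: float    # confidence score, assumed to lie in [0, 1]
    source: str     # which parser produced it: "cfg" or "ie"


def keep_confident(interpretations, threshold=0.5):
    """Keep every interpretation scoring above the threshold, best first.

    Several interpretations may survive for one criterion, which is what
    lets a composite criterion yield more than one relation.
    """
    kept = [i for i in interpretations if i.score > threshold]
    return sorted(kept, key=lambda i: i.score, reverse=True)
```

For a composite criterion such as "age over 18 and BMI below 30", both the age and the BMI interpretations would be retained as long as each clears the threshold.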

A variable is a convenient abstraction for interpreting eligibility criteria and defining machine-readable relations. It corresponds to a basic unit of clinical or demographic information that is extracted from a criterion and that determines a person's eligibility.


Extracted relations on variables are formatted with the following JSON fields:

The parser splits extraction by variable type. It handles three types of variables: nominal, ordinal, and numerical.
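
To make the distinction between nominal, ordinal, and numerical requirements concrete, the sketch below shows one way a relation of each kind could be checked against a person's data. The class and field names are hypothetical and chosen for illustration; they are not the JSON fields emitted by the parser.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class NumericalRelation:
    """A bounded quantity, e.g. age between 18 and 65 years."""
    variable: str
    lower: Optional[float] = None  # inclusive lower bound, if any
    upper: Optional[float] = None  # inclusive upper bound, if any

    def satisfied_by(self, value: float) -> bool:
        if self.lower is not None and value < self.lower:
            return False
        if self.upper is not None and value > self.upper:
            return False
        return True


@dataclass
class OrdinalRelation:
    """A graded scale, e.g. a performance status of 0 or 1."""
    variable: str
    allowed: Sequence[int]  # permitted grades

    def satisfied_by(self, value: int) -> bool:
        return value in self.allowed


@dataclass
class NominalRelation:
    """Presence or absence of a named condition."""
    variable: str
    required: bool = True  # False means the condition must be absent

    def satisfied_by(self, present: bool) -> bool:
        return present == self.required
```

For example, `NumericalRelation("age", lower=18, upper=65).satisfied_by(40)` evaluates to `True`, while the same relation rejects a value of 17.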


The quality of the parser is measured on eligibility criteria randomly sampled from recruiting trials. The CFG model is estimated to parse ordinal and numerical requirements with a precision of ≥ 90% and a recall of ≥ 85% per implemented variable. The IE model is estimated to parse nominal requirements with a precision of approximately 44% for heart-condition-related criteria. Although the precision of named entity recognition (NER) is estimated to be ~88%, grounding the extracted entity mentions to medical concepts lowers the IE quality: differences between the wording of eligibility criteria and the Medical Subject Headings (MeSH) vocabulary lead to imperfect named entity linking (NEL), and criteria are sometimes written ambiguously. The extracted concepts are grounded to about 6K medical variables.
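
For readers less familiar with these metrics, the snippet below shows how precision and recall follow from counts obtained by manually reviewing a sample of criteria. The function name and the counts used in the usage note are made up for illustration and are not the library's evaluation data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from manually reviewed counts.

    precision = correct extractions / all extractions made
    recall    = correct extractions / all requirements present in the sample
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

With 9 correct extractions, 1 spurious extraction, and 3 missed requirements, `precision_recall(9, 1, 3)` gives a precision of 0.9 and a recall of 0.75.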


This library works with Mac OS X or Linux. The developer guide describes how to set up the project and prepare the resources.


To build and test the CFG parser, run:

go build ./...
go test ./...

To train a new NER model, run:

pytext train < src/resources/config/ner.json

To test the NER model, run:

pytext test < src/resources/config/ner.json


The CFG parser can be run by executing:


The sample input and output of the script are clinical_trials.csv and cfg_parsed_clinical_trials.tsv.

The IE parser can be run by executing:


The sample input and output of the script are clinical_trials.csv and ie_parsed_clinical_trials.tsv.


Thanks to the Clinical Trials Transformation Initiative (CTTI) for providing the Aggregate Analysis of ClinicalTrials.gov (AACT) Database for the registered clinical studies at ClinicalTrials.gov.


Clinical Trial Parser is Apache 2.0 licensed, as found in the LICENSE file. Facebook assumes no responsibility for the resulting use of this library.