PhilAppleby / ukbb-srmed

Text matching UK Biobank Self-Reported medication descriptions for medication classification
GNU Lesser General Public License v3.0
23 stars 4 forks source link

UK Biobank Self Reported Medication Data parsing and matching

Background

UK Biobank self reported medications are represented in UK Biobank as a list of codes per participant captured at the time of the Baseline Assessment Interview.

For example, participant id '000001' may have reported medications with UKBB codes 1140922174, 1140879424, 1140879616, 1197 and 2038460150.

These are described in the UKBB medication codes table as:

Code Description Report Count
1140922174 alendronate sodium 6380
1140879424 alverine 308
1140879616 amitriptyline 10119
1197 evening primrose oil product 1132
2038460150 paracetamol 100036

There is no structure in the data and no means of grouping medication into categories such as "Drugs for Diabetes" or "Drugs to control Asthma".

Aims

Write and test software to match terms in the UK Biobank Self Reported Medication data coding table with terms in both the Anatomical Therapeutic Chemical (ATC) classification system and in the British National Formulary (BNF) coding system. The overall aim being to assign higher level well-known codes to allow grouping of the data. The resulting matched data can then be used in conjuction with the UKBB medication codes assigned at assessment time to generate evidence for use in both individual clinical phenotypes and in ranges of clincal phenotypes as found in Phenome-Wide association studies (PheWAS).

Description

Matching code is written in Python (2.7 was initially used for development, changes for Python 3 have now been made). Extensive use of Bash shell wrappers is made to supply context at run time - code and data directory locations, data file names and database access parameters.

The key features to note are;

The following subdirectories can be found in the repository:

Top-level scripts are prefixed '01', '02', '03' and '04' for each classification system and call several lower level bash scripts. The '03' scripts are for code assignment to data for individual medication reports in UKBB, the path and filename for UKBB phenotype data for a project must be supplied via the sourced 'common' environment parameter file.

Running

Once the environment has been set up (see the note on the .../env/common file above), four scripts are run, with parameters for either ATC or BNF code assignment, through to phenotype generation (0,1) for PheWAS.

Script '01' runs data preparation steps, note that raw BNF data is not supplied in this repository due to potential licensing requirements. At this point ChEMBL synonyms and, in the case of the ATC-based match, ATC codes and terms are extracted from the ChEMBL database

Script '02' runs matching scripts to 'merge' in ChEMBL synonym data and produce data on matched and unmatched UKBB medication codes.

Script '03' calls scripts to extract medication detail data from the main UKBB phenotype csv file (the name of this varies by project and should be assigned in the .../env/common file via an environment variable) and then to assign the ATC or BNF codes output in step 02.

Script '04' calls scripts to annotate and generate PheWAS phenotypes as binary 1 (CASE) or 0 (CONTROL)

For ATC-based matching run:

For BNF-based matching run:

Flow Summary, steps 01 and 02

Notice

BNF coding system data was obtained and is presented in accordance with the “Open Government Licence for Public Sector Information” (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/) and the NHS Business Services Authority (NHSBSA) “Terms and Conditions for Users” (https://www.nhsbsa.nhs.uk/our-policies/terms-and-conditions). Authors: PDA, ASFD and ERJ as users of this resource also make the following statement consistent with these terms and conditions: “NHSBSA BNF Classification Coding, NHSBSA Copyright 2019” This information is licenced under the terms of the Open Government Licence.