Map OMOP data to `drug_prescriptions` table

razekmh commented 1 month ago

Extract data from the example OMOP data to fill the drug_prescriptions table from the validate article. This is a split from #17.

Please feel free to assign yourself to the issue. Please use the respective branch for development.

zsenousy commented 1 month ago

An R script has been developed that maps OMOP drug_exposure to RAMSES drug_prescriptions. The Synthetic Patient Data in OMOP public dataset was used as the source.

The following steps provide instructions on accessing chunks from Drug_exposure and Concept tables required for the mapping process into RAMSES drug_prescriptions:

1- Load required libraries:

library(bigrquery)
library(DBI)
library(gargle)

2- Authenticate connection to BigQuery

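sa_key_path <- "path/to/your/key_file"

bq_auth(path = sa_key_path)

project_id <- "your_project_id"

# The public data lives in the "bigquery-public-data" project; queries are billed to your own project
con <- dbConnect(bigquery(), project = "bigquery-public-data", dataset = "cms_synthetic_patient_data_omop", billing = project_id)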

3- SQL queries for retrieving data from drug_exposure and concept tables:

sql_string1 <- "SELECT * FROM bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure LIMIT 10000"
sql_string2 <- "SELECT * FROM bigquery-public-data.cms_synthetic_patient_data_omop.concept WHERE domain_id IN ('Drug', 'Route', 'Unit')"
result1 <- dbGetQuery(con, sql_string1)
result2 <- dbGetQuery(con, sql_string2)

4- Save data

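# write.csv produces comma-separated output; row.names = FALSE avoids an extra index column
write.csv(result1, file = "path/to/your/drug_exposure_file.csv", row.names = FALSE)
write.csv(result2, file = "path/to/your/concept_file.csv", row.names = FALSE)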

Note: The concept table is filtered to rows whose domain_id is 'Drug', 'Route', or 'Unit'.
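
Since the exported files carry a .csv extension and are read back later in the thread, it can be worth confirming that an export round-trips with the same columns. Below is a minimal sketch using base R's write.csv (note that write.table defaults to a space separator and writes row names). The data frame `result_toy` and its values are made up for illustration and are not taken from the dataset.

```r
# Toy data frame standing in for a query result; all values are invented
result_toy <- data.frame(
  drug_exposure_id = c(1, 2),
  drug_source_value = c("abc 10 mg", "def, 5 mg"),
  stringsAsFactors = FALSE
)

# Write to a temporary file as comma-separated values without row names
tmp <- tempfile(fileext = ".csv")
write.csv(result_toy, file = tmp, row.names = FALSE)

# Read the file back and compare the column names with the original
round_trip <- read.csv(tmp, stringsAsFactors = FALSE)
print(identical(names(round_trip), names(result_toy)))
```

Values that contain commas are quoted on export, so they survive the round trip as a single field.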

zsenousy commented 1 month ago

The developed script performs the mapping and transformation of drug exposure data from the OMOP format into a RAMSES-compatible format. Let's go through the code block by block:

1. Loading Libraries

# Load necessary libraries
library(dplyr)
library(readr)
library(AMR)  # For drug name mapping

Purpose: This block loads the required libraries:
- dplyr: For data manipulation.
- readr: To read and write CSV files.
- AMR: For mapping drug names to standardised names using antimicrobial resistance data.

2. Loading Data

# Load data from CSV files
drug_exposure <- read_csv("/path/to/cleaned_drug_exposure.csv")
concept <- read_csv("path/to/cleaned_concept.csv")

Purpose: This block reads the cleaned data from CSV files into R. The data consists of two files:
- cleaned_drug_exposure.csv: Contains information about patient drug exposure events.
- cleaned_concept.csv: Contains concept information, including drug names, related to the drugs prescribed.

3. Print Column Names and Data Preview

# Print the column names to check if they are correct
print(colnames(concept))
# Print the cleaned data to verify everything looks correct
print(head(concept))

# Ensure the relevant columns are available in drug_exposure
print(colnames(drug_exposure))
print(head(drug_exposure))

Purpose: This block is used to verify that the data was loaded correctly and the columns match what is expected. It prints the column names and the first few rows of both the concept and drug_exposure datasets to ensure they are correct.

4. Joining Drug Data and Concept Table

#Mapping using left join
drug_exposure <- drug_exposure %>%
  left_join(concept, by = c("drug_concept_id" = "concept_id")) %>%
  rename(drug_name = concept_name) %>%  # Rename concept_name to drug_name
  mutate(route = NA)  # Since route_concept_id is NA.

Purpose: This block joins the drug_exposure table with the concept table based on matching drug_concept_id and concept_id. After joining, it renames the concept_name column to drug_name for clarity and initialises a route column with NA since the route_concept_id is missing in this dataset.
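
A left join keeps every drug_exposure row, so any drug_concept_id without a match in the filtered concept table ends up with an NA drug_name. The following is a minimal sketch of how one might list those unmatched IDs with dplyr::anti_join before mapping. The tables `drug_exposure_toy` and `concept_toy` and all their values are invented for illustration.

```r
library(dplyr)

# Toy stand-ins for the real tables; all values are invented for illustration
drug_exposure_toy <- tibble(
  drug_exposure_id = c(1, 2, 3),
  drug_concept_id  = c(101, 102, 999)
)
concept_toy <- tibble(
  concept_id   = c(101, 102),
  concept_name = c("amoxicillin 500 MG Oral Capsule", "ciprofloxacin 250 MG Oral Tablet")
)

# Rows of drug_exposure whose drug_concept_id has no matching concept_id
unmatched <- drug_exposure_toy %>%
  anti_join(concept_toy, by = c("drug_concept_id" = "concept_id"))

print(unmatched)
```

Running a check like this on the real tables would show whether NA drug names come from the concept filter or from the source data itself.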

5. Removing Unnecessary Columns

# Remove unnecessary columns from concept table
drug_exposure <- drug_exposure %>%
  select(-domain_id, -vocabulary_id, -concept_class_id, -standard_concept, 
         -concept_code, -valid_start_date, -valid_end_date, -invalid_reason)

Purpose: This block removes unnecessary columns from the merged data. These columns (e.g., domain_id, vocabulary_id) are metadata and are not needed for further analysis or mapping.

6. Mapping `dose_unit_source_value` as Units

# Map 'dose_unit_source_value' directly as units (since 'dose_unit_concept_id' is missing)
if ("dose_unit_source_value" %in% colnames(drug_exposure)) {
  drug_exposure <- drug_exposure %>%
    mutate(units = dose_unit_source_value)  # Map 'dose_unit_source_value' to units
} else {
  drug_exposure$units <- NA  # If 'dose_unit_source_value' is missing, set units as NA
}

Purpose: This block checks if the column dose_unit_source_value exists. If it does, it creates a units column based on its values. If the column is missing, it sets units to NA. This is useful for ensuring dose units are captured or handled appropriately.

7. Mapping OMOP Fields to RAMSES Format

# Map drug_exposure fields to RAMSES fields
omop_to_ramses <- drug_exposure %>%
  transmute(
    # Mapping OMOP person_id to RAMSES patient_id
    patient_id = person_id,

    # Mapping OMOP drug_exposure_id to RAMSES prescription_id
    prescription_id = drug_exposure_id,

    # Start and end dates of drug exposure
    prescription_start = drug_exposure_start_date,
    prescription_end = drug_exposure_end_date,

    # Mapping drug_name (resolved from drug_concept_id) to RAMSES tr_DESC using the AMR package;
    # names that AMR cannot resolve are labelled "Unknown drug"
    tr_DESC = ifelse(!is.na(AMR::ab_name(drug_name)), AMR::ab_name(drug_name), "Unknown drug"),

    # Route (e.g., IV, Oral) from concept table
    route = route,

    # Using 'quantity' as a proxy for dose if 'dose_value' is not available
    dose = quantity,

    # Units mapped from 'dose_unit_source_value'
    units = units,

    # Calculate duration between start and end dates in days
    duration_days = as.numeric(difftime(prescription_end, prescription_start, units = "days"))
  )

Purpose: This block transforms the drug_exposure table to a format compatible with the RAMSES model:
- patient_id: The person_id from OMOP is mapped to patient_id in RAMSES.
- prescription_id: drug_exposure_id from OMOP is mapped to prescription_id.
- prescription_start and prescription_end: Dates of drug exposure are mapped.
- tr_DESC: The drug_name column is mapped to a drug description using the AMR package. If AMR cannot resolve the name (including when the name is missing), it is labelled as "Unknown drug".
- route: Route information (initialised as NA) is included.
- dose: The quantity of drug exposure is used as the dose.
- units: The previously created units column is included.
- duration_days: The duration of drug exposure is calculated in days.
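
To make the date handling concrete, here is a small sketch of the same transmute pattern applied to a made-up two-row table. It covers only the identifier and date fields; the table `toy` and its values are assumptions for illustration.

```r
library(dplyr)

# Toy drug_exposure rows; all values are invented for illustration
toy <- tibble(
  person_id = c(1, 2),
  drug_exposure_id = c(10, 11),
  drug_exposure_start_date = as.Date(c("2020-01-03", "2020-02-01")),
  drug_exposure_end_date   = as.Date(c("2020-01-10", "2020-02-01"))
)

# Rename OMOP fields to RAMSES names and compute the duration in days
mapped_toy <- toy %>%
  transmute(
    patient_id = person_id,
    prescription_id = drug_exposure_id,
    prescription_start = drug_exposure_start_date,
    prescription_end = drug_exposure_end_date,
    duration_days = as.numeric(difftime(prescription_end, prescription_start, units = "days"))
  )

print(mapped_toy)
```

The first row yields a duration of 7 days and the second, which starts and ends on the same date, yields 0. Whether a same-day exposure should count as one day depends on how the end date is interpreted downstream, which this thread does not settle.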

8. Displaying Final Data

# Display the final mapped data from OMOP to RAMSES
print(omop_to_ramses)

9. Validation Function

# Validation function for checking mappings
validate_mapping <- function(df) {
  # A row counts as unmapped if tr_DESC is NA or carries the "Unknown drug" fallback label
  unmapped <- df %>% filter(is.na(tr_DESC) | tr_DESC == "Unknown drug")
  if (nrow(unmapped) == 0) {
    message("All drugs successfully mapped to RAMSES fields!")
  } else {
    message("Some drug mappings failed. Please check the following:")
    print(unmapped)
  }
}

# Run the validation function
validate_mapping(omop_to_ramses)

Purpose: This block defines a validation function that checks whether all drugs were successfully mapped to the tr_DESC field. Because unresolved names are recoded to "Unknown drug" in step 7, the check looks for that label as well as for NA values. If all drugs were mapped correctly, a success message is printed. If not, it prints the rows where drug mapping failed.
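
To quantify how many rows ended up unresolved, a small summary can complement the validation function. The sketch below counts rows whose tr_DESC is NA or carries the "Unknown drug" label that step 7 assigns to unresolved names. The table `mapped_toy` and its values are invented for illustration.

```r
library(dplyr)

# Toy mapped output; all values are invented for illustration
mapped_toy <- tibble(
  prescription_id = 1:4,
  tr_DESC = c("Amoxicillin", "Unknown drug", "Ciprofloxacin", "Unknown drug")
)

# Count unresolved rows and express them as a proportion of all rows
unknown_summary <- mapped_toy %>%
  summarise(
    n_total = n(),
    n_unknown = sum(is.na(tr_DESC) | tr_DESC == "Unknown drug"),
    prop_unknown = n_unknown / n_total
  )

print(unknown_summary)
```

On this toy input, two of the four rows are unresolved, giving a proportion of 0.5.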

10. Saving the Final Data

# save the final mapped data to a CSV file
write_csv(omop_to_ramses, "./path/to/mapped_drug_prescriptions.csv")

Summary

The code starts by loading necessary libraries and data, verifying the structure of the data, and then transforming it from OMOP format to RAMSES format using joins, renaming, and custom mappings. It also includes validation to ensure successful mapping, and finally, it saves the transformed data.

The key aspects of the transformation involve:

- Mapping drug concept IDs to drug names using the concept table.
- Mapping dose units and calculating drug exposure duration.
- Converting drug names to standard descriptions using the AMR package.
- Exporting the transformed data for use in RAMSES.

zsenousy commented 1 month ago

Issues

During the process of mapping OMOP drug exposure data to RAMSES, several issues were encountered that led to incomplete or missing mappings. Notably, some drugs could not be mapped to a standard name, resulting in entries being labelled as "Unknown drug" in the final dataset. This was primarily due to:

- Missing or incomplete drug_concept_id mappings.
- Dependency on the AMR package for drug name resolution, which may not cover all drugs in the dataset, especially those that are non-antimicrobial.
- Missing route_concept_id, dose_unit_concept_id, and other essential columns that could have provided more complete data for fields like route, dose, and units.
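
For an extract where route_concept_id and dose_unit_concept_id are present and populated, one possible approach is to resolve them through the concept table in the same way as the drug name. The sketch below is only an illustration under that assumption. The tables, IDs, and names are invented, and `route_lookup` and `unit_lookup` are hypothetical helper tables, not part of the script in this thread.

```r
library(dplyr)

# Toy tables; all IDs and names are invented for illustration
drug_exposure_toy <- tibble(
  drug_exposure_id = c(1, 2),
  route_concept_id = c(201, NA),
  dose_unit_concept_id = c(301, 302)
)
concept_toy <- tibble(
  concept_id = c(201, 301, 302),
  concept_name = c("Oral", "milligram", "milliliter"),
  domain_id = c("Route", "Unit", "Unit")
)

# Build one lookup per domain, renaming columns to match the join keys
route_lookup <- concept_toy %>%
  filter(domain_id == "Route") %>%
  select(route_concept_id = concept_id, route = concept_name)
unit_lookup <- concept_toy %>%
  filter(domain_id == "Unit") %>%
  select(dose_unit_concept_id = concept_id, units = concept_name)

# Left joins keep every exposure row; missing or unmatched IDs stay NA
resolved <- drug_exposure_toy %>%
  left_join(route_lookup, by = "route_concept_id") %>%
  left_join(unit_lookup, by = "dose_unit_concept_id")

print(resolved)
```

Rows with a missing route_concept_id keep an NA route, so the result degrades gracefully when only some records carry these fields.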

razekmh commented 1 month ago

Well done @zsenousy. This is great work. Would it be okay to push your code to the branch and resolve this issue? I think #27 could use a lot of the functions you built for this.

zsenousy commented 1 month ago

A pull request has been created for this code addition.
