Jared loaded all Belgian place data into Nominatim, has workflow in place, will shift to one of target cities (likely start with Boston)
Geocoding needs to take place in a pre-step - will likely need to be its own containerized process
TODO: Brief written description on how the Tufts Syntegra database will be divided into parts to represent separate EHRs
Update from Jared: Geocoded a subset of Boston addresses using Nominatim; the process is scripted and references Docker images that either download from OpenStreetMap or make direct calls to the Nominatim API (for downloading the OSM files, not for geocoding the addresses)
Geocoding will be added to the containerization with option to use DeGauss in US or Nominatim for outside of the US
Hitting issues with the load_variable step; we will chat outside of meeting to resolve
TODO a high level explainer on how to produce fake geocoded locations to link to real PM2.5 data
load_exposure no longer working @kzollove
Resolved
Need PM2.5 dataset in gaia-db @kzollove
Send Jared a list of city names and a single coordinate for each city represented in the SEDAC dataset @tibbben
Jared created locations for the cities that Tim sent. Using the Tufts Syntegra dataset to create mock EHRs
Jared will add synthetic LOCATION / LOCATION_HISTORY with links back to Syntegra to the GIS repo
The purpose of the GIS-specific synthetic data is to provide the foundation for an end-to-end demonstration that combines electronic health record (EHR) data in OMOP-CDM format with (1) datasets containing geospatial variables (e.g. average levels of pollutants in a region), and (2) terminology to capture those variables in an OMOP-CDM format.
This particular dataset is an augmented subset (~7000 individuals) of the full (~500k individuals) Tufts Synthetic Dataset, with a particular focus on individuals who have Chronic Obstructive Pulmonary Disease (COPD). We conducted the augmentation using the gaiaCore toolchain developed by the GIS workgroup, combined with a location assignment approach that distributed a specified ratio of COPD to non-COPD synthetic individuals across a subset of global cities with wide-ranging PM2.5 values. Once individuals were assigned, we derived an EXPOSURE_OCCURRENCE table using pre-assigned concept_id values in the 2-billion-plus (2B+) range, which we were able to reference later in the downstream analytics workflow. The details of this augmentation process, including code and references, are described at length in the sections below.
The Tufts Synthetic Dataset is a set of completely synthetic electronic health record (EHR) data for approximately 500,000 fake patients. It was produced in 2021 through a collaboration between Syntegra, Inc. and Tufts Medical Center. A novel deep learning transformer model (the kind of model used by modern LLMs) developed by Syntegra was used to generate synthetic clinical data, including data on visits, conditions, drugs, measurements, procedures, observations, and device exposures. The model was trained on a version of the Tufts Research Data Warehouse (TRDW) that contained longitudinal EHR data on patients who received care at Tufts Medical Center. Both the TRDW training data and the synthetic data conform to version 5.3 of the OMOP common data model. Note that for the purposes of this GIS demonstration, we converted the data to OMOP version 5.4.
An expert determination of the HIPAA compliance of the dataset was conducted by Mirador Analytics Ltd. in 2022 (Report No SYN222P1a). It confirmed that the data are safe to share and to use without posing a risk to patient privacy. Analyses by Syntegra and Tufts Medical Center researchers confirmed the statistically realistic properties of the data through comparisons of descriptive statistics, treatment pathways and prediction models on the synthetic and real data. The data also contain realistic data quality errors. This realism makes the dataset a useful asset in training researchers to work with OMOP-shaped data in all phases of observational research from data quality assessment through analysis using the tools and practices of the Observational Health Data Science and Informatics (OHDSI) community and other OMOP-using communities. Its realism and format also make it useful for testing software designed to work with OMOP-shaped data, and for preparing study packages that use OHDSI tools to define and conduct full observational research studies.
The Annual PM2.5 Concentrations for Countries and Urban Areas, 1998-2016, consists of mean concentrations of particulate matter (PM2.5) for countries and urban areas (see manual for more details). The PM2.5 data are from the Global Annual PM2.5 Grids from MODIS, MISR and SeaWiFS Aerosol Optical Depth (AOD) with GWR, 1998-2016. The urban areas are from the Global Rural-Urban Mapping Project, Version 1 (GRUMPv1): Urban Extent Polygons, Revision 02, and its time series runs from 1998 to 2016. The country averages are population-weighted such that concentrations in populated areas count more toward the country average than concentrations in less populated areas, and its time series runs from 2008 to 2015.
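As a rough illustration of what "population-weighted" means in the description above, the sketch below computes a weighted mean over a handful of areas. It is not the dataset's actual processing code; the function name and the example inputs are hypothetical.

```python
def population_weighted_mean(concentrations, populations):
    """Average concentrations, weighting each area by its population.

    Hypothetical helper for illustration only; the PM2.5 dataset's own
    processing pipeline is not reproduced here.
    """
    if len(concentrations) != len(populations):
        raise ValueError("inputs must have the same length")
    total_population = sum(populations)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    weighted_sum = sum(c * p for c, p in zip(concentrations, populations))
    return weighted_sum / total_population


# A populous area at 10 ug/m3 and a sparse area at 50 ug/m3:
# the weighted mean sits much closer to the populous area's value.
print(population_weighted_mean([10.0, 50.0], [900, 100]))  # 14.0
```

With equal weights the same two areas would average 30 ug/m3, which shows how strongly the populated area dominates the weighted figure.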
While the analytic approach and associated results are described in more detail elsewhere, the general motivation for creating this dataset was to be able to support a patient-level prediction (PLP) model to predict the risk of COPD for a particular individual given pollutant levels in their city of residence together with their EHR data.
We first converted the Tufts Synthetic Data to OMOP version 5.4 using a set of SQL scripts against a Databricks instance. In the same schema, we inserted the GIS terminology into the standard OMOP vocabulary tables, and then filled the LOCATION, LOCATION_HISTORY, and EXPOSURE_OCCURRENCE tables as described in the subsections below.
In order to capture those patients fitting our desired COPD phenotype, we executed a simple cohort creation query that referenced a broad COPD concept (255573) and all of its descendants:
CREATE TABLE copd_cohort AS (
  WITH copd_desc AS (
    -- Descendants of the broad COPD concept (255573)
    SELECT descendant_concept_id AS cid
    FROM concept_ancestor
    WHERE ancestor_concept_id = 255573
  ),
  copd_concepts AS (
    -- COPD concepts that actually occur in CONDITION_OCCURRENCE
    SELECT cn.concept_id, cn.concept_name
    FROM condition_occurrence co
    INNER JOIN copd_desc cd
      ON co.condition_concept_id = cd.cid
    INNER JOIN concept cn
      ON cd.cid = cn.concept_id
    GROUP BY cn.concept_id, cn.concept_name
  ),
  copd_patients AS (
    -- Distinct persons with at least one COPD condition record
    SELECT DISTINCT co.person_id AS person_id
    FROM condition_occurrence co
    INNER JOIN copd_concepts cc
      ON cc.concept_id = co.condition_concept_id
  )
  SELECT p.*
  FROM copd_patients cp
  INNER JOIN person p
    ON p.person_id = cp.person_id
);
Note that we also integrated the GIS-specific synthetic data into an Atlas instance, and created/applied the cohort definition there as well.
We referred to two recent works that describe (1) the general, global prevalence of COPD as of 2023, and (2) the relationship between the prevalence of COPD and local concentrations of PM2.5:
From these two papers, we derived a very crude relationship between Odds Ratios (OR) of COPD versus concentration of PM2.5:
We then selected 20 cities evenly distributed along this concentration range using the medians of their 18 year annual concentration data in the PM2.5 dataset. With these 20 cities, we used the estimated OR relationship above to calculate a crude distribution of cases versus non-cases such that the total individuals with COPD in the Tufts Synthetic Dataset were all included. Note we pulled these cities directly from the PM2.5 data, and in that dataset they had already been assigned latitude and longitude point values; we carried those values through the rest of the processes instead of needing to geocode based on the city name.
CITY | COUNTRY | LATITUDE | LONGITUDE | MEDIAN PM2.5 (ug/m3) | CASE | NOT CASE | OR |
---|---|---|---|---|---|---|---|
BULANDSHAHR | INDIA | 28.40449524 | 77.85832214 | 95.94 | 121 | 233 | 4.67 |
WANGDU | CHINA | 38.71282768 | 115.1666565 | 85.59 | 115 | 239 | 4.33 |
JINGHAI | CHINA | 38.93782806 | 116.9374886 | 74.41 | 109 | 244 | 4.02 |
SARSAWAN | INDIA | 30.00217158 | 77.34992132 | 63.93 | 100 | 253 | 3.56 |
DOKKHAMTAI | THAILAND | 19.17445183 | 99.96205373 | 55.26 | 88 | 265 | 2.99 |
LAHORE | PAKISTAN | 31.49387074 | 74.35156631 | 47.86 | 77 | 277 | 2.50 |
KINSHASA | CONGO DR | -4.397176266 | 15.33447051 | 35.82 | 53 | 300 | 1.59 |
JOHANNESBURG | SOUTH AFRICA | -26.17050266 | 28.0999918 | 33.24 | 50 | 303 | 1.49 |
KATOWICE | POLAND | 50.22116089 | 18.97915935 | 28.02 | 44 | 309 | 1.28 |
PAVIA | ITALY | 45.20035744 | 9.183955636 | 25.69 | 41 | 312 | 1.18 |
HUARMEY | PERU | -10.07883644 | -78.14238828 | 20.18 | 38 | 315 | 1.09 |
SCHOUWENDUIVELAND | NETHERLANDS | 51.64616013 | 3.924992681 | 17.24 | 38 | 315 | 1.09 |
FRESNO | USA | 36.69616127 | -119.6958351 | 15.47 | 38 | 315 | 1.09 |
BUFFALO | USA | 42.89597574 | -78.67471096 | 10.51 | 35 | 318 | 0.99 |
LAPLAYOSA | ARGENTINA | -32.09550285 | -63.03333855 | 4.98 | 35 | 318 | 0.99 |
PERTH | AUSTRALIA | -32.03704071 | 115.975156 | 2.14 | 32 | 321 | 0.90 |
FORKS | USA | 47.94616127 | -124.3791656 | 2.1 | 32 | 321 | 0.90 |
SITKA | USA | 57.07949448 | -135.3333359 | 0.7 | 29 | 324 | 0.81 |
ALEXANDRA | NEW ZEALAND | -45.24550247 | 169.4166565 | 0.6 | 29 | 324 | 0.81 |
SUVA | FIJI | -18.07050228 | 178.4999847 | 0.18 | 27 | 327 | 0.74 |
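The CASE and NOT CASE columns above can be related to the OR column with a few lines of arithmetic. The sketch below is one way to do so, assuming a baseline COPD prevalence of 10% at OR = 1; that baseline is our assumption, chosen because it reproduces the tabulated counts, and is not stated in the text. The function name is hypothetical.

```python
def split_cases(odds_ratio, city_total, baseline_prevalence=0.10):
    """Return (cases, non_cases) for one city.

    baseline_prevalence is an ASSUMPTION (10% at OR = 1), not a value
    given in the source text.
    """
    baseline_odds = baseline_prevalence / (1.0 - baseline_prevalence)
    city_odds = odds_ratio * baseline_odds          # scale odds by the OR
    case_fraction = city_odds / (1.0 + city_odds)   # convert odds to a proportion
    cases = round(city_total * case_fraction)
    return cases, city_total - cases


print(split_cases(4.67, 354))  # (121, 233), the BULANDSHAHR row
print(split_cases(0.99, 353))  # (35, 318), the BUFFALO row
```

Because the OR values in the table are rounded to two decimals, agreement with the tabulated counts should be read as a consistency check rather than as a reconstruction of the original calculation.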
After defining the case/non-case distribution numbers across the 20 locations, we set out to create a non-COPD subsample from the Tufts Synthetic Dataset with an age and gender distribution aligned with that of the existing COPD individuals.
We calculated a rough age distribution of the entire COPD cohort, as well as a gender split, and used these values within a set of nested subqueries to capture a crudely representative cohort without COPD. We then placed these ~6500 non-COPD individuals together with the ~1100 COPD individuals, and then randomly assigned them to the different locations based on the ratios derived above. We've included an auto-generated SQL script to randomly select patients into a LOCATION_ASSIGNMENT table, which serves as a precursor to the LOCATION_HISTORY table below.
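The random assignment step described above could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the auto-generated SQL script itself: the function name, the `quotas` structure (location_id mapped to a case count and a non-case count), and the fixed seed are all hypothetical.

```python
import random


def assign_locations(case_ids, control_ids, quotas, seed=0):
    """Randomly assign persons to locations according to per-location quotas.

    quotas: {location_id: (n_cases, n_controls)}  (hypothetical structure)
    Returns a list of (person_id, location_id) pairs, analogous to the rows
    of a LOCATION_ASSIGNMENT table.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is repeatable
    cases, controls = list(case_ids), list(control_ids)
    rng.shuffle(cases)
    rng.shuffle(controls)

    needed_cases = sum(n for n, _ in quotas.values())
    needed_controls = sum(n for _, n in quotas.values())
    if needed_cases > len(cases) or needed_controls > len(controls):
        raise ValueError("quotas exceed the number of available persons")

    assignments = []
    case_pos = control_pos = 0
    for location_id, (n_cases, n_controls) in quotas.items():
        # Take the next slice of shuffled cases, then of shuffled controls
        for person_id in cases[case_pos:case_pos + n_cases]:
            assignments.append((person_id, location_id))
        for person_id in controls[control_pos:control_pos + n_controls]:
            assignments.append((person_id, location_id))
        case_pos += n_cases
        control_pos += n_controls
    return assignments


# Toy example: two locations, three cases, seven non-cases
rows = assign_locations(range(100, 103), range(200, 207), {1: (2, 3), 2: (1, 4)})
print(len(rows))  # 10
```

Shuffling once and slicing guarantees that each person is assigned to exactly one location, which matches the single-residence simplification used in this demo.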
For the purposes of this demo, we made some general assumptions to simplify the creation of synthetic data and interpretation of downstream analyses:
After creating the LOCATION_ASSIGNMENT table, we populated the LOCATION_HISTORY table with the following query:
CREATE OR REPLACE TABLE location_history AS (
  SELECT l.location_id,
         32848 AS relationship_type_concept_id,
         1147314 AS domain_id,
         la.person_id AS entity_id,
         -- Residence starts in 1998 (first PM2.5 year) or at birth, if later
         CASE
           WHEN p.year_of_birth < 1998 THEN CAST('1998-01-01' AS DATE)
           ELSE CAST(p.birth_datetime AS DATE)
         END AS start_date,
         CAST('2016-12-31' AS DATE) AS end_date
  FROM location l
  INNER JOIN location_assignment la
    ON l.location_id = la.location_id
  INNER JOIN person p
    ON la.person_id = p.person_id
  -- Exclude persons born after the PM2.5 series ends
  WHERE p.year_of_birth < 2017
);
Note that the DDL and description of LOCATION_HISTORY can be found in the GIS Documentation.
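The date logic in the LOCATION_HISTORY query can be restated as a small Python function, which may make the rule easier to check against individual patients. The function and constant names are hypothetical; the dates mirror those in the query.

```python
from datetime import date

PM25_SERIES_START = date(1998, 1, 1)   # first day covered by the PM2.5 data
PM25_SERIES_END = date(2016, 12, 31)   # last day covered by the PM2.5 data


def residence_window(birth_date):
    """Return (start_date, end_date) for a person's single residence period.

    Mirrors the CASE expression and WHERE clause of the query: persons born
    before 1998 start at the beginning of the series, later births start at
    their birth date, and persons born in 2017 or later get no row (None).
    """
    if birth_date.year >= 2017:
        return None
    start = PM25_SERIES_START if birth_date.year < 1998 else birth_date
    return start, PM25_SERIES_END


print(residence_window(date(1950, 5, 1)))  # starts 1998-01-01
print(residence_window(date(2005, 3, 2)))  # starts at the birth date
```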
We took two approaches to populate the EXPOSURE_OCCURRENCE table:
The gaiaCore containerized workflow is described in detail [elsewhere]( ); briefly, we added the PM2.5 dataset as a data source in GaiaDB, and then converted its contents to the geom/attr representations expected by the gaiaCore package. We've also copied the query for populating the EXPOSURE_OCCURRENCE table directly below:
INSERT INTO exposure_occurrence (
  SELECT
    ROW_NUMBER() OVER (ORDER BY l.entity_id) AS exposure_occurrence_id
    , l.location_id
    , l.entity_id AS person_id
    , 2052499839 AS exposure_concept_id
    , TO_DATE(CONCAT(p.year, '-01-01'), 'yyyy-mm-dd') AS exposure_start_date
    , TO_DATE(CONCAT(p.year, '-01-01'), 'yyyy-mm-dd')::timestamp AS exposure_start_datetime
    , TO_DATE(CONCAT(p.year + 1, '-01-01'), 'yyyy-mm-dd') AS exposure_end_date
    , TO_DATE(CONCAT(p.year + 1, '-01-01'), 'yyyy-mm-dd')::timestamp AS exposure_end_datetime
    , 2052499878 AS exposure_type_concept_id
    , 2052496943 AS exposure_relationship_concept_id
    , 2052499839 AS exposure_source_concept_id
    , p.PM_VALUE AS exposure_source_value
    , 'WITHIN' AS exposure_relationship_source_value
    , 'ug/m3' AS dose_unit_source_value
    , 1 AS quantity
    , CAST(NULL AS VARCHAR(50)) AS modifier_source_value
    , 4172703 AS operator_concept_id
    , p.PM_VALUE AS value_as_number
    , CAST(NULL AS INTEGER) AS value_as_concept_id
    , 32964 AS unit_concept_id
  FROM pm25_limited p
  INNER JOIN location_history l
    ON p.id = l.location_id
);
Note that we converted the 20-city subset of the PM2.5 data to a single, long-format table that is referenced in the query above.
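The wide-to-long conversion mentioned above might look roughly like the sketch below. The output keys (`id`, `year`, `pm_value`) follow the columns referenced in the query; the wide input layout, with one key per year, is an assumption about how the source file is shaped, and the function name is hypothetical.

```python
def wide_to_long(rows, first_year=1998, last_year=2016):
    """Reshape one-row-per-city records into one row per city and year.

    ASSUMPTION: each input row is a dict with an "id" key plus one key per
    year (e.g. "1998") holding that year's PM2.5 value. Missing or None
    values are skipped.
    """
    long_rows = []
    for row in rows:
        for year in range(first_year, last_year + 1):
            value = row.get(str(year))
            if value is None:
                continue
            long_rows.append(
                {"id": row["id"], "year": year, "pm_value": float(value)}
            )
    return long_rows


# Toy example with two years of data for one hypothetical city id
print(wide_to_long([{"id": 7, "1998": "12.5", "1999": 11}], 1998, 1999))
```

A long table of this shape yields one exposure row per person and year when joined to LOCATION_HISTORY on the location identifier.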
With the addition of the EXPOSURE_OCCURRENCE table, the GIS-specific synthetic dataset in OMOP v5.4 format was integrated together with the GIS extension tables (LOCATION_HISTORY and EXPOSURE_OCCURRENCE) and a global GIS dataset related to urban pollution levels.
If you would like to access and download the GIS-specific synthetic COPD dataset, please contact Jared Houghtaling for more information about the associated data use agreement (DUA)!
Items to be completed in support of Synthetic Data generation in advance of the OHDSI Symposium
Only using locations at the city level; reference global PM2.5 data 1998-2016; synthetic patients take city locations
In workshop: brief description of the synthetic dataset, minimal detail on location generation (keyword minimal)
TODO: