kzollove commented 3 months ago

Items to be completed in support of Synthetic Data generation in advance of the OHDSI Symposium

Consider using synthetic data with generated addresses in 3 locations (domestic & international)
- Consider Boston, São Paulo, China, and a European city

Only using locations at the city level; reference global PM2.5 data 1998-2016; synthetic patients take city locations

In workshop: brief description of the synthetic dataset, minimal detail on location generation (keyword minimal)

TODO:

kzollove commented 2 months ago

Jared loaded all Belgian place data into Nominatim, has workflow in place, will shift to one of target cities (likely start with Boston)

Geocoding needs to take place in a pre-step - will likely need to be its own containerized process

TODO: Brief written description on how the Tufts Syntegra database will be divided into parts to represent separate EHRs

kzollove commented 2 months ago

Update from Jared: Geocoded a subset of Boston addresses using Nominatim. The process is scripted and references Docker images that download from OpenStreetMap or make direct calls to the Nominatim API (for downloading the OSM files, not for geocoding the addresses)

Geocoding will be added to the containerization with option to use DeGauss in US or Nominatim for outside of the US

Hitting issues with the load_variable step; we will chat outside of meeting to resolve

kzollove commented 2 months ago

TODO: a high-level explainer on how to produce fake geocoded locations to link to real PM2.5 data

kzollove commented 2 months ago

load_exposure no longer working @kzollove

kzollove commented 1 month ago

load_exposure no longer working @kzollove

Resolved

kzollove commented 1 month ago

Need PM2.5 dataset in gaia-db @kzollove

kzollove commented 1 month ago

Send Jared a list of city names and a single coordinate for each city represented in the SEDAC dataset @tibbben

kzollove commented 1 month ago

Jared created locations for the cities that Tim sent. Using the Tufts Syntegra dataset to create mock EHRs

Jared will add synthetic LOCATION / LOCATION_HISTORY with links back to Syntegra to the GIS repo

jshoughtaling commented 1 month ago

GIS-Specific Synthetic Dataset

Description

The purpose of the GIS-specific synthetic data is to provide the foundation for an end-to-end demonstration that combines electronic health record (EHR) data in OMOP-CDM format with (1) datasets containing geospatial variables (e.g. average levels of pollutants in a region), and (2) terminology to capture those variables in an OMOP-CDM format.

This particular dataset is an augmented subset (~7000 individuals) of the full (~500k individuals) Tufts Synthetic Dataset, with a particular focus on individuals who have Chronic Obstructive Pulmonary Disease (COPD). We conducted the augmentation using the gaiaCore toolchain developed by the GIS workgroup, combined with a location assignment approach that distributed a specified ratio of COPD vs non-COPD synthetic individuals to a subset of global cities with wide-ranging values of PM_2.5. Once individuals were assigned, we derived an EXPOSURE_OCCURRENCE table using pre-assigned 2B+ concept_id values that we could reference later in the downstream analytics workflow. The details of this augmentation process, including code and references, are described at length in the sections below.

Tufts Synthetic Data

The Tufts Synthetic Dataset is a set of completely synthetic electronic health record (EHR) data for approximately 500,000 fake patients. It was produced in 2021 through a collaboration between Syntegra, Inc. and Tufts Medical Center. A novel deep learning transformer model (the kind of model used by modern LLMs) developed by Syntegra was used to generate synthetic clinical data including data on visits, conditions, drugs, measurements, procedures, observation and device exposures. The model was trained on a version of the Tufts Research Data Warehouse (TRDW) that contained longitudinal EHR data on patients who received care at Tufts Medical Center. Both the TRDW training data and the synthetic data conform to version 5.3 of the OMOP common data model. Note that for the purposes of this GIS demonstration, we converted the data to OMOP version 5.4.

An expert determination of the HIPAA compliance of the dataset was conducted by Mirador Analytics Ltd. in 2022 (Report No SYN222P1a). It confirmed that the data are safe to share and to use without posing a risk to patient privacy. Analyses by Syntegra and Tufts Medical Center researchers confirmed the statistically realistic properties of the data through comparisons of descriptive statistics, treatment pathways and prediction models on the synthetic and real data. The data also contain realistic data quality errors. This realism makes the dataset a useful asset in training researchers to work with OMOP-shaped data in all phases of observational research from data quality assessment through analysis using the tools and practices of the Observational Health Data Science and Informatics (OHDSI) community and other OMOP-using communities. Its realism and format also make it useful for testing software designed to work with OMOP-shaped data, and for preparing study packages that use OHDSI tools to define and conduct full observational research studies.

PM_2.5 Dataset

The Annual PM_2.5 Concentrations for Countries and Urban Areas, 1998-2016, consists of mean concentrations of particulate matter (PM_2.5) for countries and urban areas (see manual for more details). The PM_2.5 data are from the Global Annual PM2.5 Grids from MODIS, MISR and SeaWiFS Aerosol Optical Depth (AOD) with GWR, 1998-2016. The urban areas are from the Global Rural-Urban Mapping Project, Version 1 (GRUMPv1): Urban Extent Polygons, Revision 02, and its time series runs from 1998 to 2016. The country averages are population-weighted such that concentrations in populated areas count more toward the country average than concentrations in less populated areas, and its time series runs from 2008 to 2015.

Analytic Use Case

While the analytic approach and associated results are described in more detail elsewhere, the general motivation for creating this dataset was to be able to support a patient-level prediction (PLP) model to predict the risk of COPD for a particular individual given pollutant levels in their city of residence together with their EHR data.

Data Processing

We first converted the Tufts Synthetic Data to OMOP version 5.4 using a set of SQL scripts against a Databricks instance. In the same schema, we inserted the GIS terminology into the standard OMOP vocabulary tables, and then filled the LOCATION, LOCATION_HISTORY, and EXPOSURE_OCCURRENCE tables as described in the subsections below.

COPD Cohort Definition

In order to capture those patients fitting our desired COPD phenotype, we executed a simple cohort creation query that referenced a broad COPD concept (255573) and all of its descendants:

CREATE TABLE copd_cohort AS (
WITH copd_desc AS (
    SELECT descendant_concept_id AS cid FROM concept_ancestor WHERE ancestor_concept_id = 255573),
  copd_concepts AS (
    SELECT concept_id, concept_name FROM condition_occurrence co
    INNER JOIN copd_desc cd
    ON co.condition_concept_id = cd.cid
    INNER JOIN concept cn 
    ON cd.cid = cn.concept_id
    GROUP BY concept_id, concept_name),
  copd_patients AS (
    SELECT DISTINCT person_id AS person_id FROM condition_occurrence co
    INNER JOIN copd_concepts cc
    ON cc.concept_id = co.condition_concept_id
  )
  SELECT p.*  FROM copd_patients cp
  INNER JOIN person p
  ON p.person_id = cp.person_id
);

Note that we also integrated the GIS-specific synthetic data into an Atlas instance, and created/applied the cohort definition there as well.
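
To make the descendant lookup concrete, here is a minimal sketch that runs a simplified version of the cohort query against toy tables in an in-memory SQLite database. Only the concept ID 255573 comes from the text; the child concept 900001, the person rows, and the condition rows are made up for illustration, and the query is a condensed variant rather than the exact statement we ran on Databricks.

```python
import sqlite3

# Toy OMOP-like tables; every row except concept 255573 is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER);
-- concept_ancestor also lists each concept as its own descendant
INSERT INTO concept_ancestor VALUES (255573, 255573), (255573, 900001);
INSERT INTO person VALUES (1, 1950), (2, 1960), (3, 1970);
INSERT INTO condition_occurrence VALUES (1, 255573), (1, 900001), (2, 900001), (3, 123456);
""")

# A person is in the cohort if any of their conditions is 255573 or a descendant.
copd_person_ids = [
    row[0]
    for row in conn.execute("""
        SELECT DISTINCT p.person_id
        FROM person p
        INNER JOIN condition_occurrence co ON co.person_id = p.person_id
        INNER JOIN concept_ancestor ca ON ca.descendant_concept_id = co.condition_concept_id
        WHERE ca.ancestor_concept_id = 255573
        ORDER BY p.person_id
    """)
]
print(copd_person_ids)
```

Person 3 only has an unrelated condition and is excluded; person 1 appears once even though two of their conditions match.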

Location Assignment

We referred to two recent works that describe (1) the general, global prevalence of COPD as of 2023, and (2) the relationship between the prevalence of COPD and local concentrations of PM_2.5:

From these two papers, we derived a very crude relationship between Odds Ratios (OR) of COPD versus concentration of PM_2.5:

[Figure: Estimated_OR (estimated COPD odds ratio versus PM_2.5 concentration)]

We then selected 20 cities evenly distributed along this concentration range using the medians of their 18-year annual concentration data in the PM_2.5 dataset. With these 20 cities, we used the estimated OR relationship above to calculate a crude distribution of cases versus non-cases such that all individuals with COPD in the Tufts Synthetic Dataset were included. Note that we pulled these cities directly from the PM_2.5 data, where they had already been assigned latitude and longitude point values; we carried those values through the rest of the process instead of geocoding based on the city name.

CITY	COUNTRY	LATITUDE	LONGITUDE	MEDIAN PM2.5	CASE	NOT CASE	OR
BULANDSHAHR	INDIA	28.40449524	77.85832214	95.94	121	233	4.67
WANGDU	CHINA	38.71282768	115.1666565	85.59	115	239	4.33
JINGHAI	CHINA	38.93782806	116.9374886	74.41	109	244	4.02
SARSAWAN	INDIA	30.00217158	77.34992132	63.93	100	253	3.56
DOKKHAMTAI	THAILAND	19.17445183	99.96205373	55.26	88	265	2.99
LAHORE	PAKISTAN	31.49387074	74.35156631	47.86	77	277	2.50
KINSHASA	CONGO DR	-4.397176266	15.33447051	35.82	53	300	1.59
JOHANNESBURG	SOUTH AFRICA	-26.17050266	28.0999918	33.24	50	303	1.49
KATOWICE	POLAND	50.22116089	18.97915935	28.02	44	309	1.28
PAVIA	ITALY	45.20035744	9.183955636	25.69	41	312	1.18
HUARMEY	PERU	-10.07883644	-78.14238828	20.18	38	315	1.09
SCHOUWENDUIVELAND	NETHERLANDS	51.64616013	3.924992681	17.24	38	315	1.09
FRESNO	USA	36.69616127	-119.6958351	15.47	38	315	1.09
BUFFALO	USA	42.89597574	-78.67471096	10.51	35	318	0.99
LAPLAYOSA	ARGENTINA	-32.09550285	-63.03333855	4.98	35	318	0.99
PERTH	AUSTRALIA	-32.03704071	115.975156	2.14	32	321	0.90
FORKS	USA	47.94616127	-124.3791656	2.1	32	321	0.90
SITKA	USA	57.07949448	-135.3333359	0.7	29	324	0.81
ALEXANDRA	NEW ZEALAND	-45.24550247	169.4166565	0.6	29	324	0.81
SUVA	FIJI	-18.07050228	178.4999847	0.18	27	327	0.74
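
One way to arrive at a split like the one in the table is sketched below. This is an illustration under stated assumptions, not the exact calculation we used: it assumes each city's case odds equal its OR times a shared baseline odds, and it finds that baseline by bisection so that the expected cases sum to the total number of COPD individuals. The function name `fit_case_counts` is hypothetical; the ORs, city sizes (CASE plus NOT CASE), and the total of 1131 cases are read off the table.

```python
def fit_case_counts(odds_ratios, city_sizes, total_cases):
    """Split total_cases across cities according to their odds ratios.

    Assumes odds_i = OR_i * b for a shared baseline odds b, so the case
    probability in city i is p_i = OR_i * b / (1 + OR_i * b). The value of
    b is found by bisection so that expected cases sum to total_cases.
    """
    if not 0 < total_cases < sum(city_sizes):
        raise ValueError("total_cases must be between 0 and the total population")

    def expected_cases(b):
        return [n * (o * b) / (1 + o * b) for o, n in zip(odds_ratios, city_sizes)]

    lo, hi = 0.0, 1.0
    while sum(expected_cases(hi)) < total_cases:  # widen the bracket if needed
        hi *= 2
    for _ in range(100):  # bisection on the baseline odds
        mid = (lo + hi) / 2
        if sum(expected_cases(mid)) < total_cases:
            lo = mid
        else:
            hi = mid
    baseline = (lo + hi) / 2
    return baseline, [round(x) for x in expected_cases(baseline)]


# Values taken from the table above, in row order.
ORS = [4.67, 4.33, 4.02, 3.56, 2.99, 2.50, 1.59, 1.49, 1.28, 1.18,
       1.09, 1.09, 1.09, 0.99, 0.99, 0.90, 0.90, 0.81, 0.81, 0.74]
SIZES = [354, 354, 353, 353, 353, 354] + [353] * 13 + [354]

baseline, counts = fit_case_counts(ORS, SIZES, total_cases=1131)
print(round(baseline, 4), counts)
```

Because each count is rounded independently, the per-city values can differ from the table by one individual, and the rounded counts may not sum exactly to the target.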

Case Sampling

After defining the case/non-case counts across the 20 locations, we created a non-COPD subsample from the Tufts Synthetic Dataset with an age and gender distribution aligned with that of the existing COPD individuals.

We calculated a rough age distribution of the entire COPD cohort, as well as a gender split, and used these values within a set of nested subqueries to capture a crudely representative cohort without COPD. We then placed these ~6500 non-COPD individuals together with the ~1100 COPD individuals, and then randomly assigned them to the different locations based on the ratios derived above. We've included an auto-generated SQL script to randomly select patients into a LOCATION_ASSIGNMENT table, which serves as a precursor to the LOCATION_HISTORY table below.
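
The two steps described above (matching non-COPD individuals to the age/gender mix of the COPD cohort, then randomly assigning everyone to locations by quota) can be sketched as follows. This is a simplified Python illustration, not the auto-generated SQL script; the function names, the age bands, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_matched_controls(cases, candidates, n_controls, seed=0):
    """Draw controls whose (age_band, gender) mix follows that of the cases.

    cases and candidates are lists of (person_id, age_band, gender) tuples.
    """
    rng = random.Random(seed)
    strata = Counter((band, gender) for _, band, gender in cases)
    pool = defaultdict(list)
    for person_id, band, gender in candidates:
        pool[(band, gender)].append(person_id)
    controls = []
    for stratum, n_in_stratum in strata.items():
        target = round(n_controls * n_in_stratum / len(cases))
        available = pool[stratum]
        # Take fewer than the target if the stratum runs out of candidates.
        controls.extend(rng.sample(available, min(target, len(available))))
    return controls


def assign_to_locations(case_ids, control_ids, quotas, seed=0):
    """quotas maps location_id -> (n_cases, n_controls); returns person_id -> location_id."""
    rng = random.Random(seed)
    cases, controls = list(case_ids), list(control_ids)
    rng.shuffle(cases)
    rng.shuffle(controls)
    assignment = {}
    for location_id, (n_cases, n_controls) in quotas.items():
        if n_cases > len(cases) or n_controls > len(controls):
            raise ValueError(f"not enough individuals left for location {location_id}")
        for _ in range(n_cases):
            assignment[cases.pop()] = location_id
        for _ in range(n_controls):
            assignment[controls.pop()] = location_id
    return assignment


# Toy data: four cases, thirty candidate controls, two locations.
toy_cases = [(1, "60-69", "F"), (2, "60-69", "F"), (3, "70-79", "M"), (4, "70-79", "M")]
toy_candidates = (
    [(100 + i, "60-69", "F") for i in range(10)]
    + [(200 + i, "70-79", "M") for i in range(10)]
    + [(300 + i, "20-29", "M") for i in range(10)]
)
toy_controls = sample_matched_controls(toy_cases, toy_candidates, n_controls=8)
toy_assignment = assign_to_locations(
    [pid for pid, _, _ in toy_cases], toy_controls, {"A": (3, 2), "B": (1, 6)}
)
print(Counter(toy_assignment.values()))
```

Fixing the seed keeps the assignment reproducible, which matters if the LOCATION_ASSIGNMENT table ever has to be regenerated.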

Location History

For the purposes of this demo, we made some general assumptions to simplify the creation of synthetic data and interpretation of downstream analyses:

- All individuals in the synthetic dataset lived within one of the 20 urban areas selected from the PM_2.5 dataset, either for the entire period of 1998-2016, or from their birthdate to 2016 if they were born after 1998-01-01. We did not represent any movement between locations.
- Apart from maintaining a ratio of COPD to non-COPD patients according to the estimated OR, the individuals were assigned to locations entirely at random.

After creating the LOCATION_ASSIGNMENT table, we populated the LOCATION_HISTORY table with the following query:

CREATE OR REPLACE TABLE location_history AS (
     SELECT l.location_id,
            32848                                     AS relationship_type_concept_id, 
            1147314                                   AS domain_id,
            la.person_id                              AS entity_id,
            CASE
                WHEN year_of_birth < 1998 THEN CAST('1998-01-01' AS DATE)
                ELSE CAST(birth_datetime AS DATE) END AS start_date,
            CAST('2016-12-31' AS DATE)                AS end_date
     FROM location l
              INNER JOIN location_assignment la
                         ON l.location_id = la.location_id
              INNER JOIN person p
                         ON la.person_id = p.person_id
     WHERE year_of_birth < 2017
 );

Note that the DDL and description of LOCATION_HISTORY can be found in the GIS Documentation.
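
The start and end date rule in the query above can be restated as a small Python function, shown here only to make the logic easy to check. The function name and constants are hypothetical; the dates and the year-of-birth cutoffs mirror the CASE expression and WHERE clause.

```python
from datetime import date

STUDY_START = date(1998, 1, 1)
STUDY_END = date(2016, 12, 31)


def residence_period(birth_date):
    """Return (start_date, end_date) for one person, or None if born after the study window."""
    if birth_date.year >= 2017:
        return None  # mirrors the WHERE year_of_birth < 2017 filter
    # People born before 1998 start at the beginning of the PM_2.5 time series.
    start = STUDY_START if birth_date.year < 1998 else birth_date
    return start, STUDY_END


print(residence_period(date(1950, 6, 15)))
print(residence_period(date(2005, 3, 2)))
```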

Exposure Occurrence

We took two approaches to populate the EXPOSURE_OCCURRENCE table:

- An end-to-end workflow based on gaiaCore functionality enabled by the recent containerization work in GIS.
- A simplified query that references the appropriate concept_id value for the annual average of PM_2.5, given that this demonstration focuses on a single variable.

The gaiaCore containerized workflow is described in detail [elsewhere]( ); briefly, we added the PM_2.5 dataset as a data source in GaiaDB, and then converted its contents to the geom/attr representations expected by the gaiaCore package. We've also copied the query for populating the EXPOSURE_OCCURRENCE table directly below:

INSERT INTO exposure_occurrence (
select 
     row_number() OVER (order by l.entity_id) AS exposure_occurrence_id
      , l.location_id
      , l.entity_id AS person_id
      , 2052499839 AS exposure_concept_id
      , TO_DATE(CONCAT(p.year, '-01-01'), 'yyyy-mm-dd') AS exposure_start_date
      , TO_DATE(CONCAT(p.year, '-01-01'), 'yyyy-mm-dd')::timestamp AS exposure_start_datetime
      , TO_DATE(CONCAT(p.year + 1, '-01-01'), 'yyyy-mm-dd') AS exposure_end_date
      , TO_DATE(CONCAT(p.year + 1, '-01-01'), 'yyyy-mm-dd')::timestamp AS exposure_end_datetime
      , 2052499878 AS exposure_type_concept_id 
      , 2052496943 AS exposure_relationship_concept_id
      , 2052499839 AS exposure_source_concept_id
      , p.PM_VALUE AS exposure_source_value
      , 'WITHIN' AS exposure_relationship_source_value
      , 'ug/m3' AS dose_unit_source_value
      , 1 AS quantity
      , CAST(NULL AS VARCHAR(50)) AS modifier_source_value
      , 4172703 AS operator_concept_id
      , p.PM_VALUE AS value_as_number
      , CAST(NULL AS INTEGER) AS value_as_concept_id
      , 32964 AS unit_concept_id
    from pm25_limited p
    inner join location_history l
    on p.id = l.location_id
);

Note that we converted the 20-city subset of the PM_2.5 data to a single, long-format table that is referenced in the query above.
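
As an illustration of that reshaping step, the sketch below turns one wide row per city into one long row per city and year, and attaches the same annual exposure window used in the query (1 January of the year to 1 January of the following year). The wide column naming (`pm25_<year>`), the function name, and the sample values are assumptions for the example; only the long-format fields id, year, and PM value correspond to what the query references.

```python
from datetime import date


def wide_to_long(rows, years):
    """Convert wide PM_2.5 rows (one column per year) into long-format rows."""
    long_rows = []
    for row in rows:
        for year in years:
            value = row.get(f"pm25_{year}")
            if value is None:
                continue  # skip years with no measurement
            long_rows.append({
                "id": row["id"],
                "year": year,
                "pm_value": value,
                "exposure_start_date": date(year, 1, 1),
                "exposure_end_date": date(year + 1, 1, 1),
            })
    return long_rows


# Made-up values for two hypothetical location ids.
wide_rows = [
    {"id": 1, "pm25_1998": 10.2, "pm25_1999": 11.0},
    {"id": 2, "pm25_1998": 0.7, "pm25_1999": None},
]
pm25_long = wide_to_long(wide_rows, years=range(1998, 2000))
for long_row in pm25_long:
    print(long_row)
```

Joining a table of this shape to LOCATION_HISTORY on the location id yields one exposure row per person per year.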

With the addition of the EXPOSURE_OCCURRENCE table, the GIS-specific synthetic dataset in OMOP v5.4 format was integrated together with the GIS extension tables (LOCATION_HISTORY and EXPOSURE_OCCURRENCE) and a global GIS dataset related to urban pollution levels.

Access to Dataset

If you would like to access and download the GIS-specific synthetic COPD dataset, please contact Jared Houghtaling for more information about the associated data use agreement (DUA)!

OHDSI / GIS

[2024 Symposium] Synthetic Data #345
