OHDSI / ETL-Synthea

A package supporting the conversion from Synthea CSV to OMOP CDM
https://ohdsi.github.io/ETL-Synthea/
98 stars 71 forks source link

R script needs lubridate casting in synthea/native and allowing NULLs in concept table? #13

Closed turbomam closed 5 years ago

turbomam commented 5 years ago

Thanks for releasing this tool.

I'm finding that I have to make some modifications in order to get the R code (and embedded SQL) to run. Is this surprising?

I'm using Windows 10, Postgres 11, and R 3.5.0

I created my synthea data set from GitBash like this:

./run_synthea -s 42 -p 1000 Massachusetts

So far I have modified LoadSyntheaTables to forcibly cast dates, and to cast UTILIZATION to numeric, not character. See below. The native synthea tables load now.

I also had to allow NULLs in concept.concept_name, concept.concept_code, concept_synonym.concept_synonym_name, and concept_relationship.valid_start_date. Now I can run all of the steps up to LoadEventTables, which give an error. I have substituted LoadCDMTables for that. See #14

Modified LoadSyntheaTables

  function (connectionDetails,
            syntheaDatabaseSchema,
            syntheaFileLoc) {
    csvList <-
      c(
        "allergies.csv",
        "conditions.csv",
        "imaging_studies.csv",
        "medications.csv",
        "organizations.csv",
        "procedures.csv",
        "careplans.csv",
        "encounters.csv",
        "immunizations.csv",
        "observations.csv",
        "patients.csv",
        "providers.csv"
      )
    conn <- DatabaseConnector::connect(connectionDetails)
    for (csv in csvList) {
      syntheaTable <- data.table::fread(
        file = paste0(syntheaFileLoc,
                      "/", csv),
        stringsAsFactors = FALSE,
        header = TRUE,
        sep = ","
      )
      writeLines(paste0("Loading: ", csv))
      if ("CODE" %in% colnames(syntheaTable))
        syntheaTable$CODE <- as.character(syntheaTable$CODE)
      if ("REASONCODE" %in% colnames(syntheaTable))
        syntheaTable$REASONCODE <-
        as.character(syntheaTable$REASONCODE)
      if ("PHONE" %in% colnames(syntheaTable))
        syntheaTable$PHONE <- as.character(syntheaTable$PHONE)

      # if ("UTILIZATION" %in% colnames(syntheaTable))
      #   syntheaTable$UTILIZATION <-
      #   as.character(syntheaTable$UTILIZATION)

      if ("UTILIZATION" %in% colnames(syntheaTable))
        syntheaTable$UTILIZATION <-
        as.numeric(as.character(syntheaTable$UTILIZATION))

      print(tableName <- paste0(syntheaDatabaseSchema,
                                ".", strsplit(csv, "[.]")[[1]][1]))
      print(head(syntheaTable))

      if ("START" %in% colnames(syntheaTable)) {
        syntheaTable$START <- lubridate::as_date(syntheaTable$START)
      }
      if ("STOP" %in% colnames(syntheaTable)) {
        syntheaTable$STOP <- lubridate::as_date(syntheaTable$STOP)
      }
      if ("DATE" %in% colnames(syntheaTable)) {
        syntheaTable$DATE <- lubridate::as_date(syntheaTable$DATE)
      }

      if ("BIRTHDATE" %in% colnames(syntheaTable)) {
        syntheaTable$BIRTHDATE <- lubridate::as_date(syntheaTable$BIRTHDATE)
      }

      if ("DEATHDATE" %in% colnames(syntheaTable)) {
        syntheaTable$DEATHDATE <- lubridate::as_date(syntheaTable$DEATHDATE)
      }

      DatabaseConnector::insertTable(
        conn,
        tableName,
        data = as.data.frame(syntheaTable),
        dropTableIfExists = FALSE,
        createTable = FALSE,
        progressBar = TRUE
      )
    }
    on.exit(DatabaseConnector::disconnect(conn))
  }

Some warnings

Loading: concept_relationship.csv ...snip... 100%

1: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, : Found and resolved improper quoting out-of-sample. First healed line 50165: <<44822272 "Light-for-dates"without mention of fetal malnutrition, 1,250- 1,499 grams 4180186>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

2: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, : Found and resolved improper quoting out-of-sample. First healed line 64070: <<2618083 "opt out" physician or practitioner emergency or urgent service Observation HCPCS HCPCS Modifier S GJ 19981001 20991231 >>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning. 3: In max(nchar(as.character(obj)), na.rm = TRUE) : no non-missing arguments to max; returning -Inf

and

Running: final_visit_ids.sql ...snip... 100% ... Executing SQL took 0.296 secs

In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema) : Parameter 'synthea_schema' not found in SQL

Selected OMOP vocabularies

vocabulary_id vocabulary_name vocabulary_reference vocabulary_version vocabulary_concept_id
ABMS Provider Specialty (American Board of Medical Specialties) http://www.abms.org/member-boards/specialty-subspecialty-certificates 2018-06-26 ABMS 45756746
ATC WHO Anatomic Therapeutic Chemical Classification FDB UK distribution package RXNORM 2018-08-12 44819117
CDM OMOP Common DataModel https://github.com/OHDSI/CommonDataModel CDM v6.0.0 32485
Cohort Type OMOP Cohort Type OMOP generated 44819234
Concept Class OMOP Concept Class OMOP generated 44819233
Condition Type OMOP Condition Occurrence Type OMOP generated 44819127
Cost OMOP Cost OMOP generated 581457
Cost Type OMOP Cost Type OMOP generated 5029
CPT4 Current Procedural Terminology version 4 (AMA) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 2018 Release 44819100
Death Type OMOP Death Type OMOP generated 44819135
Device Type OMOP Device Type OMOP generated 44819151
dm+d Dictionary of Medicines and Devices (NHS) https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/24/releases dm+d Version 6.0.0 20180604 232
Domain OMOP Domain OMOP generated 44819147
Drug Type OMOP Drug Exposure Type OMOP generated 44819126
EphMRA ATC Anatomical Classification of Pharmaceutical Products (EphMRA) http://www.ephmra.org/Anatomical-Classification EphMRA ATC 2016 243
Episode OMOP Episode OMOP generated 32523
Episode Type OMOP Episode Type OMOP generated 32542
Ethnicity OMOP Ethnicity OMOP generated 44819134
GCN_SEQNO Clinical Formulation ID (FDB) FDB US distribution package 20151119 Release 44819141
Gender OMOP Gender OMOP generated 44819108
GGR Commented Drug Directory (BCFI) http://www.bcfi.be/nl/download GGR 20180901 581450
HCPCS Healthcare Common Procedure Coding System (CMS) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 20180101 Alpha Numeric HCPCS File 44819101
ICD10 International Classification of Diseases, Tenth Revision (WHO) http://www.who.int/classifications/icd/icdonlineversions/en/ 2016 Release 44819124
ICD10CM International Classification of Diseases, Tenth Revision, Clinical Modification (NCHS) http://www.cdc.gov/nchs/icd/icd10cm.htm ICD10CM FY2019 code descriptions 44819098
ICD10PCS ICD-10 Procedure Coding System (CMS) http://www.cms.gov/Medicare/Coding/ICD10/index.html ICD10PCS 2019 44819125
ICD9CM International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 1 and 2 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 5046
ICD9Proc International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 44819099
ICDO3 International Classification of Diseases for Oncology, Third Edition (WHO) https://seer.cancer.gov/icd-o-3/ ICDO3 SEER Site/Histology Released 09/18/2015 581426
LOINC Logical Observation Identifiers Names and Codes (Regenstrief Institute) http://loinc.org/downloads/loinc 2.65 44819102
Meas Type OMOP Measurement Type OMOP generated 44819152
Medicare Specialty Medicare provider/supplier specialty codes (CMS) http://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/MedicareProviderSupEnroll/Taxonomy.html 2018-06-26 Specialty 44819138
MEDRT Medication Reference Terminology MED-RT (VA) https://nciterms.nci.nih.gov/ncitbrowser/pages/vocabulary.jsf?dictionary=MED-RT MED-RT 2018-09-04 32537
Multum Cerner Multum (Cerner) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html 7/10/2013 44819112
NDC National Drug Code (FDA and manufacturers) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html, http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM070838.zip NDC 20190310 44819105
NDFRT National Drug File - Reference Terminology (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819103
NFC New Form Code (EphMRA) http://www.ephmra.org/New-Form-Codes-Classification NFC 20160704 245
None OMOP Standardized Vocabularies OMOP generated v5.0 13-MAR-19 44819096
Note Type OMOP Note Type OMOP generated 44819146
Obs Period Type OMOP Observation Period Type OMOP generated 44819149
Observation Type OMOP Observation Type OMOP generated 44819129
OSM OpenStreetMap https://www.openstreetmap.org/copyright/en, https://wambachers-osm.website/boundaries/ OSM Release 2019-02-21 32541
PCORNet National Patient-Centered Clinical Research Network (PCORI) OMOP generated 44819148
PHDSC Source of Payment Typology (PHDSC) http://www.phdsc.org/standards/payer-typology-source.asp Version 3.0 32473
Plan Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor 32471
Plan Stop Reason Plan Stop Reason - Reason for termination of the Health Plan 32474
PPI AllOfUs_PPI (Columbia) http://terminology.pmi-ops.org/CodeSystem/ppi Codebook Version 0.3.12 581404
Procedure Type OMOP Procedure Occurrence Type OMOP generated 44819128
Race Race and Ethnicity Code Set (USBC) http://www.cdc.gov/nchs/data/dvs/Race_Ethnicity_CodeSet.pdf Version 1.0 44819109
Relationship OMOP Relationship OMOP generated 44819235
RxNorm RxNorm (NLM) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RxNorm 20190204 44819104
RxNorm Extension RxNorm Extension (OMOP) OMOP generated RxNorm Extension 2019-03-11 252
SNOMED Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html Snomed Release 20190131 44819097
Specimen Type OMOP Specimen Type OMOP generated 581376
SPL Structured Product Labeling (FDA) http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm NDC 20190310 44819140
Sponsor Sponsor - institution or individual financing healthcare transactions 32472
UB04 Point of Origin UB04 Claim Source Inpatient Admission Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Source-Inpatient-Admission-Code 32045
UB04 Pri Typ of Adm UB04 Claim Inpatient Admission Type Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Inpatient-Admission-Type-Code 32046
UB04 Pt dis status UB04 Patient Discharge Status Code (CMS) https://www.resdac.org/cms-data/variables/patient-discharge-status-code 32047
UB04 Typ bill UB04 Type of Bill - Institutional (USHIK) https://ushik.ahrq.gov/ViewItemDetails?&system=apcd&itemKey=196987000 32044
UCUM Unified Code for Units of Measure (Regenstrief Institute) http://aurora.regenstrief.org/~ucum/ucum.html#section-Alphabetic-Index Version 1.8.2 44819107
US Census United States Census Bureau https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html US Census 2017 Release 32570
VA Class VA National Drug File Class (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819122
VA Product VA National Drug File Product (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819120
Visit OMOP Visit OMOP generated 44819119
Visit Type OMOP Visit Type OMOP generated 44819150
Vocabulary OMOP Vocabulary OMOP generated 44819232
turbomam commented 5 years ago

this wasn't a problem when I tried a hybrid of the Linux bulk loader + the R scripts with some modifications (#17)

turbomam commented 5 years ago

@AnthonyMolinaro thanks for looking at my issues. I saw your recent pushes, so I just now did a fresh devtools::install_github("OHDSI/ETL-Synthea")

I think the date reformatting is still required, and that either

  1. concept.concept_name, concept.concept_code, concept_synonym.concept_synonym_name have to be altered to accept NULLs
  2. or the R code has to be modified to put placeholder values in those columns (that's what I did today.)
AnthonyMolinaro commented 5 years ago

@turbomam Thanks again for your work here Mark. I applied the fixes to an earlier commit yesterday.