Closed oneilsh closed 3 years ago
I managed to get this to run via some hints in #57, with some tweaks to tableName
RE #86:
LoadVocabFromCsv <- function (connectionDetails, cdmDatabaseSchema, vocabFileLoc)
{
#csvList <- c("concept.csv","vocabulary.csv","concept_ancestor.csv","concept_relationship.csv","relationship.csv","concept_synonym.csv","domain.csv","concept_class.csv", "drug_strength.csv")
csvList <- c("vocabulary.csv")
conn <- DatabaseConnector::connect(connectionDetails)
for (csv in csvList) {
vocabTable <- data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, header = TRUE, sep = "\t", na.strings = NULL)
# Format Dates for tables that need it
if (base::identical(csv,"concept.csv") || base::identical(csv,"concept_relationship.csv") || base::identical(csv,"drug_strength.csv")) {
vocabTable$valid_start_date <- as.Date(as.character(vocabTable$valid_start_date),"%Y%m%d")
vocabTable$valid_end_date <- as.Date(as.character(vocabTable$valid_end_date),"%Y%m%d")
}
writeLines(paste0("Loading: ",csv))
DatabaseConnector::insertTable(conn,tableName=paste0(cdmDatabaseSchema,".",strsplit(csv,"[.]")[[1]][1]), data=as.data.frame(vocabTable), dropTableIfExists = FALSE, createTable = FALSE, progressBar = TRUE)
}
on.exit(DatabaseConnector::disconnect(conn))
}
LoadVocabFromCsv(cd, cdmSchema, vocabFileLoc)
@oneilsh What did you change in the code? I had the same issue but I simply added a random value to the vocabulary.csv
file in the missing column.
Depends what you mean - I didn't modify any of the codebase internals, just included a version of the LoadVocabFromCsv function in the script, from the version defined in the fork linked from #57 with an explicit setting of the tableName
parameter in the call to DatabaseConnector::insertTable
(as I recall I had trouble getting params to match right by position against the signatutre for insertTable, possibly the param list has changed in newer versions of DatabaseConnector).
Here's my frankenstein script so far:
library(ETLSyntheaBuilder)
# download the drivers if needed to the current working directory
#downloadJdbcDrivers("postgresql", ".")
cd <- DatabaseConnector::createConnectionDetails(
dbms = "postgresql",
server = "localhost/synthea10",
user = "oneils",
password = "",
port = 5432,
pathToDriver = "."
)
cdmSchema <- "cdm_synthea10"
cdmVersion <- "5.3.1"
syntheaVersion <- "2.7.0"
syntheaSchema <- "native"
syntheaFileLoc <- "100k_synthea_covid19_csv"
vocabFileLoc <- "omop_vocab_athena_download"
Sys.setenv(POSTGRES_PATH = "/usr/local/bin/postgres")
# ahem https://github.com/OHDSI/ETL-Synthea/issues/57
# along with change from https://github.com/OHDSI/ETL-Synthea/commit/af15bc1f42097fb08b2291066daf399ed2b68fa1#diff-22371047436aa33aa4e84fc366ec4a76b6f0936e06ca5c90b2df5fbbb0b1ad44
LoadVocabFromCsv <- function (connectionDetails, cdmDatabaseSchema, vocabFileLoc)
{
#csvList <- c("concept.csv","vocabulary.csv","concept_ancestor.csv","concept_relationship.csv","relationship.csv","concept_synonym.csv","domain.csv","concept_class.csv", "drug_strength.csv")
csvList <- c("vocabulary.csv")
conn <- DatabaseConnector::connect(connectionDetails)
for (csv in csvList) {
vocabTable <- data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, header = TRUE, sep = "\t", na.strings = NULL)
# Format Dates for tables that need it
if (base::identical(csv,"concept.csv") || base::identical(csv,"concept_relationship.csv") || base::identical(csv,"drug_strength.csv")) {
vocabTable$valid_start_date <- as.Date(as.character(vocabTable$valid_start_date),"%Y%m%d")
vocabTable$valid_end_date <- as.Date(as.character(vocabTable$valid_end_date),"%Y%m%d")
}
writeLines(paste0("Loading: ",csv))
DatabaseConnector::insertTable(conn,tableName=paste0(cdmDatabaseSchema,".",strsplit(csv,"[.]")[[1]][1]), data=as.data.frame(vocabTable), dropTableIfExists = FALSE, createTable = FALSE, progressBar = TRUE)
}
on.exit(DatabaseConnector::disconnect(conn))
}
ETLSyntheaBuilder::CreateCDMTables(connectionDetails = cd, cdmSchema = cdmSchema, cdmVersion = cdmVersion)
ETLSyntheaBuilder::CreateSyntheaTables(connectionDetails = cd, syntheaSchema = syntheaSchema, syntheaVersion = syntheaVersion)
# Optional: Create index and constraint DDL scripts for the rdbms that support them. Scripts will be written to the "output" directory.
ETLSyntheaBuilder::CreateCDMIndexAndConstraintScripts(connectionDetails = cd, cdmSchema = cdmSchema, cdmVersion = cdmVersion, githubTag = "v5.3.1_fixes")
system("psql -d synthea10 -f output/postgresql_5.3.1_index_ddl.sql")
ETLSyntheaBuilder::LoadSyntheaTables(connectionDetails = cd, syntheaSchema = syntheaSchema, syntheaFileLoc = syntheaFileLoc)
ETLSyntheaBuilder::LoadVocabFromCsv(connectionDetails = cd, cdmSchema = cdmSchema, vocabFileLoc = vocabFileLoc)
# fix for null constraint violation error when loading vocabulary.csv
LoadVocabFromCsv(cd, cdmSchema, vocabFileLoc)
ETLSyntheaBuilder::LoadEventTables(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, cdmVersion = cdmVersion)
Thanks! Yes, I was referring to the difference between the function from the issue and the original one. I kind of don't unuderstand yet why the tableName makes a difference but I'll try again tomorrow ;)
I put my first attempt of loading Synthea to OMOP into Docker containers, maybe useful: https://github.com/kaiserpreusse/synthea2omop
EDITED: Just realized I addressed me question to the wrong person. : ) @oneilsh Just curious, is there missing data in the vocabulary.csv file you're using? I'm asking because from what you originally posted, a not-null constraint was violated. If you execute:
vocabTable <- data.table::fread(file = "omop_vocab_athena_download/vocabulary.csv", stringsAsFactors = FALSE, header = TRUE, sep = "\t", na.strings = "")
View(vocabTable)
Do you see missing data? (Would show up as NA in R)
Aha! Sorry, I'm pretty new to OHDSI and medical vocabularies. Looking at the result I'm guessing I got overly ambitious in which vocabularies I grabbed from Athena (since I wasn't sure which ones to grab, I did most of the ones that weren't licensed)
> vocabTable
vocabulary_id vocabulary_name vocabulary_reference vocabulary_version vocabulary_concept_id
1: Death Type OMOP Death Type OMOP generated <NA> 44819135
2: Cohort Type OMOP Cohort Type OMOP generated <NA> 44819234
3: Condition Status OMOP Condition Status OMOP generated <NA> 32887
4: NUCC National Uniform Claim Committee Health Care Provider Taxonomy Code Set (NUCC) http://www.nucc.org/index.php?option=com_content&view=article&id=107&Itemid=132 2018-06-26 NUCC 44819137
5: Revenue Code UB04/CMS1450 Revenue Codes (CMS) http://www.mpca.net/?page=ERC_finance 2010 Release 44819133
6: ICD9CM International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 1 and 2 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 5046
7: Type Concept OMOP Type Concept OMOP generated Type Concept 20210212 32808
8: Race Race and Ethnicity Code Set (USBC) http://www.cdc.gov/nchs/data/dvs/Race_Ethnicity_CodeSet.pdf Version 1.0 44819109
9: Domain OMOP Domain OMOP generated <NA> 44819147
10: CMS Place of Service Place of Service Codes for Professional Claims (CMS) http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/PhysicianFeeSched/downloads//Website_POS_database.pdf 2009-01-11 44819110
11: Medicare Specialty Medicare provider/supplier specialty codes (CMS) http://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/MedicareProviderSupEnroll/Taxonomy.html 2018-06-26 Specialty 44819138
12: None OMOP Standardized Vocabularies OMOP generated v5.0 26-FEB-21 44819096
13: Note Type OMOP Note Type OMOP generated <NA> 44819146
14: PHDSC Source of Payment Typology (PHDSC) http://www.phdsc.org/standards/payer-typology-source.asp Version 3.0 32473
15: Cost OMOP Cost OMOP generated <NA> 581457
16: UB04 Point of Origin UB04 Claim Source Inpatient Admission Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Source-Inpatient-Admission-Code <NA> 32045
17: Meas Type OMOP Measurement Type OMOP generated <NA> 44819152
18: Visit OMOP Visit OMOP generated <NA> 44819119
19: Visit Type OMOP Visit Type OMOP generated <NA> 44819150
20: ICDO3 International Classification of Diseases for Oncology, Third Edition (WHO) https://seer.cancer.gov/icd-o-3/ ICDO3 SEER Site/Histology Released 06/2019 581426
21: UB04 Typ bill UB04 Type of Bill - Institutional (USHIK) https://ushik.ahrq.gov/ViewItemDetails?&system=apcd&itemKey=196987000 <NA> 32044
22: LOINC Logical Observation Identifiers Names and Codes (Regenstrief Institute) http://loinc.org/downloads/loinc 2.68 44819102
23: US Census United States Census Bureau https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html US Census 2017 Release 32570
24: Condition Type OMOP Condition Occurrence Type OMOP generated <NA> 44819127
25: Metadata Metadata OMOP generated <NA> 32675
26: Ethnicity OMOP Ethnicity OMOP generated <NA> 44819134
27: ICD10PCS ICD-10 Procedure Coding System (CMS) http://www.cms.gov/Medicare/Coding/ICD10/index.html ICD10PCS 2021 44819125
28: Procedure Type OMOP Procedure Occurrence Type OMOP generated <NA> 44819128
29: Episode OMOP Episode OMOP generated Episode 20201014 32523
30: Obs Period Type OMOP Observation Period Type OMOP generated <NA> 44819149
31: Cost Type OMOP Cost Type OMOP generated <NA> 5029
32: Plan Stop Reason Plan Stop Reason - Reason for termination of the Health Plan OMOP generated <NA> 32474
33: Concept Class OMOP Concept Class OMOP generated <NA> 44819233
34: UCUM Unified Code for Units of Measure (Regenstrief Institute) http://aurora.regenstrief.org/~ucum/ucum.html#section-Alphabetic-Index Version 1.8.2 44819107
35: UB04 Pri Typ of Adm UB04 Claim Inpatient Admission Type Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Inpatient-Admission-Type-Code <NA> 32046
36: ATC WHO Anatomic Therapeutic Chemical Classification http://www.whocc.no/atc_ddd_index/ RxNorm 20200504 44819117
37: Plan Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor OMOP generated <NA> 32471
38: CPT4 Current Procedural Terminology version 4 (AMA) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 2020 Release 44819100
39: Drug Type OMOP Drug Exposure Type OMOP generated <NA> 44819126
40: CDM OMOP Common DataModel https://github.com/OHDSI/CommonDataModel CDM v6.0.2 32485
41: Observation Type OMOP Observation Type OMOP generated <NA> 44819129
42: HCPCS Healthcare Common Procedure Coding System (CMS) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 2020 Alpha Numeric HCPCS File 44819101
43: Vocabulary OMOP Vocabulary OMOP generated <NA> 44819232
44: ABMS Provider Specialty (American Board of Medical Specialties) http://www.abms.org/member-boards/specialty-subspecialty-certificates 2018-06-26 ABMS 45756746
45: Device Type OMOP Device Type OMOP generated <NA> 44819151
46: ICD10 International Classification of Diseases, Tenth Revision (WHO) http://www.who.int/classifications/icd/icdonlineversions/en/ 2020 Release 44819124
47: OSM OpenStreetMap https://www.openstreetmap.org/copyright/en, https://wambachers-osm.website/boundaries/ OSM Release 2019-02-21 32541
48: Sponsor Sponsor - institution or individual financing healthcare transactions OMOP generated <NA> 32472
49: UB04 Pt dis status UB04 Patient Discharge Status Code (CMS) https://www.resdac.org/cms-data/variables/patient-discharge-status-code <NA> 32047
50: ICD9Proc International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 44819099
51: ICD10CM International Classification of Diseases, Tenth Revision, Clinical Modification (NCHS) http://www.cdc.gov/nchs/icd/icd10cm.htm ICD10CM FY2021 code descriptions 44819098
52: Relationship OMOP Relationship OMOP generated <NA> 44819235
53: NDC National Drug Code (FDA and manufacturers) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html, http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM070838.zip NDC 20210221 44819105
54: SPL Structured Product Labeling (FDA) http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm NDC 20210221 44819140
55: OMOP Extension OMOP Extension (OHDSI) OMOP generated OMOP Extension 20210226 32758
56: Korean Revenue Code Korean Revenue Code OMOP generated <NA> 32724
57: Gender OMOP Gender OMOP generated <NA> 44819108
58: Currency International Currency Symbol (ISO 4217) http://www.iso.org/iso/home/standards/currency_codes.htm 2008 44819153
59: RxNorm RxNorm (NLM) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RxNorm 20210104 44819104
60: RxNorm Extension RxNorm Extension (OHDSI) OMOP generated RxNorm Extension 2021-02-12 252
61: SNOMED Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 2020-07-31 SNOMED CT International Edition; 2020-09-01 SNOMED CT US Edition; 2020-10-28 SNOMED CT UK Edition 44819097
vocabulary_id vocabulary_name vocabulary_reference vocabulary_version vocabulary_concept_id
@oneilsh Np. : )
I don't think there should be missing data in that file though (or the vocabulary_version column should, in fact, be nullable).
As a temporary hack, you can do something like:
if (csv == "vocabulary.csv")
vocabTable <- vocabTable[-which(is.na(vocabTable$vocabulary_version))]
In the meantime, I'll check with our resident CDM expert to see if I can find out why there might be missing data from that file.
Hi @clairblacketer , is it expected that the vocabulary_version field of the vocabulary.csv downloaded from Athena is null?
@oneilsh Hi Shawn, so Clair has confirmed that vocabulary_version should be NULL: http://ohdsi.github.io/CommonDataModel/cdm531.html#VOCABULARY
This is the case in cdm version 6.0.0. Alternatively, you can specify githubTag = "v5.3.1_fixes" when creating the cdm tables. Since some people might not be ale to use v5.3.1_fixes, I'm going to add a quick fix to change the ddl to allow this field to be NULL.
Hello again :) I'm still working on ETLing the Synthea COVID 100K dataset, running into an error in the LoadVocabFromCsv step.
I did diverge from the suggested code slightly by generating the indexing scripts and running the index one (in a previous run the LoadEventTables step ran for > 4 hours on the condition_occurance table I think it was, after generating the indices, so I'm trying again from scratch with the latest codebase...)
some settings:
I get an error for
vocabulary.csv
, but the others appear to have loaded successfully:Thank you for looking!