OHDSI / ETL-Synthea

A package supporting the conversion from Synthea CSV to OMOP CDM
https://ohdsi.github.io/ETL-Synthea/

Loading Vocab From CSV is SLOW #115

Closed jdamato-mitre closed 2 years ago

jdamato-mitre commented 2 years ago

I am trying to load a vocabulary downloaded from Athena into a Postgres database using:

ETLSyntheaBuilder::LoadVocabFromCsv(connectionDetails = cd, cdmSchema = cdmSchema, vocabFileLoc = vocabFileLoc)

But after a few hours, RStudio just freezes up. Is there a quicker / more reliable way to get vocabularies into a CDM database?

In Athena I have just RxNorm Extension (OHDSI) and OMOP Extension (OHDSI) selected under CDM version 5.X. I chose these because they specifically say OHDSI, but I assume I need every named bundle?

These CSV files are not even that large, so I am not sure why this would freeze up. I also tried setting bulkLoad to TRUE, but that causes an error.

[screenshots: the LoadVocabFromCsv call and the error message from the bulkLoad attempt]

I also tried using Rscript instead of going through RStudio, but it still freezes up. [screenshot]

AnthonyMolinaro commented 2 years ago

Hi @jdamato-mitre ,

Some of the vocab tables are HUGE, so it may seem like R is freezing up while the records are being inserted one at a time. Loading a vocabulary is notoriously slow unless bulk loading is enabled. From the error message you posted, it looks like you're having trouble with PG's COPY command. DatabaseConnector leverages the bulk-loading capability of whichever RDBMS you happen to be using, so it would be a good idea to check whether it works outside R first. Here's a link to the PG bulk loading doc: COPY

I'd start by seeing if the COPY command works from the command line. Assuming allergies.csv is in /Apps/Git/synthea/output/csv, you could try this:

COPY allergies FROM '/Apps/Git/synthea/output/csv/allergies.csv';

If that doesn't work, then bulk loading via the R package won't work either.
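One snag worth ruling out while testing COPY from the command line (an assumption on my part, not something confirmed in this thread): vocabulary files exported by Athena are typically tab-delimited despite the `.csv` extension, while COPY in csv format defaults to commas. A quick look at the header row tells you which you have; the sample file below is a stand-in for a real vocab file:

```shell
# Create a tiny tab-delimited stand-in; in practice, point the check at
# your real CONCEPT.csv (or whichever file COPY is choking on).
printf 'concept_id\tconcept_name\n1\tAspirin\n' > /tmp/concept_sample.csv

# Inspect the first line for a tab character.
if head -n 1 /tmp/concept_sample.csv | grep -q "$(printf '\t')"; then
  echo "tab-delimited"
else
  echo "comma-delimited"
fi
```

If the file turns out to be tab-delimited, the delimiter has to be spelled out in the COPY options, e.g. `WITH (FORMAT csv, DELIMITER E'\t', HEADER)`.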

jdamato-mitre commented 2 years ago

Thanks @AnthonyMolinaro. I was using everything from Athena 5.x and loading into a Postgres container. The bottleneck turned out to be running Docker Desktop on a Mac; switching to Docker Engine on Linux made things much faster. It still takes about 1.5 hours, but it completes.
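For anyone else hitting this on a Mac: part of the slowdown is likely Docker Desktop's file-sharing layer on bind mounts. Keeping Postgres's data directory on a named volume stores it on the VM's native Linux filesystem instead, which is usually much faster. A sketch (container name, image tag, port, and password are illustrative, and this is an infra config fragment, not something runnable here):

```shell
# Named volume: data lives inside the Docker VM, bypassing the slow
# macOS <-> VM file-sharing layer that bind mounts go through.
docker volume create pgdata

docker run -d --name cdm-pg \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13
```

This doesn't remove the per-row insert overhead that bulk loading addresses, but it avoids paying the file-sharing tax on every write to the data directory.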