OHDSI / ETL-Synthea

A package supporting the conversion from Synthea CSV to OMOP CDM
https://ohdsi.github.io/ETL-Synthea/
98 stars 71 forks source link

scaling issue in insert_drug_era.sql #56

Closed scivm closed 3 years ago

scivm commented 4 years ago

NOTE: Using main latest I always get 0 records in my drug_era table and it finishes in less then a second. I had to use the pull request https://github.com/OHDSI/ETL-Synthea/pull/48 to get the drug_era conversion table populated. I basically took that branch and remove the OMOP 6.0 specific code and ended up with a working migration after I also took pull request https://github.com/OHDSI/ETL-Synthea/pull/53 to get the OMOP 5.3.1 death table. I then added some code to migrate the location data in synthea patient table to OMOP location table. Code is at https://github.com/scivm/ETL-Synthea.

Doing testing with this patched setup, I found that with 100K patients generated from synthea, the conversion in insert_drug_era.sql takes 1.85 days. 10k patients took less then a minute and 1k patients took about 15 seconds. I don't think throwing more computer power at it will help since postgres only uses a single core to run the migration and I have already tried to use 8GB ram?

Setup: Postgres 10 on windows 10 laptop with i7 2 cpu with total 8 core, 32GB ram and Ubuntu for windows.

Used 8 GB of ram for shared_buffers

shared_buffers = 8096MB                 # min 128kB
temp_buffers = 128MB                    # min 800kB
work_mem = 64MB                         # min 64kB
maintenance_work_mem = 64MB             # min 1MB
max_stack_depth = 2MB                   # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_worker_processes = 6                # (change requires restart)
max_parallel_workers_per_gather = 2     # taken from max_parallel_workers
max_parallel_workers = 6                # maximum number of max_worker_processes that

Put indexes on vocabulary tables after loading them:

ALTER TABLE cdm.concept ADD CONSTRAINT xpk_concept PRIMARY KEY (concept_id);
ALTER TABLE cdm.vocabulary ADD CONSTRAINT xpk_vocabulary PRIMARY KEY (vocabulary_id);
ALTER TABLE cdm.domain ADD CONSTRAINT xpk_domain PRIMARY KEY (domain_id);
ALTER TABLE cdm.concept_class ADD CONSTRAINT xpk_concept_class PRIMARY KEY (concept_class_id);
ALTER TABLE cdm.concept_relationship ADD CONSTRAINT xpk_concept_relationship PRIMARY KEY (concept_id_1,concept_id_2,relationship_id);
ALTER TABLE cdm.relationship ADD CONSTRAINT xpk_relationship PRIMARY KEY (relationship_id);
ALTER TABLE cdm.concept_ancestor ADD CONSTRAINT xpk_concept_ancestor PRIMARY KEY (ancestor_concept_id,descendant_concept_id);
ALTER TABLE cdm.source_to_concept_map ADD CONSTRAINT xpk_source_to_concept_map PRIMARY KEY (source_vocabulary_id,target_concept_id,source_code,valid_end_date);
ALTER TABLE cdm.drug_strength ADD CONSTRAINT xpk_drug_strength PRIMARY KEY (drug_concept_id, ingredient_concept_id);

Analyzed all tables before starting

AnthonyMolinaro commented 3 years ago

@scivm Hi, thanks for bringing this to our attention.

We've let ETL-Synthea languish for a bit in 2020, but we're aiming to keep the number of issues at zero in the future. :)

This PR includes (amongst other things) a new function that will pull the latest index and constraint ddl scripts from OHDSI CDM. These scripts (located in the "output" directory) can be run in a separate SQL session or by simply passing them to DatabaseConnector::executeSql().