Yeah, come to think of it, this was perhaps to be expected. I'd start with simple batching by variable (one row in concept_lookup_stem = one batch) and see how far that gets us.
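Something like this is what I have in mind — a minimal pandas sketch, assuming the stem table and concept_lookup_stem share a variable column; the function and column names are placeholders, not our actual API:

```python
import pandas as pd

def map_in_batches(stem_df: pd.DataFrame, concept_lookup_stem: pd.DataFrame) -> pd.DataFrame:
    """Join the stem rows to the concept lookup one variable at a time,
    so only a single variable's rows are in memory during the join."""
    chunks = []
    for _, lookup_row in concept_lookup_stem.iterrows():
        # Restrict to this variable's rows before the (potentially exploding) join.
        subset = stem_df.loc[stem_df["variable"] == lookup_row["variable"]]
        mapped = subset.merge(lookup_row.to_frame().T, on="variable", how="left")
        chunks.append(mapped)
    return pd.concat(chunks, ignore_index=True)
```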
Makes me wonder whether it would have been a better plan to stage the stem table in multiple stem-*.parquet files instead and read from them when populating the clinical OMOP tables. But that would require way too much refactoring to be feasible now.
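For reference, roughly what that staging could look like (the paths, the `variable` partition key, and the pandas usage are all assumptions, not what the pipeline does today):

```python
from pathlib import Path
import pandas as pd

def stage_stem(stem_df: pd.DataFrame, out_dir: str = "stem_parts") -> None:
    """Write one stem-<variable>.parquet file per variable."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for variable, part in stem_df.groupby("variable"):
        part.to_parquet(Path(out_dir) / f"stem-{variable}.parquet", index=False)

def iter_stem_parts(out_dir: str = "stem_parts"):
    """Stream the staged partitions back so the clinical OMOP table builders
    never hold the full stem table in memory."""
    for path in sorted(Path(out_dir).glob("stem-*.parquet")):
        yield pd.read_parquet(path)
```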
Agreed - I am now testing how fast it gets without limiting the memory (it was previously running with 120 GB out of the 164 GB available). If that is still not enough, we could probably try batching on the indexes of the concept_lookup_stem uid (rough sketch below).
UPDATE: limiting the memory to 164 GB kills the process.
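Roughly what I mean by uid batching — process concept_lookup_stem in fixed-size uid slices instead of one row per batch (the `uid` column name and the batch size are assumptions):

```python
import pandas as pd

def iter_lookup_batches(concept_lookup_stem: pd.DataFrame, batch_size: int = 1_000):
    """Yield fixed-size slices of concept_lookup_stem by uid."""
    uids = concept_lookup_stem["uid"].sort_values().to_numpy()
    for start in range(0, len(uids), batch_size):
        batch = uids[start:start + batch_size]
        yield concept_lookup_stem[concept_lookup_stem["uid"].isin(batch)]
```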
I am testing the #115 ETL on the data. Now there are many billions more mapped codes, and the observations-to-stem step is taking quite some time (7 hours for a medium hospital). This makes the whole ETL much slower (4 days instead of half a day), and this is WITHOUT unmapped codes.
We should investigate batching or other solutions.