Medical-Event-Data-Standard / meds_etl

A collection of ETLs from common data formats to Medical Event Data Standard
Apache License 2.0

slow meds_etl_mimic #22

Closed (mirmuss closed 2 months ago)

mirmuss commented 2 months ago

Hi,

I am aware of the ongoing efforts to add a C++ backend to speed things up.

But I am wondering how long the default backend is expected to take to convert MIMIC to MEDS?

For me, the "Gathering measurements into events, events into timelines" part is estimated to take hundreds of hours with num_shards set to 100.

EthanSteinberg commented 2 months ago

It's quite slow. The last time I ran it I used 32 cores and it took about 24 hours.

I highly recommend switching to the cpp or duckdb backend (which we just added). I added a warning to the code to tell people to avoid using the polars backend if possible.

I can confirm that the C++ backend completes within about half an hour. One trick with the C++ backend is that you want to set the number of shards to the number of CPU cores (which I added to the README).
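As a rough illustration of the shard tip, here is a minimal Python sketch that sets the shard count to the machine's core count and assembles a command line for meds_etl_mimic. The helper name build_mimic_etl_command, the example paths, and the exact flag spellings (--num_shards, --backend) are assumptions made for illustration; check the README for the real invocation before running anything.

```python
import os


def build_mimic_etl_command(src_dir, out_dir, backend="cpp", num_shards=None):
    """Assemble an argv list for meds_etl_mimic (hypothetical helper).

    The flag spellings below are assumptions, not confirmed CLI options.
    """
    if num_shards is None:
        # os.cpu_count() reports logical CPUs and may return None,
        # so fall back to a single shard in that case.
        num_shards = os.cpu_count() or 1
    return [
        "meds_etl_mimic",
        src_dir,
        out_dir,
        "--num_shards", str(num_shards),  # assumed flag name
        "--backend", backend,             # assumed flag name
    ]


if __name__ == "__main__":
    # Placeholder paths; replace with your own MIMIC source and output folders.
    print(" ".join(build_mimic_etl_command("mimic_src", "mimic_meds")))
```

The sketch only prints the command, so you can inspect it first. Note that os.cpu_count() counts logical cores, which can be more than the cores actually available to your job on a shared machine.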