EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/

Memory issue with hesin_diag to condition_occurrence #310

Closed: MaximMoinat closed this issue 3 years ago

MaximMoinat commented 3 years ago

This mapping loads the whole source data (hesin and hesin_diag) into memory, because we run some pandas DataFrame operations on it (drop duplicates, merge). Because the data is large (millions of records) and the server has little memory (a few GB), the process runs out of memory and crashes.

We should find an alternative to loading the whole tables into memory.

https://github.com/EHDEN/ETL-UK-Biobank/blob/9ee393a017427eebaa89555b282c71eae9988c64/src/main/python/transformation/hesin_diag_to_condition_occurrence.py#L21-L24
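One possible direction, as a minimal sketch rather than the change that was eventually made: stream hesin_diag in chunks with pandas and join each chunk against a trimmed copy of hesin. The function name `iter_merged_chunks`, the column names (`eid`, `ins_index`, `admidate`), the comma separator, and the chunk size are all assumptions for illustration. It also assumes that hesin, reduced to the columns actually needed, fits in memory.

```python
import pandas as pd


def iter_merged_chunks(hesin_source, hesin_diag_source, chunksize=100_000, sep=","):
    """Yield hesin_diag rows joined to hesin, one chunk at a time.

    Hypothetical helper: column names and separator are assumptions.
    """
    # Read only the hesin columns needed for the join and the mapping,
    # and keep one row per (eid, ins_index) so the merge cannot multiply rows.
    hesin = pd.read_csv(
        hesin_source,
        sep=sep,
        usecols=["eid", "ins_index", "admidate"],
        dtype={"admidate": str},
    ).drop_duplicates(subset=["eid", "ins_index"])

    # Stream hesin_diag instead of loading the whole table.
    for chunk in pd.read_csv(hesin_diag_source, sep=sep, chunksize=chunksize):
        # Note: this only removes duplicates that fall within the same chunk.
        chunk = chunk.drop_duplicates()
        yield chunk.merge(hesin, on=["eid", "ins_index"], how="left")
```

Each yielded DataFrame holds at most `chunksize` rows, so peak memory is bounded by the trimmed hesin table plus one chunk. Duplicates that span two chunks are not removed here and would need a separate step, for example a uniqueness constraint in the target database.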

spiros commented 3 years ago

Is Dask maybe a good alternative?

(Sent by email in reply to Maxim Moinat's comment of Thu, Jul 1, 2021, 11:26 AM.)

MaximMoinat commented 3 years ago

Made minor changes; the whole ETL can run now.