Closed MaximMoinat closed 3 years ago
Is dask maybe a good alternative?
On Thu, Jul 1, 2021 at 11:26 AM Maxim Moinat @.***> wrote:
This mapping loads the whole source data (hesin and hesin_diag) into memory, because we run some pandas DataFrame operations on it (drop duplicates, merge). The data is big (millions of records) and the server has little memory (a few GB), so the process runs out of memory and crashes.
We should find an alternative to loading the whole tables into memory.
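One possible direction, as a rough sketch only and not the project's actual fix: stream both files into SQLite with the standard library and let the database do the de-duplication and the merge, so Python only ever holds one batch of rows at a time. The function names (`load_csv`, `iter_diagnoses`), the batch size, and the column names (`eid`, `ins_index`, `diag_icd10`, `admidate`) are assumptions for illustration and would need to be matched to the real hesin and hesin_diag layout.

```python
import csv
import sqlite3
from itertools import islice


def load_csv(conn, table, path, columns, batch_size=50_000):
    """Stream a CSV file into an SQLite table in fixed-size batches.

    Only `batch_size` rows are held in Python memory at any time.
    The table and column names are hypothetical and must be trusted input.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {table} ({col_list})")
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            batch = [tuple(row[c] for c in columns)
                     for row in islice(reader, batch_size)]
            if not batch:
                break
            conn.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", batch
            )
    conn.commit()


def iter_diagnoses(conn):
    """Yield de-duplicated diagnosis rows joined to their hospital episode.

    Assumes tables `hesin` and `hesin_diag` were loaded with `load_csv`
    and share the (assumed) key columns `eid` and `ins_index`.
    """
    # Index the join key so the merge does not rescan hesin for every row.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_hesin ON hesin (eid, ins_index)"
    )
    query = """
        SELECT DISTINCT d.eid, d.ins_index, d.diag_icd10, h.admidate
        FROM hesin_diag AS d
        LEFT JOIN hesin AS h
          ON h.eid = d.eid AND h.ins_index = d.ins_index
    """
    # Iterating the cursor fetches rows lazily instead of building a list.
    yield from conn.execute(query)
```

To keep memory low in practice, the connection would point at a file (for example `sqlite3.connect("hes_tmp.db")`, a hypothetical path) and not at `:memory:`. The rows from `iter_diagnoses` can then be consumed one at a time by the mapping. Reading with pandas in chunks or using dask, as suggested above, would be alternatives to the same end.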
Made minor changes, whole ETL can run now.
The code in question: https://github.com/EHDEN/ETL-UK-Biobank/blob/9ee393a017427eebaa89555b282c71eae9988c64/src/main/python/transformation/hesin_diag_to_condition_occurrence.py#L21-L24