EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/

Load in memory only selected fields from source tables #21

Closed alepev closed 3 years ago

alepev commented 3 years ago

At the moment we are implementing the baseline transformations using synthetic data that covers only a subset of fields (the subset depends on the target CDM table), so the table is quite small. Later on we expect to map from a baseline export containing ALL fields (or at least the prioritized ones); if we load that table in memory during a transformation, as happens now, performance will likely suffer. To prevent this, we could add a filtering step during table loading that keeps in memory only the fields needed for the specific transformation. With pandas.read_csv() this can be done via the usecols argument, but our current implementation is based on a different library, so we need to find an equivalent.
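For illustration, a minimal sketch of the pandas approach; the file name and field identifiers below are placeholders, not the actual UK Biobank column names used in the project:

```python
import pandas as pd

# Placeholder field identifiers; in practice these would be the columns
# required by the specific target CDM table.
needed_fields = ["eid", "31-0.0", "34-0.0"]

# usecols makes pandas parse and keep only the listed columns, so the rest
# of the (potentially very wide) baseline export never ends up in memory.
baseline = pd.read_csv("baseline.csv", usecols=needed_fields, dtype=str)
```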

Table loading is performed using the wrapper method get_source_data(), which relies on the custom-defined class SourceData. Refactoring the latter is probably what we need: https://github.com/EHDEN/ETL-UK-Biobank/blob/master/src/main/python/core/source_data.py

MaximMoinat commented 3 years ago

Agreed, we should not use the SourceData class for this file. Please use pandas.read_csv (as is done for the other transformations already).

We should also think about the memory usage of pandas, as it loads all rows in memory. See for instance this post with some ways around that.

I am thinking about an approach using a source database into which we load all csv files to start with. This would validate the input and allow us to use some simple queries to load the data needed for a transformation.
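A minimal sketch of that idea, assuming SQLite as the source database; the table names, file names and query columns are illustrative only:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("source.db")

# Load every source csv into its own table; reading in chunks keeps
# memory usage bounded even for very large files.
for table, path in [("baseline", "baseline.csv"), ("hesin_diag", "hesin_diag.csv")]:
    for chunk in pd.read_csv(path, chunksize=100_000, dtype=str):
        chunk.to_sql(table, conn, if_exists="append", index=False)

# A transformation then selects only the columns (and rows) it needs.
diagnoses = pd.read_sql_query(
    "SELECT eid, ins_index, diag_icd10 FROM hesin_diag", conn
)
```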

MaximMoinat commented 3 years ago

Instead of wrapper.get_source_data, add a function wrapper.get_dataframe (with a usecols argument).
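Something along these lines, perhaps (a sketch only; the simplified Wrapper class and its constructor are assumptions, only the get_dataframe name and the usecols parameter come from this comment):

```python
from pathlib import Path
from typing import List, Optional

import pandas as pd

class Wrapper:
    def __init__(self, source_dir: Path):
        self.source_dir = source_dir  # directory holding the source csv files

    def get_dataframe(self, file_name: str,
                      usecols: Optional[List[str]] = None) -> pd.DataFrame:
        """Read one source csv, keeping only the requested columns in memory."""
        return pd.read_csv(self.source_dir / file_name, usecols=usecols, dtype=str)

# e.g. wrapper.get_dataframe("baseline.csv", usecols=["eid", "31-0.0"])
```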

alepev commented 3 years ago

Nice, the chunksize argument in pandas.read_csv() alone should already be enough in most cases (i.e. when rows are processed independently from each other) :) For expensive aggregations, I also came across the dask library. And if that is still not enough, we can try the database approach :)
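A sketch of the chunked variant, assuming rows really are processed independently; transform_rows() and the field name are placeholders:

```python
import pandas as pd

def transform_rows(chunk: pd.DataFrame):
    """Placeholder for the per-row mapping to target CDM records."""
    for _, row in chunk.iterrows():
        yield {"person_id": row["eid"]}  # hypothetical field name

# Only one chunk (~100k rows) is held in memory at any time, which is
# sufficient as long as rows do not depend on each other.
for chunk in pd.read_csv("baseline.csv", chunksize=100_000, dtype=str):
    for record in transform_rows(chunk):
        pass  # e.g. write the record to the target CDM table
```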

MaximMoinat commented 3 years ago

Currently, the issue is resolved for the baseline table by using a generator of dicts.
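Presumably something along these lines (a sketch, not the actual project code): csv.DictReader already yields rows lazily as dicts, so only one row is in memory at a time.

```python
import csv
from typing import Dict, Iterator

def baseline_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield the baseline table one row at a time as a field -> value dict."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

# The transformation consumes the rows lazily instead of a full dataframe.
for row in baseline_rows("baseline.csv"):
    person_id = row["eid"]  # hypothetical field name
```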

But the issue remains for the HES transformations and GP to visit occurrence, which use pandas operations (e.g. merge). The HES tables contain millions of records, the GP tables hundreds of millions. To avoid loading a whole table as a pandas dataframe, we would have to rewrite part of these transformations.
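One possible direction for the merge-heavy transformations, following the earlier dask suggestion, is dask.dataframe, which offers a pandas-like merge evaluated out of core; the file and column names below are assumptions:

```python
import dask.dataframe as dd

# Both tables are read lazily; nothing is loaded until the result is
# materialised, and the merge is executed partition by partition.
hes_main = dd.read_csv("hesin.csv", dtype=str)
hes_diag = dd.read_csv("hesin_diag.csv", dtype=str)

visits = hes_main.merge(hes_diag, on=["eid", "ins_index"], how="left")
result = visits.compute()  # or write out per partition if the result is large
```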

MaximMoinat commented 3 years ago

Implemented for some other transformations as well.