Closed alepev closed 3 years ago
Agreed, we should not use the `SourceData` class for this file. Please use `pandas.read_csv` (as is done for the other transformations already).
We should also think about the memory usage of pandas, since it loads all rows into memory. See for instance this post for some ways around that.
I am thinking about an approach using a source database where we load all csv files in to start with. This would validate the input and allow us to do some simple queries to load in the data for a transformation.
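The source-database idea could be sketched roughly as below, assuming SQLite as the backing store; the function names, chunk size, and table layout are illustrative, not the project's actual design:

```python
import sqlite3

import pandas as pd


def load_csv_to_db(csv_path: str, table_name: str, conn: sqlite3.Connection) -> None:
    """Load a source csv into the database in chunks, never holding the full file in memory."""
    for chunk in pd.read_csv(csv_path, chunksize=100_000):
        chunk.to_sql(table_name, conn, if_exists="append", index=False)


def query_columns(table_name: str, columns: list, conn: sqlite3.Connection) -> pd.DataFrame:
    """Pull only the columns a transformation needs via a simple query."""
    cols = ", ".join(columns)
    return pd.read_sql(f"SELECT {cols} FROM {table_name}", conn)
```

Loading would also serve as an implicit validation step, since malformed rows fail at `read_csv` time rather than deep inside a transformation.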
Instead of `wrapper.get_source_data`, add a function `wrapper.get_dataframe` (with a `usecols` argument).
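A minimal sketch of such a wrapper method, assuming the wrapper knows the folder holding the source csv files (the class shape and attribute names here are assumptions for illustration):

```python
import os

import pandas as pd


class Wrapper:
    def __init__(self, source_folder: str):
        self.source_folder = source_folder

    def get_dataframe(self, file_name: str, usecols=None) -> pd.DataFrame:
        """Read a source csv, loading only the requested columns to keep memory usage low."""
        return pd.read_csv(os.path.join(self.source_folder, file_name), usecols=usecols)
```

Callers would then pass only the fields needed by their specific transformation, e.g. `wrapper.get_dataframe('baseline.csv', usecols=['eid'])`.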
Nice, the `chunksize` argument in `pandas.read_csv()` alone should already be enough in most cases (i.e. when processing rows independently of each other) :) For expensive aggregations, I also found out about the `dask` library.
And then if it's still not enough we can try the database approach :)
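The `chunksize` approach mentioned above could look something like this; the file name and the per-chunk work are placeholders:

```python
import pandas as pd


def process_in_chunks(csv_path: str, chunksize: int = 100_000) -> int:
    """Iterate over a csv in fixed-size chunks; each chunk is an ordinary DataFrame,
    so rows can be processed without ever loading the full file into memory."""
    total = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        total += len(chunk)  # placeholder for the real per-chunk transformation
    return total
```

This works well when each row (or chunk) can be transformed independently; cross-chunk operations such as joins or aggregations would need `dask` or the database approach instead.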
Currently, the issue is resolved for the `baseline` table by using a generator of dicts.
But the issue remains for the HES transformations and the GP to visit occurrence transformation, which use pandas operations (e.g. `merge`). HES tables contain millions of records, GP tables hundreds of millions. To avoid loading the whole table as a pandas DataFrame, we have to rewrite parts of these transformations.
Implemented for some other transformations as well.
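The generator-of-dicts approach used for `baseline` could be sketched as below, using only the standard library; the field names in the usage comment are illustrative:

```python
import csv
from typing import Dict, Iterator


def rows_as_dicts(csv_path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time as a dict keyed by column name,
    so memory usage stays constant regardless of table size."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            yield row


# Usage: for row in rows_as_dicts('baseline.csv'): map row['eid'] etc. to the target CDM record.
```

This streams rows one at a time, which is exactly why it does not help for operations like `merge` that need whole tables at once.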
At the moment we are implementing transformations from `baseline` using synthetic data with only a subset of fields (the subset depends on the target CDM table), so the table is quite small. Later on we expect to map from a `baseline` export containing ALL fields (or at least the prioritized ones); if we load this table into memory during a transformation, as happens now, performance will likely be affected.

To prevent this, we could implement a filtering step during table loading to keep in memory only the fields we need for the specific transformation; for example, with `pandas.read_csv()` this can be done with the `usecols` argument, but our current implementation is based on a different library, so we need to find an equivalent.

Table loading is performed using the wrapper method `get_source_data()`, which relies on the custom-defined class `SourceData`. Refactoring the latter is probably what we need: https://github.com/EHDEN/ETL-UK-Biobank/blob/master/src/main/python/core/source_data.py