HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

Dataset copy in non-inplace changes #68

Open alessiamarcolini opened 4 years ago

alessiamarcolini commented 4 years ago

My idea would be to use Pandas for copying the DataFrame (which is the largest object held in memory by the instance).

Looking at the docs they say (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html):

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
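A small sketch of the behavior the docs describe: with an object-dtype column, df.copy(deep=True) copies the array of references, not the nested Python objects, so mutating a nested object through the copy is visible through the original.

```python
import pandas as pd

# An object-dtype column holding Python lists
df = pd.DataFrame({"vals": [[1, 2], [3, 4]]})

copied = df.copy(deep=True)

# Mutate the nested list through the copy...
copied.loc[0, "vals"].append(99)

# ...and the change is visible through the original, because only
# the references were copied, not the list objects themselves.
```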

So it could be more memory-efficient. According to https://stackoverflow.com/questions/9058305/getting-attributes-of-a-class , a possible implementation could be to gather all attributes except the _df attribute (which will be copied with pd.DataFrame.copy()) and create a new instance with copy.deepcopy() of those attributes (not _df):

import copy

# Inspect the instance attributes that would be copied
for attribute, value in original_d.__dict__.items():
    print(attribute, '=', value)

# Deep-copy every instance attribute except `_df`
# (note: the instance __dict__, not the class __dict__,
# which is a read-only mappingproxy)
copied_d = Dataset()
copied_d.__dict__ = {
    key: copy.deepcopy(value)
    for key, value in original_d.__dict__.items()
    if key != '_df'
}
# Copy the DataFrame itself with the cheaper pandas copy
copied_d._df = original_d._df.copy()
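Wrapped up as a copy() method, the proposal could look like the sketch below. The Dataset stand-in here is a minimal assumption (a _df plus a metadata dict); the real class internals may differ.

```python
import copy

import pandas as pd


class Dataset:
    """Minimal stand-in for the real Dataset class (assumed internals)."""

    def __init__(self, df=None):
        self._df = df if df is not None else pd.DataFrame()
        self.metadata = {"source": "unknown"}

    def copy(self):
        # Bypass __init__ and rebuild the instance __dict__:
        # deep-copy everything except the (large) `_df`, which
        # pandas can copy more efficiently on its own.
        new = self.__class__.__new__(self.__class__)
        new.__dict__ = {
            key: copy.deepcopy(value)
            for key, value in self.__dict__.items()
            if key != "_df"
        }
        new._df = self._df.copy()
        return new


original = Dataset(pd.DataFrame({"a": [1, 2, 3]}))
duplicate = original.copy()

duplicate._df.loc[0, "a"] = 99         # does not affect the original
duplicate.metadata["source"] = "copy"  # independent metadata dict
```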

What do you think? Does it make sense?

_Originally posted by @lorenz-gorini in https://github.com/HK3-Lab-Team/pytrousse/pull/67#discussion_r493494867_

lorenz-gorini commented 4 years ago

Requires #38 and #39 to be fixed (removing the possibility of creating a Dataset from either an existing pd.DataFrame or a CSV file)