ARM-software / trappy

This repository has moved to https://gitlab.arm.com/tooling/trappy
Apache License 2.0

trace/base: Optimizing DataFrame memory footprint #290

Closed derkling closed 5 years ago

derkling commented 5 years ago

Under the hood, pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. Values of the same column are represented using the same type and thus the same number of bytes.

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes.

Use the function pd.to_numeric() to downcast numeric columns so that each value uses the minimum number of bytes that is still enough to represent the maximum value in the column.
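A minimal sketch of the downcasting step, not the code from this patch. The column names and the helper downcast_numeric() are made up for illustration; only pd.to_numeric() and its downcast argument come from pandas itself. Note that downcasting floats can trade precision for space, so it may not be appropriate for every column.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, loosely modelled on a scheduler event.
df = pd.DataFrame({
    "next_pid": np.arange(1000, dtype=np.int64),        # values 0..999
    "prio": np.full(1000, 120, dtype=np.int64),         # constant 120
    "load": np.linspace(0.0, 1.0, 1000, dtype=np.float64),
})

def downcast_numeric(frame):
    """Return a copy with integer and float columns downcast where possible."""
    out = frame.copy()
    for col in out.columns:
        kind = out[col].dtype.kind
        if kind == "i":
            # Unsigned types give one more bit of range for non-negative data.
            target = "unsigned" if out[col].min() >= 0 else "signed"
            out[col] = pd.to_numeric(out[col], downcast=target)
        elif kind == "f":
            out[col] = pd.to_numeric(out[col], downcast="float")
        # Any other dtype (object, category, ...) is left untouched.
    return out

small = downcast_numeric(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(before, after)
```

Comparing memory_usage(deep=True) before and after is one way to check whether the downcast actually paid off for a given event.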

Also use "Categoricals", available in pandas since version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Use category to compress the representation of string values by replacing 64-bit string pointers with an index that uses fewer bits.
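As an illustration of the categorical conversion, here is a small sketch on a made-up "comm" column with only three distinct strings. The data is invented; the pandas calls (astype("category"), .cat.codes, memory_usage) are standard.

```python
import pandas as pd

# Hypothetical task-name column: many rows, few distinct values.
comm = pd.Series(["kworker/0:1", "swapper/0", "sh", "swapper/0"] * 25000,
                 name="comm")

as_cat = comm.astype("category")

before = comm.memory_usage(deep=True)
after = as_cat.memory_usage(deep=True)

# Each row is now a small integer code indexing into the category table.
print(as_cat.cat.categories)
print(as_cat.cat.codes.dtype)
print(before, after)
```

The saving depends on the ratio of distinct values to rows: a column where nearly every string is unique would gain little or nothing from this conversion.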

Credit goes to:

Using pandas with Large Data Sets https://www.dataquest.io/blog/pandas-big-data/

where these changes are proposed and discussed in detail.

The proposed change applied to a 473M example trace gives the following results:

               | Events |  Memory (MB)  |  Compression
               |  count | Before  After |      percent
---------------+--------+---------------+-------------
clock_disable  |  18368 |   3.34   0.88 |    73.652695
clock_enable   |  19149 |   3.43   0.92 |    73.177843
clock_set_rate |  42099 |   7.63   2.01 |    73.656619
cpu_idle       | 272726 |  28.87  12.74 |    55.871147
sched_switch   | 315951 |  82.86  23.26 |    71.928554
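The compression column appears to be the relative saving, 100 * (before - after) / before. That formula is inferred from the figures rather than stated in the commit message; the sketch below recomputes it from the before/after values reported above, with compression_percent() being a hypothetical helper name.

```python
def compression_percent(before_mb, after_mb):
    """Relative memory saving, in percent of the original footprint."""
    return 100.0 * (before_mb - after_mb) / before_mb

# (before, after) in MB, copied from the results above.
results = {
    "clock_disable": (3.34, 0.88),
    "clock_enable": (3.43, 0.92),
    "clock_set_rate": (7.63, 2.01),
    "cpu_idle": (28.87, 12.74),
    "sched_switch": (82.86, 23.26),
}

for event, (before_mb, after_mb) in results.items():
    print("{:<15} {:10.6f}".format(event, compression_percent(before_mb, after_mb)))
```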

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>

derkling commented 5 years ago

Regarding the need to first build the DF, I agree. However, this patch's main goal is to reduce the memory footprint for live traces and (perhaps) speed up a bit the analysis we run once the DFs are in memory... although I did not measure that.

A relatively simple extension could be to store the column dtypes in the cache. That would allow loading the CSV with the correct dtypes from the second use of a trace onwards.
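A rough sketch of that extension, assuming the cache keeps one CSV per event plus a small JSON blob of dtypes next to it. The helpers dtypes_to_json() and read_csv_with_dtypes() are hypothetical names and are not part of trappy's cache code; pd.read_csv(dtype=...) is the real pandas hook that makes this possible.

```python
import io
import json
import pandas as pd

def dtypes_to_json(frame):
    """Serialise column dtypes as strings, e.g. {"cpu": "uint8", ...}."""
    return json.dumps({col: str(dtype) for col, dtype in frame.dtypes.items()})

def read_csv_with_dtypes(csv_source, dtypes_json):
    """Load a cached CSV directly with the previously optimised dtypes."""
    return pd.read_csv(csv_source, dtype=json.loads(dtypes_json))

# First use of a trace: build the DataFrame, optimise it, write the cache.
df = pd.DataFrame({
    "cpu": [0, 1, 0],
    "state": [1, 4294967295, 2],
    "comm": ["sh", "sh", "cat"],
})
df["cpu"] = pd.to_numeric(df["cpu"], downcast="unsigned")
df["comm"] = df["comm"].astype("category")

csv_text = df.to_csv(index=False)   # stands in for the cached CSV file
meta = dtypes_to_json(df)           # stands in for the cached metadata

# Second use: no intermediate full-size DataFrame is built.
reloaded = read_csv_with_dtypes(io.StringIO(csv_text), meta)
print(reloaded.dtypes)
```

This only works for dtypes whose string form read_csv accepts; anything more exotic would need its own encoding in the metadata.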

derkling commented 5 years ago

Is anyone brave enough to merge this one? :)

valschneider commented 5 years ago

Hmm Travis doesn't seem to be happy all of a sudden, I restarted a round of tests just to make sure.

valschneider commented 5 years ago

Huh so apparently we get this DataFrame:

Time
0.000102     NaN
0.000110     NaN
0.674103    None
0.674115    None
0.678762    None
0.678785    None
0.679127    None
0.681465    None
0.681766       1
0.681778    None
0.681787    None
0.683561       1
0.683991       1
0.685980    None
0.686025    None
0.686068    None
0.686093    None
0.686186    None
0.687742    None
0.687784    None
0.688318    None
0.688932    None
0.688983    None
0.689777    None
0.690072    None
0.690085    None
0.690120    None
0.691775    None
0.691787    None
0.691881    None
            ... 
Name: data, Length: 160, dtype: object

And self.assertTrue(isnan(dfr['data'].iloc[2])) is failing because we get a None, not a NaN... Seems like the test code before your change could handle that. So this might have been a case where we used to do the string optimization for non-string columns (dtype is just object).
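For reference, a small reproduction of the None vs NaN distinction on an object column. It assumes the test's isnan behaves like math.isnan (numpy.isnan rejects None as well); the Series and the helper strict_is_nan() are made up for illustration.

```python
import math
import numpy as np
import pandas as pd

# An object column can hold NaN and None side by side.
col = pd.Series([np.nan, None, 1], dtype=object, name="data")

def strict_is_nan(value):
    """math.isnan() that reports False instead of raising on non-floats."""
    try:
        return math.isnan(value)
    except TypeError:
        # math.isnan(None) raises TypeError rather than returning False.
        return False

print(strict_is_nan(col.iloc[0]))   # NaN entry
print(strict_is_nan(col.iloc[1]))   # None entry
print(pd.isnull(col.iloc[1]))       # pandas treats both as missing
```

pd.isnull() accepts both kinds of missing value, which is one way a test can stay agnostic about which of the two a column ends up holding.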

valschneider commented 5 years ago

Right, and with the previous version (no else: continue) we get this dataframe:

Time
0.000102    NaN
0.000110    NaN
0.674103    NaN
0.674115    NaN
0.678762    NaN
0.678785    NaN
0.679127    NaN
0.681465    NaN
0.681766      1
0.681778    NaN
0.681787    NaN
0.683561      1
0.683991      1
0.685980    NaN
0.686025    NaN
0.686068    NaN
0.686093    NaN
0.686186    NaN
0.687742    NaN
0.687784    NaN
0.688318    NaN
0.688932    NaN
0.688983    NaN
0.689777    NaN
0.690072    NaN
0.690085    NaN
0.690120    NaN
0.691775    NaN
0.691787    NaN
0.691881    NaN
           ... 
Name: data, Length: 160, dtype: category
Categories (5, object): [0, 1, 2, 25601320, 37]

So I think the right thing to do is to keep that else: continue but revert the changes to the tests.
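The difference between the two dumps is consistent with how pandas categoricals store missing values: converting an object column to category maps None to NaN, while leaving the column as object keeps None as it is. The sketch below uses made-up data and is not the trappy code path.

```python
import math
import pandas as pd

# Object column with None placeholders, as in the first dump.
raw = pd.Series([None, None, 1, None, 37], dtype=object, name="data")

# Converting to category: missing entries read back as NaN.
cat = raw.astype("category")

print(raw.iloc[0] is None)       # object column still holds None
print(math.isnan(cat.iloc[0]))   # categorical column yields NaN
print(cat.cat.categories)        # only the non-missing values are categories
```

So whether an isnan-based assertion passes depends on whether a given object column goes through the categorical conversion or is skipped.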

derkling commented 5 years ago

@qais-yousef changes went in before... let's do the DF optimization once the index has been fixed ;)

qais-yousef commented 5 years ago

I've created this pull request to pull the new changes in trappy to lisa

https://github.com/ARM-software/lisa/pull/890