Make the 'ID' and 'TYPE' columns pd.Categorical instead of str, to reduce the memory spike when using pd.pivot_table in sam_format_to_wide.
In tests with a 534.9 MB dataframe in sam format, making ID and TYPE categorical reduced its size to 352.3 MB, and the memory spike when pivoting that table fell by 17% on average. Please treat these figures as approximate, since memory spikes are difficult to monitor.
The pandas implementation with categorical variables seems to be more stable, in terms of memory spikes, than alternative implementations in dask or polars.
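A minimal sketch of the idea, not the actual code in sam_format_to_wide: it builds a small synthetic long-format frame, casts 'ID' and 'TYPE' to categorical, and compares memory usage before pivoting. The 'TIME' and 'VALUE' column names, the sensor and type labels, and the frame size are assumptions made for illustration only, and the numbers it prints will not match the figures above.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame; TIME and VALUE are assumed column names.
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [
        pd.date_range("2024-01-01", periods=100, freq="h"),
        [f"sensor_{i}" for i in range(50)],
        [f"type_{j}" for j in range(4)],
    ],
    names=["TIME", "ID", "TYPE"],
)
df = index.to_frame(index=False)
df["VALUE"] = rng.normal(size=len(df))

# Memory footprint with plain string columns.
str_bytes = df.memory_usage(deep=True).sum()

# Cast the two low-cardinality columns to categorical.
df_cat = df.astype({"ID": "category", "TYPE": "category"})
cat_bytes = df_cat.memory_usage(deep=True).sum()

print(f"str columns:         {str_bytes / 1e6:.2f} MB")
print(f"categorical columns: {cat_bytes / 1e6:.2f} MB")

# Pivot both frames to wide format. observed=True keeps only the
# category combinations that actually occur in the data.
wide_str = pd.pivot_table(df, index="TIME", columns=["ID", "TYPE"], values="VALUE")
wide_cat = pd.pivot_table(
    df_cat, index="TIME", columns=["ID", "TYPE"], values="VALUE", observed=True
)
```

Both pivots should contain the same values; only the dtype of the column labels differs. This sketch measures the size of the input frame only, not the peak memory during the pivot, which needs an external profiler.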