matt-graham opened this issue 6 months ago
Even nicer (and potentially something we could factor out into a reusable library) would be to wrap up the bitset handler functionality with variable bit-width into a Pandas extension type, similar to the built-in categorical type, to allow something like
```python
>>> series = pd.Series([{"a", "c"}, {"c"}, {}, {"a", "b"}], dtype=BitsetDType(elements=["a", "b", "c"]))
>>> "a" in series
0     True
1    False
2    False
3     True
dtype: bool
>>> series | {"a"}
0         {"a", "c"}
1         {"a", "c"}
2              {"a"}
3    {"a", "b", "c"}
dtype: BitsetDType
```
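The `BitsetDType` API above is hypothetical, but for universes of at most 64 elements the representation such an extension type could use underneath is just one integer bitmask per row. A minimal sketch of that encoding (element names taken from the example above; the helper names `bit_for`, `encode` are made up for illustration):

```python
import numpy as np

# Hypothetical element universe, matching the example above.
elements = ["a", "b", "c"]
bit_for = {name: np.uint64(1 << i) for i, name in enumerate(elements)}

def encode(s):
    """Encode a set of element names as a 64-bit integer bitmask."""
    mask = np.uint64(0)
    for name in s:
        mask |= bit_for[name]
    return mask

data = np.array([encode(s) for s in [{"a", "c"}, {"c"}, set(), {"a", "b"}]])

# Vectorised membership test: is "a" in each row's set?
contains_a = (data & bit_for["a"]) != 0  # → [True, False, False, True]

# Vectorised union with {"a"}: set the "a" bit in every row.
unioned = data | bit_for["a"]
```

Membership tests and unions then become single vectorised integer operations over the whole column, rather than per-row Python set operations.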
With the merging of #1448, we now have this functionality available. So we should be able to work through the codebase (on a per-module basis) and replace both the `BitsetHandler` class instances and the columns in the dataframe mentioned here with our custom `BitsetDtype` extension.
Inspecting the population dataframe for the `fullmodel` set of modules (reported using the `DataFrame.info` method), all six of the columns with `object` datatype correspond to properties defined in the `RTI` module which are used to contain lists of injury codes:

https://github.com/UCL/TLOmodel/blob/bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d/src/tlo/methods/rti.py#L1051-L1058
https://github.com/UCL/TLOmodel/blob/bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d/src/tlo/methods/rti.py#L1079-L1080
For a population dataframe with 51,000 rows, the memory usage of each of these columns is around 2.3 MiB, compared to 0.4 MiB for a corresponding column of 64-bit integer or floating-point values.
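The disparity is easy to reproduce. A rough sketch of the comparison (the list contents here are made up for illustration, not real injury codes):

```python
import numpy as np
import pandas as pd

n_rows = 51_000

# Object-dtype column of small lists, mimicking the RTI injury-code columns.
obj_col = pd.Series([["112", "674"] for _ in range(n_rows)], dtype=object)

# A plain 64-bit integer column for comparison.
int_col = pd.Series(np.zeros(n_rows, dtype=np.int64))

obj_mib = obj_col.memory_usage(deep=True) / 2**20
int_mib = int_col.memory_usage(deep=True) / 2**20
# int_mib is ~0.4 MiB (8 bytes per row); obj_mib is several times larger,
# since each row stores a pointer plus a separate heap-allocated list object.
```

Note that `deep=True` is needed to count the per-row Python objects; the shallow figure for an object column only counts the 8-byte pointers.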
While these columns currently use a list type, in reality I believe they are used as sets, with the lists storing unique injury codes without duplicates and in arbitrary order. This suggests a bitset representation would be more parsimonious memory-wise while also potentially reducing the cost of operations to update the sets or check membership. However, the number of injury codes currently defined in
https://github.com/UCL/TLOmodel/blob/bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d/src/tlo/methods/rti.py#L66-L73
is 97, which is too many to use with a bitset backed by a 64-bit integer representation, and there aren't any 128-bit integer types available in Pandas (nor among the additional types available via PyArrow).
One option would be to extend `BitsetHandler` to allow using columns with variable bit-width via a fixed-length string / bytes dtype. For example, we could create a series of 10 entries with 13 bytes = 104 bits per value (13 being the minimum number of bytes required to represent 97 injury codes), and then view it as an array of unsigned integers / bytes. This view can be updated in-place using the bitwise operators `&=` and `|=`, as required by the update operations in the current `BitsetHandler` implementation, with the updates propagating to the underlying byte array (as they act on a view rather than a copy). As well as allowing us to represent bitsets with more than 64 elements, this would also allow a more parsimonious representation for bitsets which require fewer elements.
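The code blocks referred to above appear to have been lost from this copy of the issue, but the mechanics can be sketched in plain NumPy (whether pandas preserves a fixed-width `S13` dtype in a `Series` or coerces it to `object` would need checking; the helper names `set_code` / `has_code` are made up for illustration):

```python
import numpy as np

n_codes = 97                  # number of distinct injury codes
n_bytes = (n_codes + 7) // 8  # minimum bytes needed per bitset → 13

# Fixed-width bytes array: one 13-byte bitset per row, initially all zeros.
bitsets = np.zeros(10, dtype=f"S{n_bytes}")

# View the same memory as unsigned bytes: one row of 13 uint8 values per entry.
view = bitsets.view(np.uint8).reshape(len(bitsets), n_bytes)

def set_code(view, rows, code):
    """Set the bit for `code` in the given rows, updating the bytes in-place."""
    view[rows, code // 8] |= np.uint8(1 << (code % 8))

def clear_code(view, rows, code):
    """Clear the bit for `code` in the given rows, in-place."""
    view[rows, code // 8] &= np.uint8(~(1 << (code % 8)) & 0xFF)

def has_code(view, code):
    """Vectorised membership test for `code` across all rows."""
    return (view[:, code // 8] & np.uint8(1 << (code % 8))) != 0

set_code(view, [0, 3], 96)    # add the highest-numbered code to rows 0 and 3
print(has_code(view, 96))     # rows 0 and 3 now test True
```

Because `view` shares memory with `bitsets`, the `|=` and `&=` updates are reflected directly in the original fixed-width bytes array, matching the in-place update pattern the current `BitsetHandler` relies on.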