matt-graham opened this issue 6 months ago
Even nicer (and potentially something we could factor out into a reusable library) would be to wrap up the bitset handler functionality with variable bit-width into a Pandas extension type, similar to the built-in categorical type, to allow something like
```python
>>> series = pd.Series([{"a", "c"}, {"c"}, {}, {"a", "b"}], dtype=BitsetDType(elements=["a", "b", "c"]))
>>> "a" in series
0     True
1    False
2    False
3     True
dtype: bool
>>> series | {"a"}
0         {"a", "c"}
1         {"a", "c"}
2              {"a"}
3    {"a", "b", "c"}
dtype: BitsetDType
```
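The `BitsetDType` API above is hypothetical, but for universes of at most 64 elements the representation such an extension type could use underneath is just one integer bitmask per row. A minimal sketch of that encoding (element names taken from the example above; the helper names `bit_for`, `encode` are made up for illustration):

```python
import numpy as np

# Hypothetical element universe, matching the example above.
elements = ["a", "b", "c"]
bit_for = {name: np.uint64(1 << i) for i, name in enumerate(elements)}

def encode(s):
    """Encode a set of element names as a 64-bit integer bitmask."""
    mask = np.uint64(0)
    for name in s:
        mask |= bit_for[name]
    return mask

data = np.array([encode(s) for s in [{"a", "c"}, {"c"}, set(), {"a", "b"}]])

# Vectorised membership test: is "a" in each row's set?
contains_a = (data & bit_for["a"]) != 0  # → [True, False, False, True]

# Vectorised union with {"a"}: set the "a" bit in every row.
unioned = data | bit_for["a"]
```

Membership tests and unions then become single vectorised integer operations over the whole column, rather than per-row Python set operations.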
With the merging of #1448, we now have this functionality available. So we should be able to work through the codebase (on a per-module basis) and replace both the `BitsetHandler` class instances and the columns in the dataframe mentioned here with our custom `BitsetDtype` extension.
Inspecting the population dataframe for the `fullmodel` set of modules (reported using the `DataFrame.info` method), all six of the columns with `object` datatype correspond to properties defined in the `RTI` module which are used to contain lists of injury codes:

https://github.com/UCL/TLOmodel/blob/bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d/src/tlo/methods/rti.py#L1051-L1058
https://github.com/UCL/TLOmodel/blob/bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d/src/tlo/methods/rti.py#L1079-L1080
For a population dataframe with 51,000 rows, the memory usage of each of these columns is around 2.3 MiB, compared to 0.4 MiB for a corresponding column of 64-bit integer or floating-point values.
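The disparity is easy to reproduce. A rough sketch of the comparison (the list contents here are made up for illustration, not real injury codes):

```python
import numpy as np
import pandas as pd

n_rows = 51_000

# Object-dtype column of small lists, mimicking the RTI injury-code columns.
obj_col = pd.Series([["112", "674"] for _ in range(n_rows)], dtype=object)

# A plain 64-bit integer column for comparison.
int_col = pd.Series(np.zeros(n_rows, dtype=np.int64))

obj_mib = obj_col.memory_usage(deep=True) / 2**20
int_mib = int_col.memory_usage(deep=True) / 2**20
# int_mib is ~0.4 MiB (8 bytes per row); obj_mib is several times larger,
# since each row stores a pointer plus a separate heap-allocated list object.
```

Note that `deep=True` is needed to count the per-row Python objects; the shallow figure for an object column only counts the 8-byte pointers.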
While these columns currently use a list type, in reality I believe they are used as sets, with the lists storing unique injury codes without duplicates and in arbitrary order. This suggests a bitset representation would be more parsimonious memory-wise while also potentially reducing the cost of operations to update the sets or check membership. However, the number of injury codes currently defined in
https://github.com/UCL/TLOmodel/blob/bd0a7632e1b0add9ae0ae8d3afc2a151bdd15f7d/src/tlo/methods/rti.py#L66-L73
is 97, which is too many to use with a bitset backed by a 64-bit integer representation, and there aren't any 128-bit integer types available in Pandas (nor among the additional types available via PyArrow).
One option would be to extend `BitsetHandler` to allow using columns with variable bit-width via a fixed-length string / bytes dtype. For example, we could create a series of 10 entries with 13 bytes = 104 bits per value (13 being the minimum number of bytes required to represent 97 injury codes), and then view it as an array of unsigned integers / bytes. This view can be updated in-place using the bitwise operators `&=` and `|=`, as required by the update operations in the current `BitsetHandler` implementation, with the updates propagating to the underlying byte array (as they act on a view rather than a copy). As well as allowing us to represent bitsets with more than 64 elements, this would also allow a more parsimonious representation for bitsets which require fewer elements.
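The code blocks referred to above appear to have been lost from this copy of the issue, but the mechanics can be sketched in plain NumPy (whether pandas preserves a fixed-width `S13` dtype in a `Series` or coerces it to `object` would need checking; the helper names `set_code` / `has_code` are made up for illustration):

```python
import numpy as np

n_codes = 97                  # number of distinct injury codes
n_bytes = (n_codes + 7) // 8  # minimum bytes needed per bitset → 13

# Fixed-width bytes array: one 13-byte bitset per row, initially all zeros.
bitsets = np.zeros(10, dtype=f"S{n_bytes}")

# View the same memory as unsigned bytes: one row of 13 uint8 values per entry.
view = bitsets.view(np.uint8).reshape(len(bitsets), n_bytes)

def set_code(view, rows, code):
    """Set the bit for `code` in the given rows, updating the bytes in-place."""
    view[rows, code // 8] |= np.uint8(1 << (code % 8))

def clear_code(view, rows, code):
    """Clear the bit for `code` in the given rows, in-place."""
    view[rows, code // 8] &= np.uint8(~(1 << (code % 8)) & 0xFF)

def has_code(view, code):
    """Vectorised membership test for `code` across all rows."""
    return (view[:, code // 8] & np.uint8(1 << (code % 8))) != 0

set_code(view, [0, 3], 96)    # add the highest-numbered code to rows 0 and 3
print(has_code(view, 96))     # rows 0 and 3 now test True
```

Because `view` shares memory with `bitsets`, the `|=` and `&=` updates are reflected directly in the original fixed-width bytes array, matching the in-place update pattern the current `BitsetHandler` relies on.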