UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License
13 stars 5 forks source link

Make `BitsetHandler` variable width and a pandas extension #1448

Closed willGraham01 closed 1 month ago

willGraham01 commented 3 months ago

Concerns #1316 |

Introduces the functionality for us to store bitsets as an extension to pandas, rather than having to use the BitSetHandler as an intermediary class.

bitset series (and thus dataframe columns) are stored as fixed-width bytestrings, where each character is represented by a np.uint8. The size of the bytestring (number of "characters") is chosen when the bitset is created to use the minimum number of bytes for each entry (IE we use 1 byte per 8 possible entries in a bitset).

The BitsetArray provides the necessary ExtensionArray instance for pandas series and dataframes. The operators like <=, ==, <, etc have been overloaded to perform their entry-wise set equivalents, so users should be able to interact with the series in the following manner:

bdt = BitsetDtype(0, 1, 2, 3, 4, 5, 6, 7, 8)
a_series = pd.Series([{"1"}, {"4", "8"}, {"2", "7", "3"}], dtype=big_bdt)

# Return a boolean series indicating equality
big_s == {"4", "8"}
# Return a boolean series indicating entry-wise superset-ness
# Note this is equivalent to the 'in' functionality requested in
# https://github.com/UCL/TLOmodel/issues/1316#issuecomment-2052071639,
# but the 'in' operator has a particular meaning for series so we can't use it for this context.
big_s >= {"3"}
# Add an item to an entry
big_s.loc[0] += {"0"}

There are some more usage examples in the bitset_extension.py file, which is currently my quick-and-easy live testing ground.

TODO:

Tests

The basic pandas suite for testing DtypeExtensions passes, however we need some more tests for the functionality we are including on top. Primarily tests for how we handle different inputs when assigning and comparing (since allowing sets to be input values upsets pandas, since they are also iterables and so can be interpreted in a wonky way).

willGraham01 commented 3 months ago

@matt-graham I won't tag you for review just yet since there's still those bugs mentioned in the PR, but any comments you have (particularly if this was something you were expecting in here) would be appreciated :grin:

willGraham01 commented 2 months ago

Yeah I think everything is ready - will drop a comment on https://github.com/UCL/TLOmodel/issues/1316 saying we can now go ahead with the plan in the issue & remove the BitsetHandler class in favour of direct column manipulation.

(Bumping for review again sorry since new commit = new review required. Also we're hitting an error in the checks on a file that is not touched in this PR)

matt-graham commented 1 month ago

With regards to the failing Pylint check - I did a bit of reading at it seems there are some on-going issues with handling definitions in .pyi files in NumPy specifically within Pylint (https://github.com/pylint-dev/astroid/pull/2375), so I would say as the error does appear to be a false positive here, the simplest solution would be to just add a comment to locally disable checking for E0611 in the relevant line, that is something like

-from numpy.dtypes import BytesDType
+from numpy.dtypes import BytesDType  # pylint: disable=E0611
willGraham01 commented 1 month ago

pylint update fixes everything - adding a reminder note here for me to press merge once CI passes.