Closed willGraham01 closed 1 month ago
@matt-graham I won't tag you for review just yet since there's still those bugs mentioned in the PR, but any comments you have (particularly if this was something you were expecting in here) would be appreciated :grin:
Yeah I think everything is ready - will drop a comment on https://github.com/UCL/TLOmodel/issues/1316 saying we can now go ahead with the plan in the issue & remove the BitsetHandler
class in favour of direct column manipulation.
(Bumping for review again sorry since new commit = new review required. Also we're hitting an error in the checks on a file that is not touched in this PR)
With regards to the failing Pylint check - I did a bit of reading at it seems there are some on-going issues with handling definitions in .pyi
files in NumPy specifically within Pylint (https://github.com/pylint-dev/astroid/pull/2375), so I would say as the error does appear to be a false positive here, the simplest solution would be to just add a comment to locally disable checking for E0611 in the relevant line, that is something like
-from numpy.dtypes import BytesDType
+from numpy.dtypes import BytesDType # pylint: disable=E0611
pylint update fixes everything - adding a reminder note here for me to press merge once CI passes.
Concerns #1316 |
Introduces the functionality for us to store bitsets as an extension to pandas, rather than having to use the
BitSetHandler
as an intermediary class.bitset
series (and thus dataframe columns) are stored as fixed-width bytestrings, where each character is represented by anp.uint8
. The size of the bytestring (number of "characters") is chosen when the bitset is created to use the minimum number of bytes for each entry (IE we use 1 byte per 8 possible entries in a bitset).The
BitsetArray
provides the necessaryExtensionArray
instance for pandas series and dataframes. The operators like<=, ==, <
, etc have been overloaded to perform their entry-wiseset
equivalents, so users should be able to interact with the series in the following manner:There are some more usage examples in the
bitset_extension.py
file, which is currently my quick-and-easy live testing ground.TODO:
Tests
The basic pandas suite for testing
DtypeExtension
s passes, however we need some more tests for the functionality we are including on top. Primarily tests for how we handle different inputs when assigning and comparing (since allowing sets to be input values upsets pandas, since they are also iterables and so can be interpreted in a wonky way).