pd.NA treated differently in `filter_array_like` with newest pandas version #504

DamianBarabonkovQC opened 2 years ago

DamianBarabonkovQC commented 2 years ago

Problem description

In older versions of pandas (before commit https://github.com/pandas-dev/pandas/commit/b2d54d9c16990bd8eaeacd4de24fc33cfdbfb43b), when filter_array_like encountered a pd.NA in a pandas BooleanArray, it treated it as False. In newer versions (after https://github.com/pandas-dev/pandas/commit/b2d54d9c16990bd8eaeacd4de24fc33cfdbfb43b), the pd.NA stays pd.NA, and casting the result to a NumPy array raises an error.

This relates to the pandas issue https://github.com/pandas-dev/pandas/issues/45249. The new behavior is a behavioral change in pandas, not a bug; the old behavior of treating pd.NA as False was the actual bug.
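A minimal sketch of the underlying conversion failure, independent of kartothek: a BooleanArray that still contains pd.NA cannot be converted to a `bool` NumPy array unless an `na_value` is given. The variable names `raised` and `filled` are only for illustration.

```python
import pandas as pd

boolean_array = pd.array([True, False, None], dtype="boolean")

# Converting to a plain bool ndarray fails while a missing value is present.
try:
    boolean_array.to_numpy(dtype=bool)
    raised = False
except ValueError:
    raised = True

# Supplying na_value tells pandas what to substitute for pd.NA.
filled = boolean_array.to_numpy(dtype=bool, na_value=False)

print(raised, filled)
```

Passing `na_value=False` here reproduces the old "NA counts as False" outcome explicitly rather than relying on it implicitly.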

Example code (ideally copy-pastable)


import pandas as pd
from kartothek.serialization import filter_array_like

boolean_array = pd.array([True, False, None], dtype="boolean")
# <BooleanArray>
# [True, False, <NA>]
# Length: 3, dtype: boolean

ret = filter_array_like(

print(boolean_array, ret)
# Newer pandas: ValueError: cannot convert to 'bool'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
# Older pandas: <BooleanArray>
#                          [True, False, <NA>]
#                          Length: 3, dtype: boolean [False  True  True]

xhochy commented 2 years ago

Is there anything that needs to be addressed regarding this in kartothek?

DamianBarabonkovQC commented 2 years ago

I have a hacky patch in filter_array_like that looks like:

    with np.errstate(invalid="ignore"):
        if op == "==":
            if pd.isnull(value):
                np.logical_and(pd.isnull(array_like), mask, out=out)
            else:
                # Compare first, then replace any NA in the result with False
                res_eq = array_like == value
                np.logical_and(res_eq.fillna(False), mask, out=out)

Basically, it fills any NA in the comparison result with False before passing it to np.logical_and.
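As a standalone illustration of the idea in the patch above, here is a rough sketch that does not depend on kartothek. The helper name `eq_na_as_false` is hypothetical, and the `hasattr` check is an assumption added so that plain NumPy inputs (whose comparison result has no `fillna`) also work; it is not taken from the actual patch.

```python
import numpy as np
import pandas as pd


def eq_na_as_false(array_like, value, mask=None):
    """Hypothetical helper (not part of kartothek): equality filter
    in which missing values in the comparison result count as False."""
    res_eq = array_like == value
    # Nullable pandas arrays return a BooleanArray that may contain pd.NA;
    # a plain ndarray has no fillna, so only fill when it is available.
    if hasattr(res_eq, "fillna"):
        res_eq = res_eq.fillna(False)
    res_eq = np.asarray(res_eq, dtype=bool)

    if mask is None:
        mask = np.ones(len(res_eq), dtype=bool)
    out = np.zeros(len(res_eq), dtype=bool)
    np.logical_and(res_eq, mask, out=out)
    return out


boolean_array = pd.array([True, False, None], dtype="boolean")
result = eq_na_as_false(boolean_array, False)
print(result)
```

Filling before the conversion keeps the NumPy side free of missing values, which is what avoids the ValueError on newer pandas.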