JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License

pd.NA treated differently in `filter_array_like` with newest pandas version #504

Open DamianBarabonkovQC opened 2 years ago

DamianBarabonkovQC commented 2 years ago

Problem description

In older versions of pandas (before pandas commit https://github.com/pandas-dev/pandas/commit/b2d54d9c16990bd8eaeacd4de24fc33cfdbfb43b), when `filter_array_like` encountered a pd.NA in a pandas BooleanArray, it treated it as False. In newer versions (after that commit), the pd.NA is kept as pd.NA, which raises an error when the result is cast to a NumPy array.

This relates to the pandas issue https://github.com/pandas-dev/pandas/issues/45249, which describes an intentional behavior change and not a bug. The old behavior of treating pd.NA as False was actually the bug.
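
To make the semantics concrete, here is a minimal sketch that uses only public pandas API and does not touch kartothek. The variable names are illustrative, and mapping NA to False is an assumption about what a filter should do with missing values, not something pandas decides for you:

```python
import pandas as pd

boolean_array = pd.array([True, False, None], dtype="boolean")

# Element-wise comparison on a nullable boolean array keeps missing
# entries missing instead of silently turning them into False.
res_eq = boolean_array == False

# Positions where the comparison result is missing.
na_positions = res_eq.isna()

# Converting to a plain NumPy bool array requires an explicit decision
# about NA; here we (by assumption) map it to False.
as_bool = res_eq.to_numpy(dtype=bool, na_value=False)

print(na_positions.tolist())  # [False, False, True]
print(as_bool.tolist())       # [False, True, False]
```

The point is that the caller, not pandas, has to choose the value that replaces NA before handing the result to NumPy.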

Example code (ideally copy-pastable)

import pandas as pd
from kartothek.serialization import filter_array_like

boolean_array = pd.array([True, False, None], dtype="boolean")
# <BooleanArray>
# [True, False, <NA>]
# Length: 3, dtype: boolean

ret = filter_array_like(
    boolean_array,
    "==",
    False,
)

print(boolean_array, ret)
# Newer pandas: ValueError: cannot convert to 'bool'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
# Older pandas: <BooleanArray>
#                          [True, False, <NA>]
#                          Length: 3, dtype: boolean [False  True  True]

Used versions

```
# packages in environment at /opt/miniconda3/envs/nightly:
#
# Name Version Build Channel
abseil-cpp 20210324.2 he49afe7_0 conda-forge
alabaster 0.7.12 py_0 conda-forge
altair 4.2.0 pyhd8ed1ab_1 conda-forge
appdirs 1.4.4 pyh9f0ad1d_0 conda-forge
appnope 0.1.2 pypi_0 pypi
argon2-cffi 21.3.0 pyhd8ed1ab_0 conda-forge
argon2-cffi-bindings 21.2.0 pypi_0 pypi
arrow-cpp 6.0.1 py310h71bd60a_7_cpu conda-forge
async_generator 1.10 py_0 conda-forge
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
aws-c-cal 0.5.11 hd2e2f4b_0 conda-forge
aws-c-common 0.6.2 h0d85af4_0 conda-forge
aws-c-event-stream 0.2.7 hb9330a7_13 conda-forge
aws-c-io 0.10.5 h35aa462_0 conda-forge
aws-checksums 0.1.11 h0010a65_7 conda-forge
aws-sdk-cpp 1.8.186 h766a74d_3 conda-forge
babel 2.9.1 pyh44b312d_0 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
bleach 4.1.0 pyhd8ed1ab_0 conda-forge
bokeh 2.4.2 pypi_0 pypi
brotlipy 0.7.0 pypi_0 pypi
bzip2 1.0.8 h0d85af4_4 conda-forge
c-ares 1.18.1 h0d85af4_0 conda-forge
ca-certificates 2021.10.8 h033912b_0 conda-forge
certifi 2021.10.8 pypi_0 pypi
cffi 1.15.0 pypi_0 pypi
cfgv 3.3.1 pyhd8ed1ab_0 conda-forge
charset-normalizer 2.0.9 pyhd8ed1ab_0 conda-forge
click 8.0.3 pypi_0 pypi
cloudpickle 2.0.0 pyhd8ed1ab_0 conda-forge
colorama 0.4.4 pyh9f0ad1d_0 conda-forge
coverage 6.2 pypi_0 pypi
cryptography 36.0.1 pypi_0 pypi
cython 0.29.26 pypi_0 pypi
cytoolz 0.11.2 pypi_0 pypi
dask 2021.12.0 pyhd8ed1ab_0 conda-forge
dask-core 2021.12.0 pyhd8ed1ab_0 conda-forge
debugpy 1.5.1 pypi_0 pypi
decorator 5.1.0 pyhd8ed1ab_0 conda-forge
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
distlib 0.3.4 pyhd8ed1ab_0 conda-forge
distributed 2021.12.0 pypi_0 pypi
docutils 0.17.1 pypi_0 pypi
editdistance-s 1.0.0 pypi_0 pypi
entrypoints 0.3 pyhd8ed1ab_1003 conda-forge
filelock 3.4.2 pyhd8ed1ab_0 conda-forge
flit-core 3.6.0 pyhd8ed1ab_0 conda-forge
freetype 2.10.4 h4cff582_1 conda-forge
freezegun 1.1.0 pyhd8ed1ab_0 conda-forge
fsspec 2021.11.1 pyhd8ed1ab_0 conda-forge
gflags 2.2.2 hb1e8313_1004 conda-forge
glog 0.5.0 h25b26a9_0 conda-forge
great-expectations 0.13.49 pyha770c72_0 conda-forge
grpc-cpp 1.42.0 h6da9ac5_1 conda-forge
heapdict 1.0.1 py_0 conda-forge
identify 2.3.7 pyhd8ed1ab_0 conda-forge
idna 3.1 pyhd3deb0d_0 conda-forge
imagesize 1.3.0 pyhd8ed1ab_0 conda-forge
importlib-metadata 4.10.0 pypi_0 pypi
importlib_resources 5.4.0 pyhd8ed1ab_0 conda-forge
iniconfig 1.1.1 pyh9f0ad1d_0 conda-forge
ipykernel 6.6.1 pypi_0 pypi
ipython 7.30.1 pypi_0 pypi
ipython_genutils 0.2.0 py_1 conda-forge
ipywidgets 7.6.5 pyhd8ed1ab_0 conda-forge
jbig 2.1 h0d85af4_2003 conda-forge
jedi 0.18.1 pypi_0 pypi
jinja2 3.0.3 pyhd8ed1ab_0 conda-forge
jpeg 9d hbcb3906_0 conda-forge
jsonpatch 1.32 pyhd8ed1ab_0 conda-forge
jsonpointer 2.0 py_0 conda-forge
jsonschema 4.3.3 pyhd8ed1ab_0 conda-forge
jupyter-core 4.9.1 pypi_0 pypi
jupyter_client 7.1.0 pyhd8ed1ab_0 conda-forge
jupyter_core 4.9.1 py310h2ec42d9_1 conda-forge
jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge
jupyterlab_widgets 1.0.2 pyhd8ed1ab_0 conda-forge
jupytext 1.13.5 pyheef035f_0 conda-forge
kartothek 4.0.3 pyhd8ed1ab_1 conda-forge
krb5 1.19.2 hcfbf3a7_3 conda-forge
lcms2 2.12 h577c468_0 conda-forge
lerc 3.0 he49afe7_0 conda-forge
libblas 3.9.0 12_osx64_openblas conda-forge
libbrotlicommon 1.0.9 h0d85af4_6 conda-forge
libbrotlidec 1.0.9 h0d85af4_6 conda-forge
libbrotlienc 1.0.9 h0d85af4_6 conda-forge
libcblas 3.9.0 12_osx64_openblas conda-forge
libcurl 7.80.0 hf45b732_1 conda-forge
libcxx 12.0.1 habf9029_1 conda-forge
libdeflate 1.8 h0d85af4_0 conda-forge
libedit 3.1.20191231 h0678c8f_2 conda-forge
libev 4.33 haf1e3a3_1 conda-forge
libevent 2.1.10 h815e4d9_4 conda-forge
libffi 3.4.2 h0d85af4_5 conda-forge
libgfortran 5.0.0 9_3_0_h6c81a4c_23 conda-forge
libgfortran5 9.3.0 h6c81a4c_23 conda-forge
liblapack 3.9.0 12_osx64_openblas conda-forge
libnghttp2 1.43.0 h6f36284_1 conda-forge
libopenblas 0.3.18 openmp_h3351f45_0 conda-forge
libpng 1.6.37 h7cec526_2 conda-forge
libprotobuf 3.19.1 hcf210ce_0 conda-forge
libsodium 1.0.18 hbcb3906_1 conda-forge
libssh2 1.10.0 h52ee1ee_2 conda-forge
libthrift 0.15.0 hab56fdc_1 conda-forge
libtiff 4.3.0 hd146c10_2 conda-forge
libutf8proc 2.7.0 h0d85af4_0 conda-forge
libwebp-base 1.2.1 h0d85af4_0 conda-forge
libzlib 1.2.11 h9173be1_1013 conda-forge
llvm-openmp 12.0.1 hda6cdc1_1 conda-forge
locket 0.2.0 py_2 conda-forge
lz4-c 1.9.3 he49afe7_1 conda-forge
make 4.3 h22f3db7_1 conda-forge
markdown-it-py 1.1.0 pyhd8ed1ab_0 conda-forge
markupsafe 2.0.1 pypi_0 pypi
matplotlib-inline 0.1.3 pyhd8ed1ab_0 conda-forge
mdit-py-plugins 0.3.0 pyhd8ed1ab_0 conda-forge
milksnake 0.1.5 py_0 conda-forge
minimalkv 1.3.1 pyhd8ed1ab_1 conda-forge
mistune 0.8.4 pypi_0 pypi
more-itertools 8.12.0 pyhd8ed1ab_0 conda-forge
msgpack 1.0.3 pypi_0 pypi
msgpack-python 1.0.3 py310h2fea185_0 conda-forge
nbclient 0.5.9 pyhd8ed1ab_0 conda-forge
nbconvert 6.4.0 pypi_0 pypi
nbformat 5.1.3 pyhd8ed1ab_0 conda-forge
ncurses 6.2 h2e338ed_4 conda-forge
nest-asyncio 1.5.4 pyhd8ed1ab_0 conda-forge
nodeenv 1.6.0 pyhd8ed1ab_0 conda-forge
notebook 6.4.6 pyha770c72_0 conda-forge
numpy 1.22.0 pypi_0 pypi
numpydoc 1.1.0 py_1 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openjpeg 2.4.0 h6e7aa92_1 conda-forge
openssl 1.1.1l h0d85af4_0 conda-forge
orc 1.7.2 h84518c8_0 conda-forge
packaging 21.3 pyhd8ed1ab_0 conda-forge
pandas 1.5.0.dev0+11.g8c21dce69d dev_0
pandoc 2.16.2 h0d85af4_0 conda-forge
pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge
parquet-cpp 1.5.1 1 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
partd 1.2.0 pyhd8ed1ab_0 conda-forge
pbr 5.8.0 pyhd8ed1ab_1 conda-forge
pexpect 4.8.0 pyh9f0ad1d_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 8.4.0 pypi_0 pypi
pip 21.3.1 pyhd8ed1ab_0 conda-forge
pluggy 1.0.0 pypi_0 pypi
pre-commit 2.16.0 pypi_0 pypi
prometheus_client 0.12.0 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.24 pyha770c72_0 conda-forge
psutil 5.9.0 pypi_0 pypi
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
py 1.11.0 pyh6c4a22f_0 conda-forge
pyarrow 6.0.1 pypi_0 pypi
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pygments 2.11.1 pyhd8ed1ab_0 conda-forge
pyopenssl 21.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pyhd8ed1ab_1 conda-forge
pyrsistent 0.18.0 pypi_0 pypi
pysocks 1.7.1 pypi_0 pypi
pytest 6.2.5 pypi_0 pypi
pytest-cov 3.0.0 pyhd8ed1ab_0 conda-forge
pytest-mock 3.6.1 pyhd8ed1ab_0 conda-forge
python 3.10.1 h1248fe1_2_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-slugify 5.0.2 pyhd8ed1ab_0 conda-forge
python-tzdata 2021.5 pyhd8ed1ab_0 conda-forge
python-xxhash 2.0.2 py310he24745e_1 conda-forge
python_abi 3.10 2_cp310 conda-forge
pytz 2021.3 pyhd8ed1ab_0 conda-forge
pytz-deprecation-shim 0.1.0.post0 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 22.3.0 pypi_0 pypi
quantcore-thek 1.5.0.post6+gb4f8386.d20220106 dev_0
re2 2021.11.01 he49afe7_0 conda-forge
readline 8.1 h05e3726_0 conda-forge
requests 2.26.0 pyhd8ed1ab_1 conda-forge
ruamel-yaml 0.17.19 pypi_0 pypi
ruamel-yaml-clib 0.2.6 pypi_0 pypi
ruamel.yaml 0.17.19 py310he24745e_0 conda-forge
ruamel.yaml.clib 0.2.6 py310he24745e_0 conda-forge
scipy 1.7.3 pypi_0 pypi
send2trash 1.8.0 pyhd8ed1ab_0 conda-forge
setuptools 60.2.0 pypi_0 pypi
simplejson 3.17.6 pypi_0 pypi
simplekv 0.14.1 pyh9f0ad1d_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.8 hb1e8313_3 conda-forge
snowballstemmer 2.2.0 pyhd8ed1ab_0 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
sphinx 4.3.2 pyh6c4a22f_0 conda-forge
sphinx_rtd_theme 1.0.0 pyhd8ed1ab_0 conda-forge
sphinxcontrib-apidoc 0.3.0 py_1 conda-forge
sphinxcontrib-applehelp 1.0.2 py_0 conda-forge
sphinxcontrib-devhelp 1.0.2 py_0 conda-forge
sphinxcontrib-htmlhelp 2.0.0 pyhd8ed1ab_0 conda-forge
sphinxcontrib-jsmath 1.0.1 py_0 conda-forge
sphinxcontrib-qthelp 1.0.3 py_0 conda-forge
sphinxcontrib-serializinghtml 1.1.5 pyhd8ed1ab_1 conda-forge
sqlite 3.37.0 h23a322b_0 conda-forge
storefact 0.10.0 py_0 conda-forge
tabulate 0.8.9 pyhd8ed1ab_0 conda-forge
tblib 1.7.0 pyhd8ed1ab_0 conda-forge
termcolor 1.1.0 py_2 conda-forge
terminado 0.12.1 pypi_0 pypi
testpath 0.5.0 pyhd8ed1ab_0 conda-forge
text-unidecode 1.3 py_0 conda-forge
tk 8.6.11 h5dbffcc_1 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
tomli 2.0.0 pyhd8ed1ab_1 conda-forge
toolz 0.11.2 pyhd8ed1ab_0 conda-forge
tornado 6.1 pypi_0 pypi
tqdm 4.62.3 pyhd8ed1ab_0 conda-forge
traitlets 5.1.1 pyhd8ed1ab_0 conda-forge
typing_extensions 4.0.1 pyha770c72_0 conda-forge
tzdata 2021e he74cb21_0 conda-forge
tzlocal 4.1 pypi_0 pypi
unidecode 1.3.2 pyhd8ed1ab_0 conda-forge
uritools 4.0.0 pyhd8ed1ab_0 conda-forge
urllib3 1.26.7 pyhd8ed1ab_0 conda-forge
urlquote 1.1.4 pypi_0 pypi
virtualenv 20.4.7 pypi_0 pypi
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
widgetsnbextension 3.5.2 pypi_0 pypi
xxhash 2.0.2 pypi_0 pypi
xz 5.2.5 haf1e3a3_1 conda-forge
yaml 0.2.5 h0d85af4_2 conda-forge
zeromq 4.3.4 he49afe7_1 conda-forge
zict 2.0.0 py_0 conda-forge
zipp 3.6.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h9173be1_1013 conda-forge
zstandard 0.16.0 pypi_0 pypi
zstd 1.5.1 h582d3a0_0 conda-forge
```

xhochy commented 2 years ago

Is there anything that needs to be addressed regarding this in kartothek?

DamianBarabonkovQC commented 2 years ago

I have a hacky patch in filter_array_like that looks like:

    with np.errstate(invalid="ignore"):
        if op == "==":
            if pd.isnull(value):
                np.logical_and(pd.isnull(array_like), mask, out=out)
            else:
                res_eq = array_like == value
                np.logical_and(res_eq.fillna(False), mask, out=out)

Basically, it fills any NA in the comparison result with False before passing it on to np.logical_and.
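
For anyone who wants to try the idea outside kartothek, here is a self-contained sketch of the same approach. `eq_mask` is a hypothetical helper written for this comment, not kartothek's actual `filter_array_like`, and the `isinstance` guard is my addition so that plain NumPy inputs (which have no `fillna`) still work:

```python
import numpy as np
import pandas as pd


def eq_mask(array_like, value, mask):
    """Hypothetical stand-in for the "==" branch: rows equal to `value` AND in `mask`."""
    out = np.zeros(len(array_like), dtype=bool)
    if pd.isnull(value):
        # Comparing against a null value means "select the missing rows".
        np.logical_and(pd.isnull(array_like), mask, out=out)
    else:
        res_eq = array_like == value
        if isinstance(res_eq, pd.arrays.BooleanArray):
            # Nullable result: treat NA as "no match" and drop to plain bool.
            res_eq = res_eq.fillna(False).to_numpy(dtype=bool)
        np.logical_and(res_eq, mask, out=out)
    return out


boolean_array = pd.array([True, False, None], dtype="boolean")
all_rows = np.ones(len(boolean_array), dtype=bool)

print(eq_mask(boolean_array, False, all_rows))  # [False  True False]
print(eq_mask(boolean_array, None, all_rows))   # [False False  True]
```

Converting to a NumPy bool array before calling `np.logical_and` keeps the ufunc call purely NumPy-to-NumPy, so the result no longer depends on how a given pandas version dispatches ufuncs on nullable arrays.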