CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Fix failing tests with Pandas 1.3.0 (was: Add logic to deal with ABCIndexClass being renamed to ABCIndex) #218

Closed frreiss closed 2 years ago

frreiss commented 3 years ago

Pandas 1.3.x renamed its abstract base class for indexes from ABCIndexClass to ABCIndex, which is messing up some of our type checks. This PR adds some logic to work around that renaming. I'm using exceptions as control flow, which is a bit ugly, but the alternatives are also ugly.
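
For readers following along, a rename-tolerant lookup of this kind could be sketched roughly as below. This is only an illustration, not necessarily the code in this PR: the helper names import_first_available and is_pandas_index are hypothetical, and the lookup assumes the class lives in pandas.core.dtypes.generic, which is a private Pandas module. The old name is tried first because, as far as I can tell, older Pandas releases also define an ABCIndex with a narrower meaning.

```python
import importlib


def import_first_available(module_name, candidate_names):
    """Return the first attribute of `module_name` found among `candidate_names`.

    Hypothetical helper: uses exceptions as control flow to cope with a
    class that has different names in different library versions.
    """
    module = importlib.import_module(module_name)
    for name in candidate_names:
        try:
            return getattr(module, name)
        except AttributeError:
            # This version of the library does not use this name; try the next.
            continue
    raise ImportError(
        f"None of {tuple(candidate_names)} found in module '{module_name}'"
    )


def is_pandas_index(obj):
    """Type check that works under either name of the index base class.

    Assumption: the class is exposed by pandas.core.dtypes.generic.
    The pre-1.3 name is tried first, then the Pandas 1.3.x name.
    """
    index_base = import_first_available(
        "pandas.core.dtypes.generic", ("ABCIndexClass", "ABCIndex")
    )
    return isinstance(obj, index_base)
```

Resolving the class inside the function keeps the sketch importable without Pandas; real code would more likely resolve it once at module import time.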

frreiss commented 3 years ago

Okay, that didn't work. Will try something else.

frreiss commented 2 years ago

Fixed some bugs in SpanArray and TokenSpanArray that new regression tests in Pandas 1.3.0 brought to light. Now we're down to 62 failing tests.

frreiss commented 2 years ago

Update: Most of the test failures seem to be due to a bug in Pandas 1.3.0. Because of some performance optimizations introduced in https://github.com/pandas-dev/pandas/pull/40353, Pandas turns DataFrame.iloc[slice(x, y, z)] into a call to __getitem__((..., slice(x, y, z))) on the ExtensionArray that backs any column defined with an extension type.

Code to reproduce:

import pandas as pd
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    ExtensionScalarOpsMixin,
)

class MyExtensionDtype(ExtensionDtype):
    """Minimal extension dtype"""
    def __init__(self):
        pass

    @property
    def type(self):
        return int

    @property
    def name(self) -> str:
        return "MyExtensionDtype"

    @classmethod
    def construct_array_type(cls):
        # Return the array class itself, not an instance of it
        return MyExtensionArray

class MyExtensionArray(ExtensionArray, ExtensionScalarOpsMixin):
    """Minimal extension array that logs calls to __getitem__()"""
    @property
    def dtype(self):
        return MyExtensionDtype()

    def copy(self):
        return MyExtensionArray()

    def __len__(self):
        return 5

    def __getitem__(self, key):
        # Log the key exactly as Pandas passes it in
        print(f"__getitem__ called with key '{key}'")
        return 42

arr = MyExtensionArray()
df = pd.DataFrame({"a": arr})
# Slicing rows with iloc triggers the call to __getitem__() shown below
_ = df.iloc[:3]

which prints out:

__getitem__ called with key '(Ellipsis, slice(None, 3, None))'

It should print the following instead:

__getitem__ called with key 'slice(None, 3, None)'

I'll put in a workaround tomorrow and file a bug with Pandas.
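
A workaround along these lines could be sketched as follows. This is a guess at the general shape, not the actual change made in this PR: strip_leading_ellipsis and ListBackedArray are hypothetical names, and a plain-Python class stands in for a real ExtensionArray so that the sketch runs without Pandas. The idea is to unwrap keys of the form (Ellipsis, key) at the top of __getitem__() before any other processing.

```python
def strip_leading_ellipsis(key):
    """Unwrap keys of the form (Ellipsis, k) to plain k.

    Hypothetical helper: any other key is returned unchanged.
    """
    if isinstance(key, tuple) and len(key) == 2 and key[0] is Ellipsis:
        return key[1]
    return key


class ListBackedArray:
    """Plain-Python stand-in for an ExtensionArray (illustration only)."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        # Normalize the key first so the rest of the method only ever
        # sees the key shapes it was originally written for.
        key = strip_leading_ellipsis(key)
        if isinstance(key, slice):
            return ListBackedArray(self._values[key])
        return self._values[key]


arr = ListBackedArray([10, 20, 30, 40, 50])
normal = arr[:3]        # key is slice(None, 3, None)
wrapped = arr[..., :3]  # key is (Ellipsis, slice(None, 3, None))
```

With the normalization in place, both forms of the key select the same three elements, so the array behaves the same regardless of which form Pandas passes in.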

FYI @BryanCutler @ZachEichen @PokkeFe @Crushellini

frreiss commented 2 years ago

Update:

frreiss commented 2 years ago

Update: Fixed another minor bug. Now we are down to 49 failing tests.

frreiss commented 2 years ago

Update:

Now we're down to 2 failing tests.

frreiss commented 2 years ago

All tests passing against Pandas 1.3.0 now. Merging this PR to unblock other PRs.