apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Error repeating df.to_parquet in pytest: "pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined" #41857

Open bjfar opened 4 months ago

bjfar commented 4 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Python version: 3.10.14
pyarrow version: 16.1.0
pandas version: 2.2.2
pytest version: 8.2.1

I have some apparently niche circumstances that trigger the following error:

/home/benf/repos/tetra/python/tests/test_minimal.py:24: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/util/_decorators.py:333: in wrapper
    return func(*args, **kwargs)
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/core/frame.py:3113: in to_parquet
    return to_parquet(
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:476: in to_parquet
    impl = get_engine(engine)
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:63: in get_engine
    return engine_class()
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/io/parquet.py:169: in __init__
    import pandas.core.arrays.arrow.extension_types  # pyright: ignore[reportUnusedImport] # noqa: F401
/home/benf/micromamba/envs/tetra/lib/python3.10/site-packages/pandas/core/arrays/arrow/extension_types.py:59: in <module>
    pyarrow.register_extension_type(_period_type)
pyarrow/types.pxi:1954: in pyarrow.lib.register_extension_type
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined

pyarrow/error.pxi:91: ArrowKeyError
========================================================= short test summary info =========================================================
FAILED python/tests/test_minimal.py::test_pyarrow_issue_2 - pyarrow.lib.ArrowKeyError: A type extension with name pandas.period already defined

It seems to have something to do with how pytest orchestrates its tests. Here is my minimal example:

test_minimal.py

import pytest
import pandas as pd

pytest_plugins = ["pytester"]

def test_pyarrow_issue(testdir, tmp_path):
    path = str(tmp_path / "test.tar")
    df = pd.DataFrame()
    df.to_parquet(path)

def test_pyarrow_issue_2(testdir, tmp_path):
    path = str(tmp_path / "test_2.tar")
    df = pd.DataFrame()
    df.to_parquet(path)

Running pytest test_minimal.py then triggers the error.

Notably, the error does not occur if either test is run on its own, and it does not occur if the testdir fixture is removed or replaced with some other fixture. So I guess it has something to do with whatever testdir is doing under the hood, presumably with how pandas/pyarrow get imported.

In my real case I would really quite like to keep using the testdir fixture, though I can probably find a different way to do things. But nonetheless this behaviour seemed worth reporting. Not sure if it is a pyarrow issue though, or whether it is more of a pytest issue, or maybe even pandas.

Component(s)

Parquet, Python

jorisvandenbossche commented 3 months ago

@bjfar thanks for the report! This is a bit bizarre. When pandas' to_parquet is called for the first time, pandas calls pyarrow.register_extension_type(..) to register its extension types. That call sits at module level in a Python submodule of pandas, so I would expect Python's normal import caching to ensure the code runs only once, when the submodule is first imported.

But maybe that assumption does not hold in all cases, or the pytest fixture meddles with the import mechanism? In any case, if this should be guarded against, that needs to be done on the pandas side. Would you want to report it there?
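
To make the "only run once" assumption concrete, here is a minimal standard-library sketch of one way it could break. It assumes that something removes the submodule from sys.modules between the two tests, for example a fixture restoring an earlier snapshot of sys.modules. Whether testdir does this is an assumption here, not something confirmed in this thread. The modules fake_registry and fake_ext_types and the helper register_once are hypothetical stand-ins for pyarrow's registry, the pandas submodule, and a possible pandas-side guard; none of them is real pandas or pyarrow code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write two throwaway modules to a temp dir (both names are hypothetical).
tmp = Path(tempfile.mkdtemp())

# Stand-in for pyarrow's process-wide registry: duplicate names are an error.
(tmp / "fake_registry.py").write_text(
    "NAMES = set()\n"
    "def register(name):\n"
    "    if name in NAMES:\n"
    "        raise KeyError(f'A type extension with name {name} already defined')\n"
    "    NAMES.add(name)\n"
)

# Stand-in for the pandas submodule: it registers at import time.
(tmp / "fake_ext_types.py").write_text(
    "import fake_registry\n"
    "fake_registry.register('pandas.period')\n"
)

sys.path.insert(0, str(tmp))
importlib.invalidate_caches()

# First import runs the module body; the second is served from sys.modules.
importlib.import_module("fake_ext_types")
importlib.import_module("fake_ext_types")

# Simulate a fixture that drops modules imported during the first test.
# The registry module stays loaded, so its state survives.
del sys.modules["fake_ext_types"]

# The next import re-executes the module body and registers again.
try:
    importlib.import_module("fake_ext_types")
    error = None
except KeyError as exc:
    error = exc.args[0]

print(error)


def register_once(name):
    """Hypothetical guard: tolerate a name that is already registered."""
    import fake_registry

    try:
        fake_registry.register(name)
    except KeyError:
        pass


# With the guard, repeating the registration is harmless.
register_once("pandas.period")
```

If this is indeed the mechanism, a guard of this shape around the registration call would be the kind of pandas-side protection mentioned above. On the test side, importing the affected submodule before the fixture takes effect might also avoid the re-import, though that is untested here.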