apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] RuntimeError when using pyarrow from a thread which is not joined before main exits when pandas is installed #35237

Open jleibs opened 1 year ago

jleibs commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

This is a relatively straightforward problem: a thread that is still running during interpreter shutdown tries to register an atexit handler.

This only happens if the pandas library is installed, which causes the associated shims to be used. It happens regardless of whether pandas is actually used by the application.

The problem can be avoided by joining all threads before main exits, but Python does not generally require this, so it should be considered a bug.

Context to reproduce:

requirements.txt

pandas==2.0.0
pyarrow==11.0.0

main.py

import threading
import pyarrow

def use_pyarrow() -> None:
    table = pyarrow.table({"a": [1, 2, 3]})

def main() -> None:
    t = threading.Thread(target=use_pyarrow, args=())
    t.start()

if __name__ == "__main__":
    main()

Run:

$ python main.py 
Traceback (most recent call last):
  File "pyarrow/pandas-shim.pxi", line 100, in pyarrow.lib._PandasAPIShim._check_import
  File "pyarrow/pandas-shim.pxi", line 48, in pyarrow.lib._PandasAPIShim._import_pandas
  File "/home/jleibs/pyarrow-repro/venv/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 24, in <module>
    import concurrent.futures.thread  # noqa
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 37, in <module>
    threading._register_atexit(_python_exit)
  File "/usr/lib/python3.10/threading.py", line 1504, in _register_atexit
    raise RuntimeError("can't register atexit after shutdown")
RuntimeError: can't register atexit after shutdown
Exception ignored in: 'pyarrow.lib._PandasAPIShim._have_pandas_internal'
Traceback (most recent call last):
  File "pyarrow/pandas-shim.pxi", line 100, in pyarrow.lib._PandasAPIShim._check_import
  File "pyarrow/pandas-shim.pxi", line 48, in pyarrow.lib._PandasAPIShim._import_pandas
  File "/home/jleibs/pyarrow-repro/venv/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 24, in <module>
    import concurrent.futures.thread  # noqa
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 37, in <module>
    threading._register_atexit(_python_exit)
  File "/usr/lib/python3.10/threading.py", line 1504, in _register_atexit
    raise RuntimeError("can't register atexit after shutdown")
RuntimeError: can't register atexit after shutdown
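
For reference, the join-based avoidance mentioned above can be sketched without pyarrow. This is a minimal stdlib-only sketch, assuming the lazy `import concurrent.futures.thread` shown in the traceback is what triggers the atexit registration; `lazy_import_worker` and `run_joined` are hypothetical names used only for illustration.

```python
import threading


def lazy_import_worker(errors: list) -> None:
    # Stand-in for the lazy import done by pyarrow's pandas shim (assumption:
    # this import is what ends up calling threading._register_atexit).
    try:
        import concurrent.futures.thread  # noqa: F401
    except RuntimeError as exc:
        errors.append(exc)


def run_joined() -> list:
    errors: list = []
    t = threading.Thread(target=lazy_import_worker, args=(errors,))
    t.start()
    # Joining before main returns means the import completes while the
    # interpreter is still fully alive, so atexit registration is allowed.
    t.join()
    return errors


if __name__ == "__main__":
    print(run_joined())
```

With the `t.join()` in place, no `RuntimeError` should be collected. This only illustrates the avoidance; it does not address the underlying issue that unjoined threads are otherwise legal in Python.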

Component(s)

Python

westonpace commented 1 year ago

As a workaround, does it work if you add:

from pyarrow.pandas_compat import _pandas_api

at the top level of your program (before main shuts down)?

This seems to be a transitive consequence of https://github.com/python/cpython/issues/86813#issuecomment-1246097184
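
A stdlib-only sketch of why an eager top-level import would help, assuming (as the traceback suggests) that the failing step is the first import of `concurrent.futures.thread` from the worker thread. The names `worker` and `main` below are illustrative and not taken from pyarrow.

```python
# Importing on the main thread, before any shutdown begins, lets the module
# register its atexit hook while registration is still permitted.
import concurrent.futures.thread  # noqa: F401
import sys
import threading


def worker(seen: list) -> None:
    # The module is already cached in sys.modules, so this import does not
    # re-run the module body and does not try to register an atexit hook.
    import concurrent.futures.thread  # noqa: F401
    seen.append("concurrent.futures.thread" in sys.modules)


def main():
    seen: list = []
    t = threading.Thread(target=worker, args=(seen,))
    t.start()
    # Deliberately not joined here, mirroring the original reproduction.
    return t, seen


if __name__ == "__main__":
    main()
```

The suggested `from pyarrow.pandas_compat import _pandas_api` presumably has the same effect, because importing `pyarrow.pandas_compat` runs its module-level `import concurrent.futures.thread` up front.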

AlenkaF commented 11 months ago

Just to note, I was able to reproduce the error with pandas==2.1.1 and a dev version of pyarrow (15.0.0.dev). The issue doesn't happen if I add the import at the top level of the program, as Weston suggested.

jorisvandenbossche commented 11 months ago

cc @pitrou

pitrou commented 11 months ago

This should be trivial to work around in PyArrow.