clj-python / libpython-clj

Python bindings for Clojure
Eclipse Public License 2.0

libpython crashes with core dump #182

Closed behrica closed 1 year ago

behrica commented 2 years ago

See the discussion here: https://clojurians.zulipchat.com/#narrow/stream/215609-libpython-clj-dev/topic/core.20dump

I will provide a repo with a reproducible scenario soon.

behrica commented 2 years ago

The repo is here: https://github.com/behrica/libpython-clj--182/
Instructions to reproduce: https://github.com/behrica/libpython-clj--182/blob/main/reproduce.txt

behrica commented 2 years ago

By playing with the output buffering and using the "unbuffer" command, I was finally able to get a Python stack trace of the error:

 Fatal Python error: take_gil: PyMUTEX_LOCK(gil->mutex) failed
Python runtime state: initialized

Thread 0x00007fda8e738700 (most recent call first):
  File "/usr/local/lib/python3.9/threading.py", line 316 in wait
  File "/usr/local/lib/python3.9/threading.py", line 574 in wait
  File "/home/user/.local/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/local/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/usr/local/lib/python3.9/threading.py", line 930 in _bootstrap

Current thread 0x00007fdb13dfd700 (most recent call first):
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 778 in list_directory_v2
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 749 in list_directory
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/resolver.py", line 366 in <lambda>
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/resolver.py", line 371 in atomic_download
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/compressed_module_resolver.py", line 67 in __call__
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/registry.py", line 51 in __call__
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/module_v2.py", line 47 in resolve
  File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/module_v2.py", line 92 in load
  File "/home/user/Top2Vec/top2vec/Top2Vec.py", line 878 in _check_model_status
  File "/home/user/Top2Vec/top2vec/Top2Vec.py", line 361 in __init__

Thread 0x00007fdbabfa9700 (most recent call first):
  File "/usr/local/lib/python3.9/threading.py", line 312 in wait
  File "/usr/local/lib/python3.9/threading.py", line 574 in wait
  File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 296 in start_thread
  File "/usr/local/lib/python3.9/threading.py", line 910 in run
  File "/usr/local/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/usr/local/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fdbb358b740 (most recent call first):
  File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 856 in fn
  File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 892 in call
  File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 961 in method
  File "/home/user/.local/lib/python3.9/site-packages/cljbridge.py", line 50 in __call__
  File "/home/user/.local/lib/python3.9/site-packages/cljbridge.py", line 73 in resolve_call_fn
  File "/home/user/.local/lib/python3.9/site-packages/cljbridge.py", line 161 in init_clojure_repl
  File "<string>", line 1 in <module>
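
A dump like the one above can be lost when stderr is buffered or the process dies before flushing. As a minimal sketch (not something the thread confirms was used), Python's standard `faulthandler` module can write per-thread tracebacks straight to a file descriptor. The helper names `dump_all_threads` and `enable_crash_dumps` are hypothetical, chosen only for illustration:

```python
import faulthandler
import sys
import tempfile


def dump_all_threads() -> str:
    """Return a faulthandler dump of every Python thread's current stack."""
    # faulthandler writes through the raw file descriptor, so a real
    # temporary file is used here instead of an in-memory buffer.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


def enable_crash_dumps() -> None:
    """Ask Python to dump all thread stacks on fatal signals such as SIGSEGV.

    Assumption: sys.stderr is backed by a real file descriptor.
    """
    faulthandler.enable(file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    enable_crash_dumps()
    print(dump_all_threads())
```

Calling `enable_crash_dumps()` early at startup, combined with `python -u` (or the `unbuffer` command mentioned above), makes it more likely that the trace reaches the terminal before the process dies.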

cnuernber commented 2 years ago

That is extremely promising.

A Google search for that error yields a lot of possible pathways for further steps.

behrica commented 2 years ago

Ok, if you say so.

I experimented a bit with various "output buffering" related settings and with the "progress bar" shown during large downloads (since we see tqdm in one of the threads).

So I played with:
- docker -ti vs docker
- TFHUB_DOWNLOAD_PROGRESS = 0 vs = 1
- python vs unbuffer python
- python -u vs python
- having the binaries preloaded vs not having them preloaded

Nothing got it working, but with some settings it went into a "deadlock" (= hanging) instead of a core dump. I never saw the above stack trace again.
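
For anyone repeating these experiments, the progress-related toggles can also be set from Python before TensorFlow Hub is imported. This is only a sketch under assumptions: the helper `configure_download_progress` is hypothetical, and whether the tqdm monitor thread has anything to do with the crash is unknown (none of the settings above fixed it). It relies on tqdm's documented `monitor_interval` class attribute, where 0 disables the monitor thread:

```python
import os


def configure_download_progress(show_progress: bool = False) -> dict:
    """Set progress-related toggles; call before importing tensorflow_hub."""
    # TFHUB_DOWNLOAD_PROGRESS is the environment variable named in this thread.
    os.environ["TFHUB_DOWNLOAD_PROGRESS"] = "1" if show_progress else "0"

    monitor_disabled = False
    try:
        import tqdm  # optional; only present if tqdm is installed

        # A monitor_interval of 0 tells tqdm not to start its monitor thread
        # for progress bars created after this point.
        tqdm.tqdm.monitor_interval = 0
        monitor_disabled = True
    except ImportError:
        pass

    return {
        "TFHUB_DOWNLOAD_PROGRESS": os.environ["TFHUB_DOWNLOAD_PROGRESS"],
        "tqdm_monitor_disabled": monitor_disabled,
    }


if __name__ == "__main__":
    print(configure_download_progress(show_progress=False))
```

Keeping the toggles in one place makes it easier to record which combination produced a hang and which produced a core dump.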

cnuernber commented 2 years ago

Hmm. I think you can combine the stack traces, the one from the core dump and the Python one above, to get a fuller picture of what is going on.

Also, if we change the libpython-clj code to indicate which pathway it is taking through handling the GIL, that may illuminate some things. There may also be arguments to libjava.so that change how it handles signals, which may be interfering here a bit.

cnuernber commented 1 year ago

This appears to be the same issue as #194.

I wonder if something is initializing multiple interpreters and that is making take_gil behave differently. Honestly, the next step would be to carefully scan the code behind the Python checkGil pathway and then, with a debugger, put memory watches on the mutex addresses. Then load torch or this library and see what happens. My guess is the memory breakpoints will be hit quite a bit. Perhaps we can do this outside of libpython-clj, as the JVM itself causes quite a lot of annoying noise in the debugger.

In any case, closing this issue as it really is the same issue.