Closed behrica closed 1 year ago
repo is here: https://github.com/behrica/libpython-clj--182/ instructions to reproduce: https://github.com/behrica/libpython-clj--182/blob/main/reproduce.txt
By playing with the output buffering and using the "unbuffer" command, I was finally able to get a Python stack trace of the error:
Fatal Python error: take_gil: PyMUTEX_LOCK(gil->mutex) failed
Python runtime state: initialized
Thread 0x00007fda8e738700 (most recent call first):
File "/usr/local/lib/python3.9/threading.py", line 316 in wait
File "/usr/local/lib/python3.9/threading.py", line 574 in wait
File "/home/user/.local/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
File "/usr/local/lib/python3.9/threading.py", line 973 in _bootstrap_inner
File "/usr/local/lib/python3.9/threading.py", line 930 in _bootstrap
Current thread 0x00007fdb13dfd700 (most recent call first):
File "/home/user/.local/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 778 in list_directory_v2
File "/home/user/.local/lib/python3.9/site-packages/tensorflow/python/lib/io/file_io.py", line 749 in list_directory
File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/resolver.py", line 366 in <lambda>
File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/resolver.py", line 371 in atomic_download
File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/compressed_module_resolver.py", line 67 in __call__
File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/registry.py", line 51 in __call__
File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/module_v2.py", line 47 in resolve
File "/home/user/.local/lib/python3.9/site-packages/tensorflow_hub/module_v2.py", line 92 in load
File "/home/user/Top2Vec/top2vec/Top2Vec.py", line 878 in _check_model_status
File "/home/user/Top2Vec/top2vec/Top2Vec.py", line 361 in __init__
Thread 0x00007fdbabfa9700 (most recent call first):
File "/usr/local/lib/python3.9/threading.py", line 312 in wait
File "/usr/local/lib/python3.9/threading.py", line 574 in wait
File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 296 in start_thread
File "/usr/local/lib/python3.9/threading.py", line 910 in run
File "/usr/local/lib/python3.9/threading.py", line 973 in _bootstrap_inner
File "/usr/local/lib/python3.9/threading.py", line 930 in _bootstrap
Thread 0x00007fdbb358b740 (most recent call first):
File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 856 in fn
File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 892 in call
File "/home/user/.local/lib/python3.9/site-packages/javabridge/jutil.py", line 961 in method
File "/home/user/.local/lib/python3.9/site-packages/cljbridge.py", line 50 in __call__
File "/home/user/.local/lib/python3.9/site-packages/cljbridge.py", line 73 in resolve_call_fn
File "/home/user/.local/lib/python3.9/site-packages/cljbridge.py", line 161 in init_clojure_repl
File "<string>", line 1 in <module>
That is extremely promising.
A Google search of that error yields a lot of possible pathways for further steps.
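As an aside, a traceback dump like the one above can also be requested deterministically via the stdlib `faulthandler` module, instead of waiting for the fatal-error handler to fire (a general debugging suggestion on my part, not something from the repro itself):

```python
import faulthandler
import sys

# Install handlers so that fatal errors (SIGSEGV, SIGABRT, ...) dump the
# traceback of every Python thread to stderr, similar to the dump above.
faulthandler.enable()

# The same dump can also be triggered on demand at any point:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

It can also be switched on from outside the process with `PYTHONFAULTHANDLER=1` or `python -X faulthandler`, which combines well with the output-buffering experiments.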
Ok, if you say so.
I experimented a bit with various "output buffering" related things and the "progress bar" used in large downloads (as we see tqdm in one of the threads).
So I played with:
- `docker -ti` vs `docker`
- `TFHUB_DOWNLOAD_PROGRESS=0` vs `=1`
- `python` vs `unbuffer python`
- `python -u` vs `python`
- having the binaries preloaded vs not having them preloaded
Nothing got it working, but with some settings it went into a "deadlock" (i.e. hanging) instead of a core dump. I never saw the above stack trace again.
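For reference, most of those knobs can also be flipped from inside Python before the model download starts (a sketch; `PYTHONUNBUFFERED` and `TFHUB_DOWNLOAD_PROGRESS` are real env vars read by CPython and TF-Hub respectively, but whether they change the crash is exactly what the experiments above were probing):

```python
import os
import sys

# Equivalent of `python -u` / `unbuffer`: reduce buffering so the
# fatal-error trace reaches the terminal before the process dies.
os.environ["PYTHONUNBUFFERED"] = "1"          # affects child interpreters
sys.stdout.reconfigure(line_buffering=True)   # affects this process (3.7+)
sys.stderr.reconfigure(line_buffering=True)

# The TF-Hub progress-bar toggle from the experiments above.
os.environ["TFHUB_DOWNLOAD_PROGRESS"] = "0"
```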
Hmm. I think you can combine the stack traces - the one dumped and the above Python one - to get a fuller picture of what is going on.
Also, if we change the libpython-clj code to indicate which pathway it is taking through handling the GIL, that may illuminate some things. There may also be arguments to libjava.so that change how it handles signals, which may be interfering here a bit.
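One cheap way to check the signal-handling hypothesis from the Python side is to print which handlers are installed once the JVM is up (a hypothetical diagnostic, not a fix; JVM flags like `-Xrs`, which reduces the JVM's use of OS signals, may be the kind of argument meant above, though I'm not certain it applies here):

```python
import signal

# Print the handler currently installed for signals that both CPython and
# an embedded JVM typically care about; a non-default handler here shows
# who won the installation race.
for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGSEGV, signal.SIGABRT):
    print(f"{sig.name}: {signal.getsignal(sig)}")
```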
This appears to be the same issue as #194.
I wonder if something is initializing multiple interpreters and that is making take_gil behave differently. Honestly, the next step would be to carefully scan the code behind the Python checkGil pathway and then, with a debugger, put memory watches on the mutex addresses. Then load torch or this library and see what happens - my guess is the memory breakpoints will be hit quite a bit. Perhaps we can do this outside of libpython-clj, as the JVM itself causes quite a lot of annoying noise in the debugger.
In any case, closing this issue as it really is the same issue.
see discussion here: https://clojurians.zulipchat.com/#narrow/stream/215609-libpython-clj-dev/topic/core.20dump
I will provide a repro with a reproducible scenario soon.