crash on take-jil - Githubissues

behrica commented 2 years ago

see here:

https://clojurians.zulipchat.com/#narrow/stream/215609-libpython-clj-dev/topic/JVM.20crash.20on.20'take.20gil'

behrica commented 2 years ago

What is true and worth to mention, is that for "serious" machine learning, with long running and a parameterized learning jobs, it is not feasible to use any form of REPL in any case. (neither interactive notebooks)

From my experience this requires always a more structured "experiment tracking", for which I have used tools like DVC. https://dvc.org/

And DVC requires that the training is a "script", which for Clojure translates to the clj file case, which is working fine with simple transformer

So maybe we need to "live" with the fact that certain python libraries are kind of impossible to be used from Clojure REPL via libpython-clj

I am not sure, if we should document this in some form. So that we expect that libpython-clj should work in general better in the non REPL situations (= clj files)

behrica commented 2 years ago

To be very clear on the former comment:

In my experience ML should and can start typically in the REPL as a "first try" (small /subset of data, fast trainings , standard hyper parameters) and maybe then we move (if having some form of success) into a "script".

So it is of course a pity (and requires understanding this and change of habits) that certain "models" (example simple transformer via libpython-clj) do not work in REPL at all. -> We needed to do the first experiments using scripts already.

behrica commented 2 years ago

Maybe we should start eventually maintain a list of "known problematic libraries" for libpython-clj. We know from the discussions above that libpython-clj is (even in embedded mode) is not the same as cpython .

behrica commented 2 years ago

At least such a list could "close" issues such as this one.

jjtolton commented 2 years ago

To be very clear on the former comment:

In my experience ML should and can start typically in the REPL as a "first try" (small /subset of data, fast trainings , standard hyper parameters) and maybe then we move (if having some form of success) into a "script".

So it is of course a pity (and requires understanding this and change of habits) that certain "models" (example simple transformer via libpython-clj) do not work in REPL at all.

-> We needed to do the first experiments using scripts already.

I wouldn't give up on the REPL experience so quickly. I promise you I'll still be using the REPL -- but just sometimes I'll be doing some work in the Python interpreter instead of the Clojure REPL if stability or thread safety is an issue.

That being said, it isn't well documented and it does require some proficiency in both toolsets. I can at least document it.

cnuernber commented 1 year ago

@jjtolton - The manual gil branch works reliably for me. IS that the case for you?

cnuernber commented 1 year ago

In fact I can train repeatedly and things are fine aside from the at_fork_reinit logging issues noted above:

Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'
Exception ignored in: <function _after_fork at 0x7f0173b02790>
Traceback (most recent call last):
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 1486, in _after_fork
    thread._reset_internal_locks(True)
  File "/opt/conda/envs/st/lib/python3.9/threading.py", line 829, in _reset_internal_locks
    self._tstate_lock._at_fork_reinit()
AttributeError: 'NoneType' object has no attribute '_at_fork_reinit'

I googled this and found some issues but I can't tell if the fix for those are in python 3.9 or not. Also it does appear like the multithreading module is in use which is extremely suspect.

cnuernber commented 1 year ago

@behrica - that branch, manual gil, should not work. The manual gil system variable is a compile time variable for libpython-clj. This means that your with-properties macro body should not work at all - check-gil should be being called all the time regardless. It is utterly confusing why this is repeatedly working from the REPL in this branch for me. Perhaps just chance.

-- and as I said that it crashed --

cnuernber commented 1 year ago

Getting the manual-gil setting correctly work from my normal repl makes everything work so far.

cnuernber commented 1 year ago

We should really see if this is robust across the board. We need to figure out how to tell javabridge to start the jvm with a system property set. You can see if manual gil management is enabled by checking the variable libpython-clj2.python.ffi/manual-gil. It wil lbe set to true if the library is compiled with manual gil management.

For some libraries it may just be the case that the automatic check-gil/capture-gil pathway that we use is just not or stable; there could be a race condition there which seems incredible to me but I can't think of another reason why this would be so crazy and infuriating.

behrica commented 1 year ago

Getting the manual-gil setting correctly work from my normal repl makes everything work so far.

I do agree, that by using manual gil management in the repl, and being "very careful", things work. I first though that "repeated training is a problem", but it is not the case.

So I can interactively use simple transformer, which is nice.

behrica commented 1 year ago

@jjtolton - The manual gil branch works reliably for me. IS that the case for you?

Only "one time".

If in the same repl sessions you call (load-file "simpletransformers.clj") twice, I get a crash.

But this is not due to "train twice", it is somethiong in the

(java_api/initialize nil)
(java_api/lockGIL)
(require '[libpython-clj2.require :as py-req]
         '[libpython-clj2.python.io-redirect]
         '[libpython-clj2.python :refer [py.- py.] :as py])

(py/from-import simpletransformers.classification ClassificationModel)
(py-req/require-python '[pandas :as pd])

;;do train

(java_api/unlockGIL 1)

which does not like to be called "twice" in the same repl session.

behrica commented 1 year ago

We should really see if this is robust across the board. We need to figure out how to tell javabridge to start the jvm with a system property set.

One workaround is to use this in the Docker or outside: ENV _JAVA_OPTIONS "-Dlibpython_clj.manual_gil=true"

The javabridge JVM picks it up

behrica commented 1 year ago

Getting the manual-gil setting correctly work from my normal repl makes everything work so far.

I do agree, that by using manual gil management in the repl, and being "very careful", things work. I first though that "repeated training is a problem", but it is not the case.

So I can interactively use simple transformer, which is nice.

But indeed, only with "manual GIL management".

With automatic GIL management , i can no get it working interactively.

With manual GIL management and being carefull (which means not to repeat the various imports and lock the GIL at least ones)

-> We need to pay attention on both (call lock-gil and not repeat the imports / requires)

Even with manual GIL management, I could bring the sessions in a "state" where the training would crash (by repeating the imports / requires, for example)

behrica commented 1 year ago

In a certain way this is satisfactory to me. Maybe we should document somewhere, that some python libraries work "better" interactively with manual GIL management and avoid to repeat "require-python and similar imports".

behrica commented 1 year ago

My current code mixes in some form as well the "java API" and the "clojure API". Can we uniform this ? So that Clojure supports as well fully the "manual GIL" management ?

cnuernber commented 1 year ago

Yes of course - the java api is in fact clojure and really just calls down into the clojure code.

The other python bindings from java lock python access to a specific thread. I state this in various places; it isn't helpful for REPL behavior really but if you have to run this in a production context then I think that is reasonable.

cnuernber commented 1 year ago

Hmm - I think there are a few parts to fix this

The first is clear documentation - I can write a new topic on exactly this.
Could we get a smaller program that always crashes regardless? If you just put (py/import-module "simpletransformers.classification") in a loop does that crash even with manual gil management? What about if (System/gc) followed by (gc/clear-reference-queue) to force module dereferences?
Another direction - could we make sure that require-python is a noop if it has already run and :reload isn't specified? I mean, isn't it already?

Perhaps libpython-clj itself in import-module should track modules and avoid releasing/reimporting them? OR perhaps the reference counting on import-module isn't correct and thus the simpletransformers module is getting completely released which may be something that would just never happen in normal python code.

behrica commented 1 year ago

Regarding 2) One of the issues is that lib-pythonclj behaves differently regarding this, depending on "how and where" you call it from.

clj file
repl
embeded libpython-clj with cljbridge.load_clojure_file(clj_file="simpletransformers.clj")
embeded libpython-clj via connected nrepl

behrica commented 1 year ago

I have now a "Deadlock" in the "train call", with larger data. Standard Clojure libpython-clj

Reproducible with my data, every time running as clj src/train.clj

cnuernber commented 1 year ago

With manual gil?

Most likely they are kicking off the multithreading module which is forking the process which is failing.

behrica commented 1 year ago

Maybe we are getting somewhere... This crashes reliably, without manual GIL.

(ns train
  (:require [libpython-clj2.python :refer [py.- py.] :as py]
            [libpython-clj2.require :refer [require-python]]
            [libpython-clj2.python.gc]))

(require-python '[simpletransformers.classification
                  :as classification])

(System/gc)
(libpython-clj2.python.gc/clear-reference-queue)

behrica commented 1 year ago

Depending if called in repl or command line, it sometimes hangs or crashes, but never runs through.

cnuernber commented 1 year ago

So there is another variable - object-reference-logging in the ffi namespace. Setting it to true should allow you to see the last object freed before the crash.

The require-python statement really loads a ton of objects and all of these then get released. My guess is if you did (py/import-module "simpletransformers.classification") that would not crash.

behrica commented 1 year ago

Changing to import-module `(def classification (py/import-module "simpletransformers.classification")) gives an exception:

21:24:06.111 [main] ERROR libpython-clj2.python.ffi - Exception while releasing object
java.lang.IllegalStateException: Recursive update
    at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1739)
    at libpython_clj2.python.ffi$pyobject_type_kwd.invokeStatic(ffi.clj:810)
    at libpython_clj2.python.ffi$pyobject_type_kwd.invoke(ffi.clj:807)
    at libpython_clj2.python.ffi$wrap_obj_ptr$fn__8977.invoke(ffi.clj:845)
    at libpython_clj2.python.gc$track$fn__8386.invoke(gc.clj:37)
    at tech.resource.GCReference.run(GCReference.java:25)
    at libpython_clj2.python.gc$clear_reference_queue.invokeStatic(gc.clj:52)
    at libpython_clj2.python.gc$clear_reference_queue.invoke(gc.clj:49)
    at train$eval14435.invokeStatic(train.clj:11)
    at train$eval14435.invoke(train.clj:11)
    at clojure.lang.Compiler.eval(Compiler.java:7181)
    at clojure.lang.Compiler.load(Compiler.java:7640)
    at clojure.lang.Compiler.loadFile(Compiler.java:7578)
    at clojure.main$load_script.invokeStatic(main.clj:475)
    at clojure.main$script_opt.invokeStatic(main.clj:535)
    at clojure.main$script_opt.invoke(main.clj:530)
    at clojure.main$main.invokeStatic(main.clj:664)
    at clojure.main$main.doInvoke(main.clj:616)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at clojure.lang.Var.applyTo(Var.java:705)
    at clojure.main.main(main.java:40)

but, it continues....

`

cnuernber commented 1 year ago

huh...what was released? IF the classification variable is bound then that cannot be what is released...I don't think this is the root of the issue but it is certainly a serious issue.

behrica commented 1 year ago

1:30:16.323 [main] INFO libpython-clj2.python.ffi - tracking object  - 0x7f36140ccdc0:3573:dict
21:30:16.323 [main] INFO libpython-clj2.python.ffi - tracking object  - 0x7f34fb76bd90:   6:type
21:30:16.323 [main] INFO libpython-clj2.python.ffi - tracking object  - 0x7f34e31bd3b0:   4:module
21:30:16.324 [main] INFO libpython-clj2.python.ffi - tracking object  - 0x7f34da3c2c70:   4:module
21:30:16.324 [main] INFO libpython-clj2.python.ffi - tracking object  - 0x7f34f57d9090:   4:module
21:30:16.379 [main] INFO libpython-clj2.python.ffi - releasing object - 0x7f34da3e8c10:   1::method-wrapper
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f35ae533f00, pid=158138, tid=158148
#
# JRE version: OpenJDK Runtime Environment (11.0.16.1+1) (build 11.0.16.1+1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libpython3.9.so+0x133f00]  _PyTrash_begin+0x0
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/carsten/Dropbox/sources/predict_disaster_tweet/core.158138)
#
# An error report file with more information is saved as:
# /home/carsten/Dropbox/sources/predict_disaster_tweet/hs_err_pid158138.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

cnuernber commented 1 year ago

I think the gil has to captured for clear-reference-queue. I don't think that is a method that captures the gil automatically.

behrica commented 1 year ago

yes, I did that. Then the above code (without training), runs.

behrica commented 1 year ago

Once I add the training to id, I get again all kind of issues, mainly hanging this time, not crashing.

behrica commented 1 year ago

threaddump-1664051627417.tdump.gz

behrica commented 1 year ago

I attached a thread dump while it is hanging.

behrica commented 1 year ago

I propose to close this issue. The "crashing" is solved by using libpython-clj with:

embedded mode
manual GIL management
configure python library to not use multiprocessing extensively (multiprocessing makes apparently as well problems in certain cpython runtimes such as Colab))

I keep here a discussion #219 about a very occasionally blocking and refer to some example code which is 99% working, as reference.

clj-python / libpython-clj

crash on take-jil #194