I would say it hangs in 1 out of 10 runs.
I think I can increase the likelihood of hanging by enabling the progress printing of the model: https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L27
I am not sure whether using CUDA or not (https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L26) makes a difference; I saw the issue both with and without GPU usage.
Changing the "amount" of training data did not seem to have an influence: https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L19
It never hangs in the middle of the training; it seems to always hang at the beginning.
Running run.sh should reproduce the deadlock reliably, I think:
https://github.com/behrica/libpython-clj-219/blob/main/run.sh
I added a script which counts how often the invocation hangs:
https://github.com/behrica/libpython-clj-219/blob/main/countHang.sh
It uses the Linux timeout command and logs the exit codes to a file:
exit code 0 -> the training worked
exit code 137 -> it timed out after 60 seconds (137 = 128 + 9, i.e. the process was killed by SIGKILL; for me it always finishes in around 30 s when it does not hang)
With that I typically have 1 out of 10 runs failing.
The "issue" is far less frequent , when I disable "io-redirect" in libpython.clj have it enabled makes : 16 / 100 failed have it disabled : 1/ 100 failed
I was always suspecting that the issue is somehow related to the fact if and how the python processes print to console.
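For reference, a minimal sketch of turning the redirect off at initialization time. This assumes libpython-clj2.python/initialize! supports a :no-io-redirect? option; the option name here is from memory, so verify it against the docstring:

(require '[libpython-clj2.python :as py])
;; initialize python without installing libpython-clj's stdout/stderr
;; redirect; the option name is an assumption - check the initialize! docstring
(py/initialize! :no-io-redirect? true)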
If torch is using the multithreading module, that will definitely cause it. We know in embedded mode we see those error messages from it…
But is there anything we can do about it? Is the io-redirect maybe "too simple"? I saw some blog posts about redirecting stdout from C extensions, and they are more involved than what we do. https://eli.thegreenplace.net/2015/redirecting-all-kinds-of-stdout-in-python/
I think redirecting stdout is a red herring - turning off torch multithreading is, I think, the key.
If we want to be bold we could perhaps provide our own implementation of the threading module.
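For example, a minimal sketch of turning torch's own threading down from Clojure (torch.set_num_threads is PyTorch's documented API; pinning it to 1 thread is just illustrative):

(require '[libpython-clj2.python :as py])
;; shrink torch's intra-op thread pool to a single thread
(let [torch (py/import-module "torch")]
  (py/call-attr torch "set_num_threads" 1))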
Actually, in embedded mode with manual gil locking you should not see any issues like this. The multithreading module works in embedded mode. And you are right, the io redirection we do is at the python library level, not the C level. But again, I think that is not why things are hanging, although if a buffer is getting full and not flushed, maybe it is.
Remember, in embedded mode you don't need to acquire/release the GIL at all. The python layer above you has done all of this, so really, in embedded mode the manual gil pathways should be the norm.
Ok, I'll try that.
I disabled torch multithreading as well; it seems to cause issues even in some Python environments, like Colab.
What is the proper way to use the manual gil pathway from Clojure? Calling the java_api, or just setting the property to enable it?
From pure Clojure - you have to have the manual_gil property set when the python.ffi namespace is first compiled. The only reliable way I found is to start the VM with the property enabled. Then just use the ffi namespace functions instead of the java_api functions. The initialization is the same either way.
Be sure to check the variable ffi/manual-gil and make sure it is true. That tells you how the namespace was compiled.
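A minimal sketch of that startup check (the property and var names are the ones from this thread; the assert is just one way to fail fast):

;; start the JVM with -Dlibpython_clj.manual_gil=true, then:
(require '[libpython-clj2.python.ffi :as ffi])
;; fail fast if python.ffi was compiled without manual gil support
(assert ffi/manual-gil "start the JVM with -Dlibpython_clj.manual_gil=true")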
Ok, I am not sure if I am doing the right thing. I am not in a REPL situation; I load my Clojure file like this now:
export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=true"
python -u -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="src/train.clj",mvn_local_repo="/home/carsten/.m2/repository")'
train.clj now looks like this:
(ns train
  (:require
   [libpython-clj2.python.ffi :as ffi]
   [libpython-clj2.embedded :as embedded]))

(println "-------- manual-gil: " ffi/manual-gil) ;; prints "true"

(embedded/initialize!)
(ffi/lock-gil)
;; do my training with simpletransformers
;; disable all multiprocessing as much as I can via model_args (see sketch below)
(ffi/unlock-gil)
(println "training finished")
The training started, let's see.
You do not need the lock/unlock pairs. When embedded, the enclosing python already owns the gil.
Hmm. Without the (ffi/lock-gil) I get a crash:
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000555f60e8bcd7, pid=85946, tid=85946
#
# JRE version: OpenJDK Runtime Environment (11.0.16.1+1) (build 11.0.16.1+1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C [python+0x12dcd7] PyEval_GetGlobals+0x7
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/carsten/Dropbox/sources/predict_disaster_tweet/core.85946)
#
# An error report file with more information is saved as:
# /home/carsten/Dropbox/sources/predict_disaster_tweet/hs_err_pid85946.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
If I leave it in, training starts and finishes, but the JVM process never ends. It hangs forever on (ffi/unlock-gil).
ffi/unlock-gil takes the return value of ffi/lock-gil.
The crash means that lock-gil must be necessary. That is really confusing -- does the code in clojurebridge unlock the gil?
It also means our exception handling is poor as exceptions escaping the context are quietly swallowed.
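Given that, a defensive sketch is to pair the calls in try/finally, handing unlock-gil the return value of lock-gil as described, so an escaping exception still restores the gil state:

(let [gil-state (ffi/lock-gil)]
  (try
    ;; ... python interop work ...
    (finally
      ;; runs even if the body throws
      (ffi/unlock-gil gil-state))))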
My call to initialize! created the problem (it was already initialized in clojurebridge). But I do need to lock/unlock; then it works.
(ns train
  (:require
   [libpython-clj2.python.ffi :as ffi]
   [libpython-clj2.embedded :as embedded]))

(println "-------- manual-gil: " ffi/manual-gil) ;; prints "true"

(def x (ffi/lock-gil))
;; do my training with simpletransformers
;; disable all multiprocessing as much as I can via model_args
(ffi/unlock-gil x)
(println "training finished")
????
If I write (ffi/unlock-gil) without the old state, there is no compile error, but it hangs...
How can that be? The function has arity 1...
I only get the exception in the REPL, but not when executing the clj file:
Unhandled clojure.lang.ArityException
Wrong number of args (0) passed to: libpython-clj2.python.ffi/unlock-gil
AFn.java: 429 clojure.lang.AFn/throwArity
AFn.java: 28 clojure.lang.AFn/invoke
REPL: 94 train/eval35092
REPL: 94 train/eval35092
Compiler.java: 7181 clojure.lang.Compiler/eval
Compiler.java: 7136 clojure.lang.Compiler/eval
core.clj: 3202 clojure.core/eval
core.clj: 3198 clojure.core/eval
interruptible_eval.clj: 87 nrepl.middleware.interruptible-eval/evaluate/fn/fn
AFn.java: 152 clojure.lang.AFn/applyToHelper
AFn.java: 144 clojure.lang.AFn/applyTo
core.clj: 667 clojure.core/apply
Right - we need to fix that, but that is a separate issue. It escapes the clojurebridge context somehow and isn't reported.
yes... it seems so.
Ok, my training works now with a small model and little data. I will redo it with the large model and full data.
I can confirm the stability for larger data as well.
I adapted the code here: https://github.com/behrica/libpython-clj-219
to use "embedded" and it indeed runs stable: 0 / 100 failures.
I will repeat it with more data and a larger model, but it looks promising already.
I opened 2 issues in clojurebridge which should improve the experience of using "embedded libpython-clj with manual GIL management": https://github.com/behrica/clojurebridge/issues/4 https://github.com/behrica/clojurebridge/issues/2
But we should also have more documentation on it.
The current documentation of embedded focuses on the "repl" scenario (while I run a clj file), and does not mention "manual GIL" either. And it is not very clear that manual GIL management is also usable without the Java API for libpython-clj.
The issue is solved. :smile: Doing these 3 things:
1) running embedded (loading the clj file from Python via clojurebridge)
2) using manual GIL management (libpython_clj.manual_gil=true plus ffi/lock-gil / ffi/unlock-gil)
3) disabling multiprocessing in the simpletransformers model args
makes the training run rock solid using libpython-clj.
100 / 100 non-trivial trainings worked (taking a couple of hours on a Tesla GPU).
I hesitate to ask this, but given 1) and 2), is 3) necessary?
I am not 100% sure if it is needed. I will test it. I just saw comments that multiprocessing causes issues even in certain Python environments, for example running on Colab, resulting in "hanging".
I think 3) is not needed. My tests failed 2 / 100, the same with and without 3).
I will consider it stable at this point in time; probably pure Python has the same error rate... Maybe I'll write a Python equivalent one day.
We should just properly document the correct usage of embedded and manual GIL together.
It is a kind of follow-up to #194: similar code, same Python library ("simpletransformers"). But as I don't see a crash but a deadlock, I think it is better to open a new issue. Maybe they are related.
I have made a minimal reproducible scenario here, including the data:
https://github.com/behrica/libpython-clj-219