clj-python / libpython-clj

Python bindings for Clojure
Eclipse Public License 2.0

libpython-clj blocks occasionally when using the python simpletransformers library #219

Closed · behrica closed this issue 1 year ago

behrica commented 2 years ago

It is a kind of follow-up of #194.

Similar code, same python library "simpletransformers"

But as I don't see a crash but a deadlock, I think it is better to open a new issue. Maybe they are related.

I have made a minimal reproducible scenario here, including the data.

https://github.com/behrica/libpython-clj-219

behrica commented 2 years ago

I would say it hangs in about 1 out of 10 runs.

I think I can increase the likelihood of hanging by enabling the progress printing of the model: https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L27

behrica commented 2 years ago

Not sure if using CUDA or not (https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L26) makes a difference; I saw the issue both with and without GPU usage.

behrica commented 2 years ago

I did not have the impression that changing the "amount" of training data had an influence. https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L19

It never hangs in "the middle of the training", but in the "beginning", it seems.

behrica commented 2 years ago

Running the run.sh should produce the deadlock reliably, I think: https://github.com/behrica/libpython-clj-219/blob/main/run.sh

behrica commented 2 years ago

I added a script which counts how often the invocation hangs. It uses the linux timeout command and logs the exit codes to a file. Exit code 0 -> training worked; exit code 137 -> it timed out after 60 seconds (for me it always finishes in around 30 s if it doesn't hang).

https://github.com/behrica/libpython-clj-219/blob/main/countHang.sh
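
Roughly, the loop does something like the following (not the actual countHang.sh, which is a bash script; just a Clojure-flavored sketch of the same idea, with the 60 s timeout and ./run.sh taken from the description above):

(require '[clojure.java.shell :refer [sh]])

;; run the training N times under the linux `timeout` command and count
;; the exit codes: 0 = finished normally, 137 = killed after 60 s (hang)
(let [exit-codes (repeatedly 10 #(:exit (sh "timeout" "60" "./run.sh")))]
  (println "exit codes:" (frequencies exit-codes)))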

behrica commented 2 years ago

With that I have "typically" 1 out of 10 failing.

behrica commented 2 years ago

The "issue" is far less frequent , when I disable "io-redirect" in libpython.clj have it enabled makes : 16 / 100 failed have it disabled : 1/ 100 failed

I was always suspecting that the issue is somehow related to the fact if and how the python processes print to console.
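
For reference, I believe the redirection can also be turned off when initializing from Clojure; something like the following (a sketch only; I am assuming initialize! accepts a :no-io-redirect? option, please verify against the current libpython-clj2 docs):

(require '[libpython-clj2.python :as py])

;; assumption: :no-io-redirect? skips redirecting python's stdout/stderr
;; into *out*/*err*; without it the redirection is active by default
(py/initialize! :no-io-redirect? true)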

cnuernber commented 2 years ago

If torch is using the multithreading module, that will definitely cause it. We know in embedded mode we see those error messages from it…

behrica commented 2 years ago

But is there anything we can do about it? Is the io-redirect maybe "too simple"? I saw some blog posts about redirecting stdout from C extensions, and they are more involved than what we do. https://eli.thegreenplace.net/2015/redirecting-all-kinds-of-stdout-in-python/

cnuernber commented 2 years ago

I think redirecting stdout is a red herring - turning off torch multithreading is, I think, the key.

cnuernber commented 2 years ago

If we want to be bold we could perhaps provide our own implementation of the threading module.

cnuernber commented 2 years ago

Actually, in embedded mode with manual gil locking you should not see any issues like this. The multithreading module works in embedded mode. And you are right, the io redirection we do is at the python library level, not the C level. But again, I don't think that is why things are hanging, although if a buffer is getting full and not flushed, maybe it is.

Remember, in embedded mode you don't need to acquire/release the GIL at all. The python layer above you has done all of this, so really in embedded mode the manual gil pathways should be the norm.

behrica commented 2 years ago

Ok, I'll try that.

I disabled torch multithreading as well; this seems to cause issues even in some python environments, like Colab.

What is the proper way to use the manual gil pathway from Clojure? Calling the java_api, or just setting the property to enable it?

cnuernber commented 2 years ago

From pure clojure - you have to have the manual_gil property set when the python.ffi namespace is first compiled. The only reliable way I found is to start the VM with the property enabled. Then just use the ffi namespace functions instead of the java api functions. The initialization is the same either way.

Be sure to check the variable - ffi/manual-gil and make sure it is true. That tells you how the namespace was compiled.
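
For example, if the JVM is started via tools.deps, one way to make sure the property is in place before the ffi namespace is compiled could be an alias with :jvm-opts (a sketch; the alias name is made up, and the setup below uses _JAVA_OPTIONS instead):

;; deps.edn (sketch): the property must be set before
;; libpython-clj2.python.ffi is first compiled
{:aliases
 {:train {:jvm-opts ["-Dlibpython_clj.manual_gil=true"]}}}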

behrica commented 2 years ago

Ok, I am not sure if I am doing the right thing. I am not in a REPL situation; I load my clojure file like this now:

export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=true"
python -u -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="src/train.clj",mvn_local_repo="/home/carsten/.m2/repository")'

The train.clj looks now like this:

(ns train
  (:require
   [libpython-clj2.python.ffi :as ffi]
   [libpython-clj2.embedded :as embedded]))

(println "-------- manual-gil: " ffi/manual-gil)     ;; prints "true"
(embedded/initialize!)
(ffi/lock-gil)

;;do my training  with simpletransformer
;; disable all multiprocessor as much as I can via model_args

(ffi/unlock-gil)
(println "training finished")

The training started, let's see.

cnuernber commented 2 years ago

You do not need the lock/unlock pairs. When embedded, the enclosing python already owns the gil.

behrica commented 2 years ago

Hmm. Without the (ffi/lock-gil) I get a crash:

RNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000555f60e8bcd7, pid=85946, tid=85946
#
# JRE version: OpenJDK Runtime Environment (11.0.16.1+1) (build 11.0.16.1+1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [python+0x12dcd7]  PyEval_GetGlobals+0x7
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/carsten/Dropbox/sources/predict_disaster_tweet/core.85946)
#
# An error report file with more information is saved as:
# /home/carsten/Dropbox/sources/predict_disaster_tweet/hs_err_pid85946.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp

behrica commented 2 years ago

If I leave it in, training starts and finishes, but the JVM process never ends. It hangs forever on (ffi/unlock-gil)

cnuernber commented 2 years ago

ffi/unlock-gil takes the return value of ffi/lock-gil.
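
So the pairing looks roughly like this (a sketch; the try/finally wrapper is my own addition to make sure the GIL state is always restored, it is not required by the API):

(require '[libpython-clj2.python.ffi :as ffi])

;; lock-gil returns the previous GIL state; unlock-gil needs that value
;; back in order to restore it
(let [gil-state (ffi/lock-gil)]
  (try
    ;; ... run the python work here ...
    (finally
      (ffi/unlock-gil gil-state))))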

cnuernber commented 2 years ago

The crash means that lock-gil must be necessary. That is really confusing -- does the code in clojurebridge unlock the gil?

cnuernber commented 2 years ago

It also means our exception handling is poor as exceptions escaping the context are quietly swallowed.
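
Until that is fixed, a workaround could be to wrap the top level of the loaded clj file so that escaping exceptions are at least printed (a sketch, my own suggestion rather than anything clojurebridge provides):

;; print anything that escapes the top level of the file, since the
;; surrounding context otherwise swallows it silently
(try
  ;; ... training code ...
  (catch Throwable t
    (.printStackTrace t)))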

behrica commented 2 years ago

My call to "initalize" created the problem. (as it was initialized already in clojurebridge). But I need to look / unlock , then it works.

(ns train
  (:require
   [libpython-clj2.python.ffi :as ffi]
   [libpython-clj2.embedded :as embedded]))

(println "-------- manual-gil: " ffi/manual-gil)     ;; prints "true"

(def x (ffi/lock-gil))

;;do my training  with simpletransformer
;; disable all multiprocessor as much as I can via model_args

(ffi/unlock-gil x)
(println "training finished")

behrica commented 2 years ago

????

If I write (ffi/unlock-gil) without the old state, there is no compile error, but it hangs ..... How can that be? The function has arity 1 ....

behrica commented 2 years ago

I only get the exception in the REPL, but not when executing the clj file:

 Unhandled clojure.lang.ArityException
   Wrong number of args (0) passed to: libpython-clj2.python.ffi/unlock-gil

                  AFn.java:  429  clojure.lang.AFn/throwArity
                  AFn.java:   28  clojure.lang.AFn/invoke
                      REPL:   94  train/eval35092
                      REPL:   94  train/eval35092
             Compiler.java: 7181  clojure.lang.Compiler/eval
             Compiler.java: 7136  clojure.lang.Compiler/eval
                  core.clj: 3202  clojure.core/eval
                  core.clj: 3198  clojure.core/eval
    interruptible_eval.clj:   87  nrepl.middleware.interruptible-eval/evaluate/fn/fn
                  AFn.java:  152  clojure.lang.AFn/applyToHelper
                  AFn.java:  144  clojure.lang.AFn/applyTo
                  core.clj:  667  clojure.core/apply

cnuernber commented 2 years ago

Right - we need to fix that, but that is a separate issue. It escapes the clojurebridge context somehow and isn't reported.

behrica commented 2 years ago

yes... it seems so.

behrica commented 2 years ago

Ok, my training works now with a small model and little data; I will redo it with the large model and full data.

behrica commented 2 years ago

I can confirm the stability for larger data as well.

behrica commented 2 years ago

I adapted the code here: https://github.com/behrica/libpython-clj-219

to use "embed" and it runs indeed stable. 0 / 100 failures

I will repeat it with more data and larger model, but it looks promising already.

behrica commented 2 years ago

I opened 2 issues in clojurebridge, which should improve the experience of using "embedded libpython-clj with manual GIL management" https://github.com/behrica/clojurebridge/issues/4 https://github.com/behrica/clojurebridge/issues/2

behrica commented 2 years ago

But we should also have more documentation on it.

The current docs of embedded also focus on the "repl" scenario (while I run a clj file), and do not mention "manual GIL" either. And it is not very clear that manual GIL management is also usable without the Java API for libpython-clj.

behrica commented 2 years ago

The issue is solved. :smile: Doing these 3 things:

  1. Embedded
  2. manual GIL
  3. configure simpletransformers for "singlecore"

makes the training run rock solid using libpython-clj; a rough sketch of the resulting setup is below.

100 / 100 non-trivial trainings worked (taking a couple of hours on a Tesla GPU)
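
For reference, the overall shape of the working setup (a rough sketch only; the actual simpletransformers model args for single-process operation are not shown here, check the simpletransformers docs for the exact option names):

(ns train
  (:require
   [libpython-clj2.python :as py]
   [libpython-clj2.python.ffi :as ffi]))

;; 1. embedded: the JVM is started from python via clojurebridge, which has
;;    already initialized libpython-clj, so no embedded/initialize! here
;; 2. manual GIL: requires -Dlibpython_clj.manual_gil=true at JVM startup
(assert ffi/manual-gil)

(def gil-state (ffi/lock-gil))

;; 3. configure simpletransformers for single-process operation via its
;;    model args, then run the training, e.g. with py/import-module and
;;    py/call-attr

(ffi/unlock-gil gil-state)
(println "training finished")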

cnuernber commented 2 years ago

I hesitate to ask this, but given 1 and 2, is 3 necessary?

behrica commented 2 years ago

I am not 100% sure if it is needed. I will test it. I just saw comments that multiprocessing causes issues even in certain python environments, for example running on Colab, resulting in "hanging".

behrica commented 2 years ago

I think 3) is not needed. My tests failed 2 / 100.

With or without 3) it is the same.

I will consider it stable at this point in time; probably pure python has the same error rate... Maybe I will write a python equivalent one day.

We should just properly document the correct usage of embedded and manual GIL together.