clj-python / libpython-clj

Python bindings for Clojure
Eclipse Public License 2.0
1.05k stars 68 forks source link

It also means our exception handling is poor as exceptions escaping the context are quietly swallowed. #221

Closed behrica closed 1 year ago

behrica commented 1 year ago
    It also means our exception handling is poor as exceptions escaping the context are quietly swallowed.

Originally posted by @cnuernber in https://github.com/clj-python/libpython-clj/issues/219#issuecomment-1257295769

behrica commented 1 year ago

Just to not forget it.

I am bitten hardly by this now, as this type of error make the code hang, and no exception is printed. It can be easily reproduced y running in embed a script which tried to load a "non existing python module".

If I run it in "embedded mode", the below exception is not visible, and the python+JVM hangs for ever. I do see it, when running the same code in "non-embedded".

{:clojure.main/message
 "Syntax error compiling at (src/crash.clj:12:1).\nModuleNotFoundError: No module named 'not existing'\n\n",
 :clojure.main/triage
 {:clojure.error/phase :compile-syntax-check,
  :clojure.error/line 12,
  :clojure.error/column 1,
  :clojure.error/source "crash.clj",
  :clojure.error/path "src/crash.clj",
  :clojure.error/class java.lang.Exception,
  :clojure.error/cause
  "ModuleNotFoundError: No module named 'not existing'\n"},
 :clojure.main/trace
 {:via
  [{:type clojure.lang.Compiler$CompilerException,
    :message
    "Syntax error compiling at (/home/carsten/Dropbox/sources/predict_disaster_tweet/src/crash.clj:12:1).",
    :data
    {:clojure.error/phase :compile-syntax-check,
     :clojure.error/line 12,
     :clojure.error/column 1,
     :clojure.error/source
     "/home/carsten/Dropbox/sources/predict_disaster_tweet/src/crash.clj"},
    :at [clojure.lang.Compiler load "Compiler.java" 7652]}
   {:type java.lang.Exception,
    :message "ModuleNotFoundError: No module named 'not existing'\n",
    :at
    [libpython_clj2.python.ffi$check_error_throw
     invokeStatic
     "ffi.clj"
     708]}],
  :trace
  [[libpython_clj2.python.ffi$check_error_throw
    invokeStatic
    "ffi.clj"
    708]
   [libpython_clj2.python.ffi$check_error_throw invoke "ffi.clj" 706]
   [libpython_clj2.python$import_module invokeStatic "python.clj" 165]
   [libpython_clj2.python$import_module invoke "python.clj" 158]
   [train$eval14176 invokeStatic "crash.clj" 12]
   [train$eval14176 invoke "crash.clj" 12]
   [clojure.lang.Compiler eval "Compiler.java" 7181]
   [clojure.lang.Compiler load "Compiler.java" 7640]
   [clojure.lang.Compiler loadFile "Compiler.java" 7578]
   [clojure.main$load_script invokeStatic "main.clj" 475]
   [clojure.main$script_opt invokeStatic "main.clj" 535]
   [clojure.main$script_opt invoke "main.clj" 530]
   [clojure.main$main invokeStatic "main.clj" 664]
   [clojure.main$main doInvoke "main.clj" 616]
   [clojure.lang.RestFn applyTo "RestFn.java" 137]
   [clojure.lang.Var applyTo "Var.java" 705]
   [clojure.main main "main.java" 40]],
  :cause "ModuleNotFoundError: No module named 'not existing'\n",
  :phase :compile-syntax-check}}
behrica commented 1 year ago

I was not sure, if the fix would be here or in cljbridge, so I opened this before:

https://github.com/behrica/clojurebridge/issues/2

behrica commented 1 year ago

https://github.com/behrica/clojurebridge/issues/2 has some code snippets to reproduce it

behrica commented 1 year ago

@cnuernber Could you have a look at this ? I am not able to dig myself, too much native and macro magic for me.

Each time my code hangs, I am always wondering if there was a python exception, which I just don't see.

jjtolton commented 1 year ago

@behrica can you provide a little more context? "ModuleNotFoundError: No module named 'not existing' is a Python error, but 'not existing' is a very unusual module name. I feel like I would need more information to pinpoint why the error is not printing to the console, because I don't think I'd be able to reproduce this.

behrica commented 1 year ago

It happnes with "any exception" on python side, if using embedded mode and loading a script:

crash.clj;

(ns train
  (:require [libpython-clj2.python :refer [py.- py.] :as py]
            [libpython-clj2.python.ffi :as ffi]))
            ;; [libpython-clj2.require :refer [require-python]]
            ;; [libpython-clj2.python.gc]
(println "-------- manual-gil: " ffi/manual-gil)

(def locked (ffi/lock-gil))
(println :gil-locked)

(println :before-crash)
(py/import-module "not existing")  ;; whatever error on python side, like loading an non existing library
(println "crash fininished")

Load the file in embedded:

python -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="crash.clj",mvn_local_repo="/home/carsten/.m2/repository")'
behrica commented 1 year ago

It "hangs" forever without printing any error

cnuernber commented 1 year ago

This appears to me to be an issue with clojurebridge, not libpython-clj. Specifically javabridge's static call pathway must swallow exceptions.

Or I guess their make_method pathway.

behrica commented 1 year ago

It does not happen with normal Clojure / JVM exceptions, that why I am wondering.

cnuernber commented 1 year ago

Hmm. In that case something needs to check the python exception state and rethrow.

behrica commented 1 year ago

Using py/with-gil behaves "better" (and different depending on "manual gil or not")....

behrica commented 1 year ago

So this code:

(ns train
  (:require [libpython-clj2.python :refer [py.- py.] :as py]))

(py/with-gil
  (py/import-module "not existing"))
behrica commented 1 year ago

called like this:

export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=true";python -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="crash2.clj",mvn_local_repo="/home/carsten/.m2/repository")'

gives

ARNING: Please consider reporting this to the maintainers of tech.v3.datatype.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f92b2d87d7a, pid=38488, tid=38488
#
# JRE version: OpenJDK Runtime Environment (11.0.16.1+1) (build 11.0.16.1+1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libpython3.10.so.1.0+0x187d7a]  PyImport_Import+0x3a
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/carsten/Dropbox/sources/libpython-clj/core.38488)
#
# An error report file with more information is saved as:
# /home/carsten/Dropbox/sources/libpython-clj/hs_err_pid38488.log
behrica commented 1 year ago

and called like this

export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=false";python -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="crash2.clj",mvn_local_repo="/home/carsten/.m2/repository")'

gives an Java exception (which I could probably catch):

ARNING: All illegal access operations will be denied in a future release
Exception in thread "Thread-0" Syntax error macroexpanding at (/home/carsten/Dropbox/sources/libpython-clj/crash2.clj:5:1).
    at clojure.lang.Compiler.load(Compiler.java:7665)
    at clojure.lang.Compiler.loadFile(Compiler.java:7591)
    at clojure.lang.RT$3.invoke(RT.java:327)
    at clojure.lang.Var.invoke(Var.java:384)
Caused by: java.lang.Exception: ModuleNotFoundError: No module named 'not existing'

    at libpython_clj2.python.ffi$check_error_throw.invokeStatic(ffi.clj:708)
    at libpython_clj2.python.ffi$check_error_throw.invoke(ffi.clj:706)
    at libpython_clj2.python$import_module.invokeStatic(python.clj:165)
    at libpython_clj2.python$import_module.invoke(python.clj:158)
    at train$eval15837.invokeStatic(crash2.clj:6)
    at train$eval15837.invoke(crash2.clj:5)
    at clojure.lang.Compiler.eval(Compiler.java:7194)
    at clojure.lang.Compiler.load(Compiler.java:7653)
    ... 3 more
Sep 30, 2022 10:30:39 AM clojure.tools.logging$eval3239$fn__3242 invoke
INFO: Reference thread starting
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/carsten/.local/lib/python3.10/site-packages/clojurebridge/cljbridge.py", line 187, in load_clojure_file
    resolve_call_fn("clojure.core/load-file",
  File "/home/carsten/.local/lib/python3.10/site-packages/clojurebridge/cljbridge.py", line 73, in resolve_call_fn
    return resolve_fn(namespaced_fn_name)(*args)
  File "/home/carsten/.local/lib/python3.10/site-packages/clojurebridge/cljbridge.py", line 50, in __call__
    return invoker(*args)
  File "/home/carsten/.local/lib/python3.10/site-packages/javabridge/jutil.py", line 961, in method
    result = call(self.o, name, sig, *args)
  File "/home/carsten/.local/lib/python3.10/site-packages/javabridge/jutil.py", line 892, in call
    result = fn(*nice_args)
  File "/home/carsten/.local/lib/python3.10/site-packages/javabridge/jutil.py", line 859, in fn
    raise JavaException(x)
javabridge.jutil.JavaException: Syntax error macroexpanding at (/home/carsten/Dropbox/sources/libpython-clj/crash2.clj:5:1).

We can probably agree, that this is the expected behavior.

A JVM exception and neither "jvm crash" nor "jvm hanging".

behrica commented 1 year ago

And we have the correct behaviour in the REPL, in "Non-embedded"

behrica commented 1 year ago

We have as well correct behaviour "in embedded mode from the REPL".

The "load_clojure_file" makes the issue:

python -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="crash2.clj",mvn_local_repo="/home/carsten/.m2/repositor

This load_clojure_file pathway was added by me to clojurebridge, and it was therefore never tested by you, I suppose.

behrica commented 1 year ago

It is documented here: https://clj-python.github.io/libpython-clj/embedded.html

chapter: Loading and running a Clojure file in embedded mode

behrica commented 1 year ago

The new entry point is here: https://github.com/behrica/clojurebridge/blob/2c3c6e6c6d3d74f61aa2d68d7bf1c770d76b09cc/src/clojurebridge/cljbridge.py#L173

I think I do the "same" as the other entry point: https://github.com/behrica/clojurebridge/blob/2c3c6e6c6d3d74f61aa2d68d7bf1c770d76b09cc/src/clojurebridge/cljbridge.py#L142

behrica commented 1 year ago

To make it more confusing, there is an other pathway I made as well, which loads a fixed "user.clj" file.

This prints as well teh exception.....

python -c 'from clojurebridge import cljbridge;cljbridge.init_jvm(load_user_clj=True,mvn_local_repo="/home/carsten/.m2/repository")'

(I propose to later remove this load_user_clj pathway, as it is superseded by the load_clojure_file)

So lets ignore this for the moment,

I created: https://github.com/behrica/clojurebridge/issues/5

behrica commented 1 year ago

Let's concentrated on the "hanging" one, which is worst from a user experience point of view:

crash.clj

(ns train
  (:require [libpython-clj2.python :refer [py.- py.] :as py]
            [libpython-clj2.python.ffi :as ffi]))
            ;; [libpython-clj2.require :refer [require-python]]
            ;; [libpython-clj2.python.gc]
(println "-------- manual-gil: " ffi/manual-gil)

(def locked (ffi/lock-gil))
(println :gil-locked)

(println :before-crash)
(py/import-module "not existing")  ;; whatever error on python side, like loading an non existing library
(println "crash finished")

run as

export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=false";python -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="crash.clj",mvn_local_repo="/home/carsten/.m2/repository")'

results in hanging for ever

expected behaviour A nice JVM exception:

Exception in thread "Thread-0" Syntax error macroexpanding at (/home/carsten/Dropbox/sources/libpython-clj/crash.clj:5:1).
    at clojure.lang.Compiler.load(Compiler.java:7665)
    at clojure.lang.Compiler.loadFile(Compiler.java:7591)
    at clojure.lang.RT$3.invoke(RT.java:327)
    at clojure.lang.Var.invoke(Var.java:384)
Caused by: java.lang.Exception: ModuleNotFoundError: No module named 'not existing'
behrica commented 1 year ago

(same hanning happens without calling "lock/gil")

and independent of the "manual_gil") setting, so we get an even easier hanging szenario:

behrica commented 1 year ago

You might ask, why do I play with "manul_gil" and GIL locking. Because this makes python library simpletransformers work more reliable (proven by repeated executions and counting training sucecsses)

But by doing so, I experience now the issue we discuss. If I get an python exception in my code with "manuel GIL handling", it hangs.

As a "workaround" I now run it without manual GIL (to see if working on little data) and then switch to "with manual GIL" for train on full data, which is annoying to do.

Especially because the instability of simpletransformers might result in "hanging" as well (or JVM crash)

behrica commented 1 year ago

Even simpler code t reproduce:

Any exception thronw by pythonresult in "hanging"

(ns train
  (:require [libpython-clj2.python :refer [py.- py.] :as py]
            [libpython-clj2.python.ffi :as ffi]))

(def gil  (ffi/lock-gil))
(py/run-simple-string "invalid syntax !!!")
(ffi/unlock-gil gil)

clled like this:

export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=true"
python -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="crash.clj",mvn_local_repo="/home/carsten/.m2/repository")'
cnuernber commented 1 year ago

Well, I stand corrected :-). That definitely appears to be a bug in libpython-clj.

Perhaps the unlock-gil code needs check-error-throw. Hmm - that needs to happen nearly after every python call - it was handled by with-gil and it should still be happening.

behrica commented 1 year ago

I found a way to at least print the error, but it still hang

diff --git a/src/libpython_clj2/python/ffi.clj b/src/libpython_clj2/python/ffi.clj
index 773d1f9..ec46e55 100644
--- a/src/libpython_clj2/python/ffi.clj
+++ b/src/libpython_clj2/python/ffi.clj
@@ -706,7 +706,8 @@ Each call must be matched with PyGILState_Release"}
 (defn check-error-throw
   []
   (when-let [error-str (check-error-str)]
-    (throw (Exception. ^String error-str))))
+    (do (println :error error-str)
+        (throw (Exception. ^String error-str)))))
behrica commented 1 year ago

It gets as well solved by unlocking the gil in a finally

(try
  (py/run-simple-string "1/0")
  (finally
    (ffi/unlock-gil gil)))

I noticed already before, that in embedded mode and loading a clj file, the JVM does not terminate without the unlock

So the "hanging" we see is maybe the following:

The exception:

  1. is thrown
  2. not printed for whatever reason
  3. gil/unlock is not called
  4. code finishes 5, JVM hangs
behrica commented 1 year ago

so maybe it is simpler to find out and fix that in the following the VM does not terminate:

(ns train
  (:require [libpython-clj2.python :refer [py.- py.] :as py]
            [libpython-clj2.python.ffi :as ffi]))

(def gil  (ffi/lock-gil))

(py/run-simple-string "print('ok')")
behrica commented 1 year ago

I saw that in the hanging state some thread seems to wait forever:

Name: tech.resource.gc ref thread
State: TIMED_WAITING on java.lang.ref.ReferenceQueue$Lock@28f1bb74
Total blocked: 1  Total waited: 87,568

Stack trace: 
java.base@11.0.16.1/java.lang.Object.wait(Native Method)
java.base@11.0.16.1/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:155)
tech.v3.resource.gc$watch_reference_queue$fn__10795.invoke(gc.clj:27)
tech.v3.resource.gc$watch_reference_queue.invokeStatic(gc.clj:23)
tech.v3.resource.gc$watch_reference_queue.invoke(gc.clj:21)
tech.v3.resource.gc$start_reference_thread$fn__10807.invoke(gc.clj:47)
app//clojure.lang.AFn.run(AFn.java:22)
java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829)

not sure, if relevant

cnuernber commented 1 year ago

The java api has a try-with-resources pattern -

(with-open [glock (java_api/GILLocker)]
  ...)

That will automatically do the try-finally. Anything auto-closeable will be closed in a with-open macro.

cnuernber commented 1 year ago

There is a new API in latest (2.020) - with-manual-gil that will automatically lock/unlock the gil.

user> ffi/manual-gil
true
user> (py/with-manual-gil
        (py/run-simple-string "!! syntax error"))
Execution error at libpython-clj2.python.ffi/check-error-throw (ffi.clj:708).
  File "<string>", line 1
    !! syntax error
    ^
SyntaxError: invalid syntax
user> 

Outputs: