NREL / REopt_API

The model for the REopt API, which serves as the back end for the REopt Webtool (reopt.nrel.gov/tool) and can be accessed directly via the NREL Developer Network (https://developer.nrel.gov/docs/energy-optimization/reopt).

Jobs getting "stuck" in `Optimizing...` status #183

Closed NLaws closed 3 years ago

NLaws commented 3 years ago

Some jobs never finish solving, apparently because their Celery workers are lost, probably due to a Julia bug. The traceback on a local host is:

Assertion failed: (jl_is_method_instance(mi)), function emit_invoke, file /Users/julia/buildbot/worker/package_macos64/build/src/codegen.cpp, line 2770.

signal (6): Abort trap: 6
in expression starting at none:0
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 5928179621 (Pool: 5927020734; Big: 1158887); GC: 1084
[2021-01-27 20:48:14,519: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:24295 exited with 'signal 6 (SIGABRT)'
[2021-01-27 20:48:14,577: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT).',)
Traceback (most recent call last):
  File "/Users/nlaws/projects/reopt_api/env/lib/python3.6/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT).
celery.worker.request ERROR    request.py::on_failure line 540 Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT).',)
Traceback (most recent call last):
  File "/Users/nlaws/projects/reopt_api/env/lib/python3.6/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT).

We see something similar on the SUSE Linux servers.

The failing line in Julia matches https://github.com/JuliaLang/julia/issues/37694, which might be related to https://github.com/JuliaLang/julia/issues/35580. The latter issue appears to be fixed in Julia 1.6.0-DEV.1399, so hopefully we can fix this as soon as 1.6 is released.
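
The log above shows the worker dying from `signal 6 (SIGABRT)` before it can report anything, which is why the job status never changes. As a minimal sketch of one general way to make such a crash visible (not necessarily what this project did), the solve could run in a child process whose exit status the parent inspects. The helper name `run_isolated` and its return values are hypothetical, and the aborting child is only a stand-in for the Julia assertion failure.

```python
import signal
import subprocess
import sys


def run_isolated(cmd, timeout=None):
    """Run cmd in a child process and report how it ended.

    Returns (status, detail), where status is one of
    "ok", "error", "killed" or "timeout" (hypothetical labels).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout", "no result after {} s".format(timeout)

    if proc.returncode == 0:
        return "ok", proc.stdout
    if proc.returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return "killed", signal.Signals(-proc.returncode).name
    return "error", proc.stderr


if __name__ == "__main__":
    # Stand-in for the solver: a child that aborts, as a failed C assertion does.
    status, detail = run_isolated([sys.executable, "-c", "import os; os.abort()"])
    print(status, detail)
```

Because the parent survives the abort, it can write a failure status for the job instead of leaving it in `Optimizing...`. The negative return code convention applies to POSIX systems only.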

NLaws commented 3 years ago

On the SUSE servers the traceback is different, possibly because we are using a Julia system image(?) there:

python3: /buildworker/worker/package_linux64/build/src/codegen.cpp:3322: jl_cgval_t emit_invoke(jl_codectx_t&, jl_expr_t*, jl_value_t*): Assertion `(((jl_value_t*)(((jl_taggedvalue_t*)((char*)(mi) - sizeof(jl_taggedvalue_t)))->header & ~(uintptr_t)15))==(jl_value_t*)(jl_method_instance_type))' failed.

signal (6): Aborted
in expression starting at none:0
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__assert_fail_base at /lib64/libc.so.6 (unknown line)
__assert_fail at /lib64/libc.so.6 (unknown line)
emit_invoke at /buildworker/worker/package_linux64/build/src/codegen.cpp:3322
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4139
emit_ssaval_assign at /buildworker/worker/package_linux64/build/src/codegen.cpp:3851
emit_stmtpos at /buildworker/worker/package_linux64/build/src/codegen.cpp:4044 [inlined]
emit_function at /buildworker/worker/package_linux64/build/src/codegen.cpp:6671
jl_compile_linfo at /buildworker/worker/package_linux64/build/src/codegen.cpp:1257
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1890
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2154 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
map at ./abstractarray.jl:2098
container at /home/deploy/.julia/packages/JuMP/YXK4e/src/Containers/container.jl:85
container at /home/deploy/.julia/packages/JuMP/YXK4e/src/Containers/container.jl:65
unknown function (ip: 0x7f7aa6c43a68)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
macro expansion at /home/deploy/.julia/packages/JuMP/YXK4e/src/macros.jl:79 [inlined]
add_tech_size_constraints at /srv/data/apps/reopt_api/main/releases/20210114225712/reo/src/reopt.jl:478
reopt_run at /srv/data/apps/reopt_api/main/releases/20210114225712/reo/src/reopt.jl:858
reopt at /srv/data/apps/reopt_api/main/releases/20210114225712/reo/src/reopt.jl:812
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2159 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:643
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:693
#invokelatest#1 at ./essentials.jl:712
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:643
invokelatest at ./essentials.jl:711
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:643
_pyjlwrap_call at /home/deploy/.julia/packages/PyCall/zqDXB/src/callback.jl:28
unknown function (ip: 0x7f7aa6bca302)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
pyjlwrap_call at /home/deploy/.julia/packages/PyCall/zqDXB/src/callback.jl:49
jfptr_pyjlwrap_call_31754 at /srv/data/apps/reopt_api/main/releases/20210114225712/julia_envs/Xpress/JuliaXpressSysimage.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2323
jlcapi_pyjlwrap_call_31669 at /srv/data/apps/reopt_api/main/releases/20210114225712/julia_envs/Xpress/JuliaXpressSysimage.so (unknown line)
_PyObject_FastCallDict at /lib64/libpython3.6m.so.1.0 (unknown line)
unknown function (ip: 0x7f7ade0c2d5b)
_PyEval_EvalFrameDefault at /lib64/libpython3.6m.so.1.0 (unknown line)
unknown function (ip: 0x7f7ade0c1d76)
_PyFunction_FastCallDict at /lib64/libpython3.6m.so.1.0 (unknown line)
_PyObject_FastCallDict at /lib64/libpython3.6m.so.1.0 (unknown line)
_PyObject_Call_Prepend at /lib64/libpython3.6m.so.1.0 (unknown line)
PyObject_Call at /lib64/libpython3.6m.so.1.0 (unknown line)
_PyEval_EvalFrameDefault at /lib64/libpython3.6m.so.1.0 (unknown line)
unknown function (ip: 0x7f7ade0c1fa0)
_PyFunction_FastCallDict at /lib64/libpython3.6m.so.1.0 (unknown line)
_PyObject_FastCallDict at /lib64/libpython3.6m.so.1.0 (unknown line)

...
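
Both tracebacks end with the process being aborted, so nothing is left to update the job record. A minimal sketch of a periodic clean-up that flags jobs left in `Optimizing...` for too long is shown here. The `Job` class, its fields, the `fail_stale_jobs` function, and the replacement status text are all hypothetical and do not reflect the API's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    # Hypothetical job record; the real API stores jobs differently.
    job_id: str
    status: str
    started_at: datetime


def fail_stale_jobs(jobs, now, max_runtime):
    """Mark jobs stuck in 'Optimizing...' longer than max_runtime as failed.

    Returns the ids of the jobs that were changed.
    """
    changed = []
    for job in jobs:
        if job.status == "Optimizing..." and now - job.started_at > max_runtime:
            job.status = "Failed: worker lost"
            changed.append(job.job_id)
    return changed


if __name__ == "__main__":
    now = datetime(2021, 1, 27, 21, 0, 0)
    jobs = [
        Job("a", "Optimizing...", now - timedelta(hours=2)),
        Job("b", "Optimizing...", now - timedelta(minutes=5)),
        Job("c", "optimal", now - timedelta(hours=3)),
    ]
    print(fail_stale_jobs(jobs, now, max_runtime=timedelta(hours=1)))
```

A check like this treats the symptom only: it cannot tell a crashed worker from a slow solve, so `max_runtime` would need to exceed the longest legitimate solve time.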

NLaws commented 3 years ago

Addressed in production on March 17th with the move to the Rancher cluster. The fixes are in https://github.com/NREL/REopt_Lite_API/pull/198

jmpohl commented 3 years ago

@NLaws Excited to see this change! Seems like the best way forward. PyCall/pyJulia always seemed troublesome to work with.