JuliaLang / IJulia.jl

Julia kernel for Jupyter
MIT License

Jupyter Lab does not kill spawned workers and deallocate memory when shutdown is issued to Julia kernel #1067

Open davorh opened 1 year ago

davorh commented 1 year ago

The Julia kernel in Jupyter Lab does not kill spawned workers or deallocate their memory when a shutdown is issued.

I do not know whether this is an IJulia (i.e. Julia kernel) issue or some interplay between Jupyter Lab and the Julia kernel, but it is currently a huge productivity problem that is not related to the core Julia installation. The problem was tracked down to

using Distributed

in Jupyter Lab with IJulia kernel. Detailed info:

  1. Info on architecture:

The output of versioninfo():

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 88 × Intel(R) Xeon(R) Gold 6238T CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 88 virtual cores

  2. Installation procedure

Julia was installed by downloading

wget https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.5-linux-x86_64.tar.gz

into /opt/, unpacking it, and symlinking julia -> /opt/julia-1.8.5/bin/julia.

As a user, the latest Miniconda and Jupyter Lab were installed as documented (conda-forge etc.).

In Julia, the IJulia package was installed. The installed version of Jupyter Lab is 3.5.0; for IJulia we have "IJulia" => v"1.24.0".
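For reference, the installation described above can be sketched as a shell session. The symlink location (/usr/local/bin) is an assumption; the original report only gives the symlink target.

```shell
# Download and unpack Julia 1.8.5 into /opt/ (as described above)
cd /opt
wget https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.5-linux-x86_64.tar.gz
tar -xzf julia-1.8.5-linux-x86_64.tar.gz

# Symlink julia onto the PATH; the link location here is assumed
ln -s /opt/julia-1.8.5/bin/julia /usr/local/bin/julia
```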

  3. Test example:

Launch a notebook with the Julia 1.8.5 kernel, create a memory-intensive variable, spawn workers, and define the variable on all of them:

using Distributed
using LinearAlgebra
N=10_000
A=rand(N,N);
addprocs(9);
@everywhere begin
    N=10_000
    A=rand(N,N);
end

After kernel shutdown is issued, the workers stay alive as active processes, even if Jupyter Lab is closed.

To be more precise, the shutdown kills the main kernel, but the worker processes

/opt/julia/bin/julia -Cnative -J/opt/julia/lib/julia/sys.so -g1 --color=yes --bind-to 127.0.0.1 --worker

stay alive. As mentioned, this does not happen if I run the same code at the julia prompt in a terminal. Our computations utilise around 800 GB of RAM per run, so this represents a huge issue.
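Until the underlying cause is found, one possible mitigation (just a sketch; it will not help if the kernel process is killed with SIGKILL before exit hooks run) is to remove the workers explicitly before shutting down, or to register an atexit hook right after spawning them so any normal kernel exit also tears them down:

using Distributed

addprocs(9);

# Register a hook so a normal exit of the kernel process
# also removes the spawned workers:
atexit() do
    nprocs() > 1 && rmprocs(workers())
end

# Alternatively, tear the workers down by hand before
# issuing the kernel shutdown:
# rmprocs(workers())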

davorh commented 1 year ago

I did some more testing: Jupyter Lab 3.5.0 with Julia 1.8.5 on Windows 10 works fine. On another Linux machine with Jupyter Lab 2.1.0 and Julia 1.6, everything also works fine. On the same machine with Jupyter Lab 3.5.0 and Julia 1.6, we get the output listed at the end after the test code

using Distributed
using LinearAlgebra

addprocs(9);
@everywhere begin
   N=10_000
   A=rand(N,N);
end

was executed and kernel shutdown was issued (all workers were shut down correctly):

      From worker 5:    fatal: error thrown and no exception handler available.
      From worker 5:    InterruptException()
      From worker 5:    jl_mutex_unlock at /opt/julia/src/locks.h:134 [inlined]
      From worker 5:    jl_task_get_next at /opt/julia/src/partr.c:475
      From worker 7:    fatal: error thrown and no exception handler available.
      From worker 7:    InterruptException()
      From worker 7:    jl_mutex_unlock at /opt/julia/src/locks.h:134 [inlined]
      From worker 7:    jl_task_get_next at /opt/julia/src/partr.c:475
      From worker 10:   fatal: error thrown and no exception handler available.
      From worker 10:   InterruptException()
      From worker 10:   jl_mutex_unlock at /opt/julia/src/locks.h:134 [inlined]
      From worker 10:   jl_task_get_next at /opt/julia/src/partr.c:475
      From worker 9:    fatal: error thrown and no exception handler available.
      From worker 9:    InterruptException()
      From worker 9:    jl_mutex_unlock at /opt/julia/src/locks.h:134 [inlined]
      From worker 9:    jl_task_get_next at /opt/julia/src/partr.c:475
      From worker 6:    fatal: error thrown and no exception handler available.
      From worker 6:    InterruptException()
      From worker 6:    jl_mutex_unlock at /opt/julia/src/locks.h:134 [inlined]
      From worker 6:    jl_task_get_next at /opt/julia/src/partr.c:475
      From worker 8:    fatal: error thrown and no exception handler available.
      From worker 8:    InterruptException()
      From worker 8:    jl_mutex_unlock at /opt/julia/src/locks.h:134 [inlined]
      From worker 8:    jl_task_get_next at /opt/julia/src/partr.c:475
      From worker 10:   poptask at ./task.jl:755
      From worker 5:    poptask at ./task.jl:755
      From worker 6:    poptask at ./task.jl:755
      From worker 9:    poptask at ./task.jl:755
      From worker 7:    poptask at ./task.jl:755
      From worker 8:    poptask at ./task.jl:755
      From worker 10:   wait at ./task.jl:763 [inlined]
      From worker 10:   task_done_hook at ./task.jl:489
      From worker 9:    wait at ./task.jl:763 [inlined]
      From worker 9:    task_done_hook at ./task.jl:489
      From worker 7:    wait at ./task.jl:763 [inlined]
      From worker 7:    task_done_hook at ./task.jl:489
      From worker 6:    wait at ./task.jl:763 [inlined]
      From worker 6:    task_done_hook at ./task.jl:489
      From worker 5:    wait at ./task.jl:763 [inlined]
      From worker 5:    task_done_hook at ./task.jl:489
      From worker 8:    wait at ./task.jl:763 [inlined]
      From worker 8:    task_done_hook at ./task.jl:489
      From worker 10:   _jl_invoke at /opt/julia/src/gf.c:2237 [inlined]
      From worker 10:   jl_apply_generic at /opt/julia/src/gf.c:2419
      From worker 5:    _jl_invoke at /opt/julia/src/gf.c:2237 [inlined]
      From worker 5:    jl_apply_generic at /opt/julia/src/gf.c:2419
      From worker 9:    _jl_invoke at /opt/julia/src/gf.c:2237 [inlined]
      From worker 9:    jl_apply_generic at /opt/julia/src/gf.c:2419
      From worker 7:    _jl_invoke at /opt/julia/src/gf.c:2237 [inlined]
      From worker 7:    jl_apply_generic at /opt/julia/src/gf.c:2419
      From worker 6:    _jl_invoke at /opt/julia/src/gf.c:2237 [inlined]
      From worker 6:    jl_apply_generic at /opt/julia/src/gf.c:2419
      From worker 10:   jl_apply at /opt/julia/src/julia.h:1703 [inlined]
      From worker 10:   jl_finish_task at /opt/julia/src/task.c:208
      From worker 8:    _jl_invoke at /opt/julia/src/gf.c:2237 [inlined]
      From worker 8:    jl_apply_generic at /opt/julia/src/gf.c:2419
      From worker 10:   start_task at /opt/julia/src/task.c:850
      From worker 10:   unknown function (ip: (nil))
      From worker 5:    jl_apply at /opt/julia/src/julia.h:1703 [inlined]
      From worker 5:    jl_finish_task at /opt/julia/src/task.c:208
      From worker 9:    jl_apply at /opt/julia/src/julia.h:1703 [inlined]
      From worker 9:    jl_finish_task at /opt/julia/src/task.c:208
      From worker 6:    jl_apply at /opt/julia/src/julia.h:1703 [inlined]
      From worker 6:    jl_finish_task at /opt/julia/src/task.c:208
      From worker 7:    jl_apply at /opt/julia/src/julia.h:1703 [inlined]
      From worker 7:    jl_finish_task at /opt/julia/src/task.c:208
      From worker 9:    start_task at /opt/julia/src/task.c:850
      From worker 9:    unknown function (ip: (nil))
      From worker 5:    start_task at /opt/julia/src/task.c:850
      From worker 5:    unknown function (ip: (nil))
      From worker 7:    start_task at /opt/julia/src/task.c:850
      From worker 7:    unknown function (ip: (nil))
      From worker 6:    start_task at /opt/julia/src/task.c:850
      From worker 6:    unknown function (ip: (nil))
      From worker 8:    jl_apply at /opt/julia/src/julia.h:1703 [inlined]
      From worker 8:    jl_finish_task at /opt/julia/src/task.c:208
      From worker 8:    start_task at /opt/julia/src/task.c:850
      From worker 8:    unknown function (ip: (nil))
      From worker 3:    InterruptException:
      From worker 3:    Stacktrace:
sprig commented 3 months ago

I experience this as well, running inside a container based on jupyter/datascience-notebook:julia-1.9.3.

inside jupyter:

> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
  Threads: 2 on 8 virtual cores
Environment:
  JULIA_PKGDIR = /opt/julia
  JULIA_DEPOT_PATH = /opt/julia
$ jupyter lab version
4.0.7
$ python3 --version
Python 3.11.5

MWE:

using Distributed
addprocs(Sys.CPU_THREADS-2)
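After shutting the kernel down, the leftover workers can be spotted (and, if needed, killed) from the host shell. This is a sketch: it matches on the --worker flag that Distributed passes to spawned processes, as seen in the command line quoted earlier in this thread.

```shell
# List any Julia worker processes that survived the kernel shutdown.
# Distributed launches workers with the '--worker' flag.
pgrep -af 'julia.*--worker' || echo "no stray workers found"

# To reclaim the memory, kill them (use with care):
# pkill -f 'julia.*--worker'
```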