JuliaLang / IJulia.jl

Julia kernel for Jupyter
MIT License

Kernel frequently restarts during sessions #1037

Closed lewisl closed 2 years ago

lewisl commented 2 years ago

This has been reported many times and usually attributed to configuration errors.

Let me describe my setup:

  1. I have reinstalled Julia, using 1.7.2.
  2. I have reinstalled Python, using 3.10.4 from the Python Software Foundation (Anaconda and Miniconda are too large, and they are still on 3.9).
  3. I have updated all Python packages associated with Jupyter and notebooks.
  4. I have updated all Julia packages.
  5. I have rebuilt IJulia after setting ENV["JUPYTER"] to the correct location.
  6. I have created a new kernelspec using IJulia (see the installkernel sketch just after this list); it lives in ~/Library/Jupyter/kernels/julia-4-threads-1.7, and its kernel.json is:
    {
      "display_name": "Julia 4 threads 1.7.2",
      "argv": [
        "/Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia",
        "-i",
        "--color=yes",
        "/Users/lewis/.julia/packages/IJulia/AQu2H/src/kernel.jl",
        "{connection_file}"
      ],
      "language": "julia",
      "env": {
        "JULIA_NUM_THREADS": "4"
      },
      "interrupt_mode": "signal"
    }
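
For reference, a sketch of the IJulia call that generates a kernelspec like this (the display name is whatever you pass; the env keyword becomes the "env" block above):

using IJulia
# Writes kernel.json (plus logo assets) under ~/Library/Jupyter/kernels on macOS.
installkernel("Julia 4 threads", env = Dict("JULIA_NUM_THREADS" => "4"))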

The connection between JupyterLab (really, the notebook) and the kernel is frequently lost. The failure is highly inconsistent: sometimes a session runs for hours; sometimes it fails the first time it is used.

Basically, I think the in-memory pipe from the notebook process to Julia fails to make a successful connection after a few (3-4) invocations. Hunches:

  1. At first, I thought the problem was having more than one Julia REPL session running in external terminals. So I ran only the terminal session needed to start jupyter lab from the command line. This seemed to work once, and then it, too, hit the same failure.
  2. It is possible that the Julia instance started by the Julia language server in Visual Studio Code confuses the process communication. This can be ruled out: the problem occurs when I start JupyterLab from Julia with using IJulia; jupyterlab() (see the launch sketch after this list) with NO instance of VS Code running.
  3. It is very likely that the communication problem occurs because I am running the Apple Silicon version of Julia while there is, as yet, no Python build for Apple Silicon (actually, Homebrew might have one, but I have not looked for or tried it).
  4. The internal notebook support in Visual Studio Code is a nice idea with some nice features, but it is remarkably slow and even more unstable. I have given up on it. I doubt there are many Julia users at Microsoft (even in Microsoft Research), so I wouldn't expect much help there.
  5. I get the best results by shutting down the Mac, doing a cold boot, and running JupyterLab only in its own browser instance.
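
For item 2, the launch sequence is just this (a sketch; detached is optional and keeps the server running independent of the REPL):

using IJulia
# Starts JupyterLab via the jupyter that IJulia was built against (ENV["JUPYTER"]).
jupyterlab(detached = true)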

I suspect the investigation will need to be done from the Julia side and might require core debugging, which is a bear; it's unlikely the Python folks will give this much attention, though perhaps the Jupyter maintainers will be more supportive.

It is tempting to say it must be my Julia code. It is a very big package with a moderate amount of data. But, as suspect as much of my code is, I can run everything in a terminal session over and over with no hangups at all.

In fact, my current approach to getting some benefit from notebooks is to run a Jupytext .jl script file generated from the notebook, using VS Code's cell-by-cell execution. For example:

# %%
vaxsym = :Pfizer
# count people whose latest recorded dose is vaxsym and who have exactly 1 or 2 doses
dose1 = count(length.(locdat.vaxrcvd[last.(locdat.vaxrcvd) .== vaxsym]) .== 1)
dose2 = count(length.(locdat.vaxrcvd[last.(locdat.vaxrcvd) .== vaxsym]) .== 2)
println(vaxsym, " 1 dose cnt: ", dose1, " 2 dose cnt: ", dose2, " total ", 2*dose2 + dose1, " people cnt: ", dose1 + dose2)

# %% 

This works OK, using shift-option-enter to step through the cells. I set up an external Julia REPL because the integrated Julia REPL can encounter problems with VS Code's internal terminal. (I really do like VS Code--best game in town--but it takes them a while to get new features fully stabilized.)

It would be really nice to get this sorted out, and I'll help with my limited skills in low-level coding. Notebooks are helpful for sequential testing of code (to see intermediate results), for documenting results, and for documenting how-tos.

And, no, Pluto is not the answer. I've tried it, and it's still a bit too simplistic and imposes too many restrictions--chiefly, one result per cell--to be practical. I also think "pure Julia" is not always the best answer. It is good that some tools can support multiple languages--this is to be encouraged.

lewisl commented 2 years ago

This still occurs unpredictably. One day it will work and another day it will not.

I can't even pin down the differences between the contexts where it works and where it does not. This feels like some sort of memory leak or a handshaking failure with ZMQ on the Julia side.

There are two obvious immediate precursors: either a cell appears to run forever and Jupyter presumably terminates the kernel for exceeding some reply-time threshold, or the kernel simply restarts.

Would the log messages from the terminal session where Jupyter starts be helpful? Here are two suspicious-looking things:

[W 2022-05-07 15:36:52.300 LabApp] Could not determine jupyterlab build status without nodejs
[W 2022-05-07 15:37:04.317 ServerApp] Notebook notebooks/test variant.ipynb is not trusted

And

[I 2022-05-07 15:40:05.017 ServerApp] Adapting from protocol version 5.0 (kernel b2deefc6-4398-46c9-add1-034aee317a17) to 5.3 (client).
[I 2022-05-07 15:40:05.030 ServerApp] Adapting from protocol version 5.0 (kernel b2deefc6-4398-46c9-add1-034aee317a17) to 5.3 (client).
Starting kernel event loops.

Could the mismatch in protocol versions be causing this?

stevengj commented 2 years ago

See the troubleshooting section of the manual.

I haven’t seen this problem myself.

lewisl commented 2 years ago

I added a message earlier saying that extending Jupyter's kernel timeout to 20 seconds seemed to solve the problem when running Plots for the first time.

I am sorry to report that this no longer works.

It would appear that success or failure is very nearly random. The only reliable fix is to reboot my Mac. Since the kernel connection relies on some version of ZeroMQ, a memory-related issue there is a plausible culprit; I still suspect some kind of memory leak.

Try using Plots and see if that doesn't cause the kernel to die for you while Plots spins compiling for the first time.
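
A minimal way to try it, assuming Plots.jl is installed (the first plot call is the step that triggers the long compile):

using Plots      # first load of Plots is the slow, allocation-heavy step
plot(rand(10))   # first plot compiles the pipeline; this is where my kernel dies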

I'll soon be switching away from Plots (much as I truly like it) and will see if it is the culprit, but it is the only thing that seems to kill the Julia kernel in otherwise quite elaborate notebooks.

lewisl commented 2 years ago

So, the problem appears to be memory usage: Julia itself, IJulia, or Jupyter appears to have a massive memory leak.

  1. Why not Julia? I can run the same sequence of function calls in the REPL, and the REPL/Julia never ends abnormally.
  2. The memory usage that Jupyter reports via the jupyter-resource-usage extension simply continues to grow. At a bit over 2 GB, the kernel dies.
  3. Running GC.gc() after "big" memory operations has no apparent effect: the resource usage reported by Jupyter is unchanged. Running empty!(Out) also has no apparent effect. This isn't surprising, because I assign the important output to Julia variables, so there are Julia globals holding this data.
  4. varinfo() is the way to see how much memory those globals hold (see the diagnostic sketch after this list). It reports approximately 109 MB, which hardly seems excessive.
  5. The macOS Activity Monitor, which shows a process's memory consumption while an app is running (or at rest, if one waits until active work completes), shows Julia using 983.2 MB (my data plus Julia itself) and Python using 109.6 MB (which should account for Python running the notebook plus Python itself). That is a little over 1 GB. On a 32 GB machine that hardly counts. Brave (a Google Chrome clone with the Google tentacles removed) shows approximately 1.5 GB (now that is a lot...). Still, the machine isn't showing any strain.
So, it turns out that Jupyter itself sets a memory limit called NotebookApp.max_buffer_size. The default is 536870912 bytes, i.e. around 537 MB or half a gig. While Julia shows 983-ish MB at rest, not all of that counts against the Jupyter notebook--only the data that gets created (plus the space to hold code). I don't know how much Julia with its packages consumes at rest, but let's say it's around 650 MB. That means I have consumed around 330 MB. It's not clear how, when varinfo() reports much less--but transient usage could be much higher, since my code does lots of allocation, especially when preparing a plot. I'll look at that... and see what dumb thing I am doing.
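
If I read the docs right, the limit can be raised at launch; a sketch (ServerApp.max_buffer_size is jupyter_server's spelling of the NotebookApp trait above, and 1073741824 bytes is 1 GiB):

# Launch JupyterLab with a 1 GiB buffer instead of the ~512 MiB default.
# Assumes jupyter is on PATH; the classic notebook spells the trait NotebookApp.max_buffer_size.
run(`jupyter lab --ServerApp.max_buffer_size=1073741824`)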

In the meantime, I'll increase Jupyter's limit as above and report back.

lewisl commented 2 years ago

Last report for a while. I gave Jupyter 1 GB of buffer memory. The kernel stays running longer.

I ran my code several times. Memory usage grows: from the first run to the final run, the Jupyter resource-usage readout increases by 470 MB.

I ran Sys.free_memory() a couple of times, and it reports free memory reduced by over 500 MB. So both are in the same ballpark.

I will build some diagnostics into the notebook to track this more carefully--something like the hypothetical helper below.
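
A sketch (memlog is my own name for it), called between the heavy cells:

# Hypothetical helper: log OS-reported free memory next to the bytes the Julia GC
# considers live, both in MiB.
memlog(tag) = println(tag, ": free=",
    round(Sys.free_memory() / 2^20, digits = 1), " MiB, gc_live=",
    round(Base.gc_live_bytes() / 2^20, digits = 1), " MiB")

memlog("after simulation run")   # example call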

I will also run the same sequence in the REPL (easy, because I use Jupytext, so I have a Julia script matching the notebook) and see how free memory changes. Granted, other things can happen behind my back (like email coming in); I doubt that accounts for much, but I'll make sure to stop the obvious apps within my control.

Comparing the notebook runs and the REPL runs will (maybe) show whether the leak is in Julia itself (doubtful, methinks--though of course the REPL doesn't have an arbitrary ceiling the way the Jupyter notebook does) or in the IJulia/notebook combination.

lewisl commented 2 years ago

Well, since I started this issue, I am going to close it.

The trouble seems almost random. I thought allowing the kernel a longer timeout would fix it. Sometimes, yes; sometimes, no.

Then I thought it was a memory problem. So, I gave Jupyter a very large memory buffer. Sometimes, yes; sometimes, no.

Then I added Sys.free_memory() calls to the notebook to see what was happening to available memory--could there be a memory leak? Well, this turns out not to be a good way to find real free memory on macOS. macOS is greedy about grabbing memory to cache things it thinks it may need to reload; it will release that cached file memory as needed, but queries for free memory may or may not reflect that, depending on how the query runs. Plus, there is constant background activity. And regardless of the amount of free memory, sometimes the notebook ran; sometimes it didn't.

Then I thought Plots with the GR backend was the problem (because the kernel almost always hung on running a function that created a plot). So I switched to the Plotly backend, thinking that might reduce the load/compile time, shrink the memory footprint, or simply be more reliable on Apple Silicon. Well, that sort of worked for a while. Then it didn't.

Then the cell where I run the simulation started hanging the kernel. That was a first--it hadn't happened in months and months.

So, all this is so arbitrary and inconsistent there is really no way anyone can diagnose it.

I think the problem is a mix of Apple Silicon builds and Intel builds. Julia 1.7.2 offered a (still beta-quality) native Apple Silicon build, but maybe some components of Jupyter are not native, even though there is now a native beta build of Python 3.x. And what about ZeroMQ?

So I threw in the towel and went back to the Intel build of Julia for macOS--stable release 1.7.3. I deleted my Julia install and all the packages and started over.

Now I can run all of the cells of the notebook over and over. It never hangs. So that is the price of the bleeding edge: there will be hard things to debug. I'm not sure what the best tool would be--probably Valgrind.

So, closing this.