elyra-ai / elyra

Elyra extends JupyterLab with an AI centric approach.
https://elyra.readthedocs.io/en/stable/
Apache License 2.0

JupyterLab freezes when the output from a shell command exceeds a certain size #2825

Open · akchinSTC opened this issue 2 years ago

akchinSTC commented 2 years ago

Describe the issue
When the output from a shell command exceeds a certain size, JupyterLab and classic notebooks simply freeze. This can be reproduced by writing on the order of 100 KB to stdout: !head -c 118000 </dev/urandom. Printing the same amount of data to stdout from Python does not cause any issues; in fact, Python allows much larger outputs. This only seems to affect containerized/resource-restricted deployments of Elyra at the moment, since it cannot be reproduced locally on my workstation. (Edit: ignore that, I can reproduce it locally on my workstation after a flat install.) The problem also appears to be intermittent: in some cases the first execution of the command in a cell runs through without issue, while subsequent executions of the same cell stall.
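For comparison, a Python cell along these lines produces a similar volume of output without hanging (a sketch only; the exact cell used for the comparison isn't shown here):

```python
# Sketch: write roughly the same amount of data to stdout from Python.
# Note that the hex() rendering keeps the output plain ASCII, whereas
# `head </dev/urandom` emits arbitrary bytes that are often not valid UTF-8.
import os

print(os.urandom(59000).hex())  # ~118,000 characters of output
```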

To Reproduce
Steps to reproduce the behavior:

  1. Open Kubeflow's notebook launcher
  2. Request at least 2 CPUs (or more if your deployment allows it)
  3. Launch an instance of elyra/kf-notebook:latest
  4. Open a new Jupyter notebook
  5. Run !head -c 118000 </dev/urandom

Screenshots or log output
[screenshots attached]

Expected behavior
Output from the shell command to be displayed in the cell. [screenshot attached]

Deployment information
Describe what you've deployed and how:

ptitzler commented 2 years ago

I've recreated this using the plain JupyterLab deployment configuration that comes with Kubeflow [notebook server] 1.5. Since none of the Elyra extensions are installed, this should confirm that it is an upstream issue.

akchinSTC commented 2 years ago

Thanks for checking. I'll see if there's an existing issue upstream in Jupyter, open one if there isn't, and go from there.

ptitzler commented 2 years ago

Also confirmed the problem in a stand-alone installation of JupyterLab (3.4.3). The hang did occur intermittently. Updating the issue title to reflect that this is not Elyra specific.

akchinSTC commented 2 years ago

There is a timeout error that appears after approx. 2 min.

[screenshot of the timeout message attached]

kevin-bates commented 2 years ago

Can someone add a description of the hang or freeze? Is the hang "permanent" in that the only recourse is to terminate the web server? Is there anything that can be accomplished during this period?

ptitzler commented 2 years ago

It depends on which action is taken in the browser's "page unresponsive" dialog [screenshot attached]:

- Wait: the page remains unusable.
- Exit: the page goes away. [screenshot attached]

Irrespective of which action is selected, one can open a new JupyterLab window (the server seems to remain operational).

kevin-bates commented 2 years ago

Thanks @ptitzler - this was helpful. I was able to reproduce the issue by increasing the byte count on the head call.

It strikes me as a front-end/browser issue because I can hit the server using REST calls to other services (get kernelspecs, start/delete a kernel, etc.). In addition, when submitting the cell, I see the response stream, followed by the idle status, almost immediately...

[D 2022-07-11 19:43:43.151 ServerApp] activity on b55119d4-5224-45cc-a747-db66584c549a: status (busy)
[D 2022-07-11 19:43:43.152 ServerApp] activity on b55119d4-5224-45cc-a747-db66584c549a: execute_input
[D 2022-07-11 19:43:43.236 ServerApp] activity on b55119d4-5224-45cc-a747-db66584c549a: stream
[D 2022-07-11 19:43:43.305 ServerApp] activity on b55119d4-5224-45cc-a747-db66584c549a: stream
[D 2022-07-11 19:43:43.340 ServerApp] activity on b55119d4-5224-45cc-a747-db66584c549a: stream
[D 2022-07-11 19:43:43.437 ServerApp] activity on b55119d4-5224-45cc-a747-db66584c549a: status (idle)
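(For anyone who wants to reproduce the server-side check mentioned above, here is a sketch of polling a few standard Jupyter Server REST endpoints from outside the frozen browser session; the base URL and token below are placeholders for your deployment's values.)

```python
# Sketch: confirm the server still answers REST requests while the Lab tab is frozen.
# BASE_URL and TOKEN are placeholders for your deployment's values.
import urllib.request

BASE_URL = "http://localhost:8888"  # placeholder
TOKEN = "<your-token>"              # placeholder

for path in ("/api/status", "/api/kernelspecs", "/api/kernels"):
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"token {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        print(path, resp.status, len(resp.read()), "bytes")
```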

Shortly after, I see a 90-second websocket timeout, which I suspect is because the front-end isn't able to ping it or keep it alive (but that's purely conjecture).

[W 2022-07-11 19:44:49.799 ServerApp] WebSocket ping timeout after 90000 ms.
[D 2022-07-11 19:44:54.800 ServerApp] Websocket closed b55119d4-5224-45cc-a747-db66584c549a:5fef0422-b828-44ce-903b-151cffbfb5da
[I 2022-07-11 19:44:54.800 ServerApp] Starting buffering for b55119d4-5224-45cc-a747-db66584c549a:5fef0422-b828-44ce-903b-151cffbfb5da
[D 2022-07-11 19:44:54.801 ServerApp] Clearing buffer for b55119d4-5224-45cc-a747-db66584c549a
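The 90000 ms figure looks like the server's websocket keep-alive default (if memory serves, a 30 s ping interval with a timeout of three missed pings). If tuning it helps isolate the issue, the values should be adjustable through the server's tornado settings; the setting keys below are my assumption and should be verified against the installed jupyter_server version:

```python
# jupyter_server_config.py -- sketch only; the ws_ping_* keys are assumed to be
# the tornado settings consulted by the server's websocket keep-alive logic.
c = get_config()  # provided by Jupyter when the config file is loaded
c.ServerApp.tornado_settings = {
    "ws_ping_interval": 30000,  # ms between server-to-client pings
    "ws_ping_timeout": 90000,   # ms without a pong before the socket is closed
}
```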

All in all, this is feeling like an investigation that needs to start from the Lab side of things (FWIW).

We should scan Stack Overflow and the Jupyter Community Forum for similar reports. This seems like a rendering issue.

~I'd be curious to know how Lab decides to display the "page unresponsive" dialog and perhaps work backward from there.~ EDIT: Duh, this is the browser indicating that things aren't going well. I see similar behavior (with different presentations) using other browsers.

kevin-bates commented 2 years ago

So I let my 1 MB byte count (!head -c 1048576 < /dev/urandom) run last night and the output did eventually show up in the notebook. However, the notebook couldn't be saved (via checkpointing) due to a JSON serialization issue:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud9a4' in position 7268: surrogates not allowed
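That traceback means the captured output ended up containing an unpaired Unicode surrogate (U+D9A4), which a Python str can hold but the UTF-8 codec refuses to encode. A minimal illustration of that constraint (not the exact code path the notebook save takes):

```python
# Minimal illustration: a lone (unpaired) surrogate can live in a Python str,
# but encoding it as UTF-8 fails, which is what the checkpoint save runs into.
lone_surrogate = "\ud9a4"  # the code point named in the traceback
try:
    lone_surrogate.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc)  # 'utf-8' codec can't encode character '\ud9a4' ... surrogates not allowed
```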

I can also bring up Lab in an entirely different browser against the same server and it works fine, so the hang is in Lab (the client), not the server.

ptitzler commented 2 years ago

@daschnerm, please subscribe to the issue for updates

daschnerm commented 2 years ago

@ptitzler / @kevin-bates Any update regarding this? Should we additionally raise an issue with Jupyter?

kevin-bates commented 2 years ago

Hi @daschnerm. @akchinSTC has been out this past week, but I believe we (the Elyra core team) had talked about @akchinSTC and @ajbozarth taking this up from the Lab side of things, after I confirmed that only the browser session is frozen while the server is not. @ajbozarth is a committer for Lab and can navigate the code base best (along with having closer contact with other Lab devs).

I suspect that Lab devs will know what's going on and/or have heard of this before (i.e., there's probably an existing issue for this). The client-side (browser) logs should also be analyzed for anomalies.