2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

[Incident] Kernel restarts and internal gateway errors for U. Texas hub #1021

Closed choldgraf closed 2 years ago

choldgraf commented 2 years ago

Summary

We recently deployed the UT Austin hub, which runs a database sidecar for each student (https://github.com/2i2c-org/infrastructure/issues/968). Since the hub came into use, we've received reports of intermittent kernel crashes affecting students. These have been difficult to reproduce, but have been happening consistently.

There was originally a FreshDesk ticket about this here: https://2i2c.freshdesk.com/a/tickets/84?lang=en

We thought we had resolved it by increasing the memory limits for each user session, but these issues have persisted. Here are the two major things that we've found:

  1. Users experiencing "kernel died" errors sporadically
  2. Users getting "internal gateway errors" when using pgadmin (https://github.com/2i2c-org/infrastructure/issues/968#issuecomment-1048229587)

Here's an issue where we describe some of the problems people have, and how to reproduce them:

https://github.com/2i2c-org/infrastructure/pull/1008#issuecomment-1047372853

Hub information

Timeline (if relevant)

All times in US/Pacific.

2022-02-17

2022-02-18

We conclude that this may be a problem with the memory limits students are hitting during queries, and bump their memory guarantee here: https://github.com/2i2c-org/infrastructure/pull/1008

2022-02-21

We get reports that many students are still running into this problem. It is not reliably reproducible, but people are consistently hitting these errors.

ref: https://github.com/2i2c-org/infrastructure/pull/1008#issuecomment-1047040355

TODO: add more as new developments happen


After-action report

These sections should be filled out once we've resolved the incident and know what happened. They should focus on the knowledge we've gained and any improvements we should make.

What went wrong

Things that could have gone better. Ideally these should result in concrete action items that have GitHub issues created for them and linked to under Action items.

Where we got lucky

These are good things that happened to us but not because we had planned for them.

Action items

These are only sample subheadings. Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/; they can be in other repositories.

Process improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Documentation improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Technical improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Actions

choldgraf commented 2 years ago

@jameshowison - if you like, feel free to add context about the problems you're seeing here. I opened this issue so that we can track this as an ongoing incident, because it sounds more complex than expected.

choldgraf commented 2 years ago

Here are some potentially relevant graphs from our JupyterHub dashboard in Grafana. The following plots are just for the utexas hub. It seems like at some moments the hub response latency spikes, and I wonder if that's related to these problems as well.

[screenshot: Grafana plots of hub response latency for the utexas hub]

jameshowison commented 2 years ago

I think those peak times correspond with my class (which is 11:00-12:30 Tues/Thurs, Central time).

The only approach to replicating it is to do what I'm asking the class to do:

  1. Open up a new server at https://utexas.pilot.2i2c.cloud/
  2. Import the class_music_festival github folder via: http://utexas.pilot.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhowisonlab%2Fdatawrangling-exercises&branch=main&urlpath=lab%2Ftree%2Fdatawrangling-exercises%2Fclass_music_festival
  3. Run create_class_music_festival.ipynb
  4. Run load_class_music_festival.ipynb
  5. Run queries against the class_music_festival database, e.g.:
SELECT bands.name
FROM bands
WHERE bands.id IN (11, 16, 22, 23, 30, 48, 11, 55, 22, 16, 55, 23, 30, 100, 108, 48)
choldgraf commented 2 years ago

Just noting that I tried this and was able to run all of those commands without an issue. So maybe this is an error that only pops up when a lot of users are doing it at once?

jameshowison commented 2 years ago

Yes, the issue is definitely transient.

The thing is that I don't know what to do once faced with it. If I were doing this on a server directly I'd be looking at logs, but I don't know if/how to access them here. Is there some way to centralize logs for review, or could I note the username and time for later review?

I'm trying to do a screenshare with a student who has it frequently (including at non-peak times and with a totally new instance from a new GitHub username). That makes me wonder whether it is a client browser issue; is that possible?

I'm also going to try giving people instructions for doing queries via IPython SQL magics, in case this is a xeus-sql issue (I think it has only happened in the xeus-sql kernels, but we haven't used IPython kernels yet).
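
For reference, a rough sketch of what those fallback instructions might look like with the ipython-sql extension. The connection string is a placeholder - the actual host/database/credentials for the sidecar postgres depend on how the hub image is set up:

# Cell 1: load the ipython-sql extension and connect to the sidecar database.
# (Placeholder connection string - adjust user/database to the hub's setup.)
%load_ext sql
%sql postgresql://jovyan@localhost:5432/class_music_festival

# Cell 2: run the same kind of query via the %%sql cell magic.
%%sql
SELECT bands.name
FROM bands
WHERE bands.id IN (11, 16, 22, 23, 30, 48, 55, 100, 108)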

jameshowison commented 2 years ago

I just did a screen share with the user on the hub with the github username lkiil. She is getting the error consistently when running commands with the homework.ipynb file. Perhaps someone can get into that server and take a look?

I tried using the built-in terminal to find logs, but no luck. Googling suggested sudo journalctl -u jupyter would help, but no dice (sudo is not installed), so I'm guessing one would need to ssh into the container for that to be useful.

jameshowison commented 2 years ago

I was able to use the admin interface to connect to the server for lkiil and I can confirm that I'm seeing the error when I execute cells in my browser. Can't see any logs via the notebook terminal, though.

yuvipanda commented 2 years ago

I'm sorry you're running into this, @jameshowison - definitely frustrating. Thank you for helping us debug as we go along!

Notebook logs are written to the ~/.jupyter-server-log.txt file - you can check that out. I looked and found that xeus-sql was crashing with the error message terminate called after throwing an instance of 'char const*'. I used our Google Cloud logger (query) to search for that particular error message across all notebook logs, and found that it has been happening for a while (just looking at the last 7 days):
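
If you want to check that per-user log yourself from a notebook on the hub, a minimal sketch (it only assumes the log path above):

# Search the per-user server log for the xeus-sql crash message.
from pathlib import Path

log = Path.home() / ".jupyter-server-log.txt"
for line in log.read_text(errors="replace").splitlines():
    if "terminate called" in line:
        print(line)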

[screenshot: Google Cloud Logging results for that error message over the last 7 days]

I think if we could actually get the value of the char const* we might be able to get a better idea of what the error is. I've opened https://github.com/jupyter-xeus/xeus-sql/issues/60 to let the xeus-sql developers know.

yuvipanda commented 2 years ago

On the postgresql container side, I can see that there's an error message saying 'FATAL: terminating connection due to administrator command' (query), and its frequency looks like this:

[screenshot: frequency of the postgres 'terminating connection' error over time]

Looking around, it looks like this might be caused by an external process sending SIGTERM to the database (https://www.postgresql.org/message-id/4564.1284559661@sss.pgh.pa.us). In our case, that might be because postgres is hitting a memory limit. I'm going to bump the limit up and see if that helps - earlier we had just bumped up the limit for the data the kernel can handle, but now I am going to do that for the database itself.

yuvipanda commented 2 years ago

But looking at memory usage for the postgres container itself:

[screenshot: memory usage graph for the postgres container]

The current memory limit is 512Mi, and we are nowhere near touching that.

yuvipanda commented 2 years ago

I've also installed jupyterlab-limit-output - while it seems to work in Python, it has no effect on xeus-sql kernel notebooks :(

[screenshot: jupyterlab-limit-output truncating output in a Python notebook]

yuvipanda commented 2 years ago

@jameshowison I've brought in a newer version of xeus-sql with https://github.com/jupyter-xeus/xeus-sql/issues/60 merged, so hopefully the error logs are more helpful next time.

jameshowison commented 2 years ago

Great. I can just ask everyone to stop and start their server, right? And that will update everything?

yuvipanda commented 2 years ago

@jameshowison nobody else is on, so I think everyone (including you!) has the newest version already.

I'm investigating the pgadmin issue. Not sure why that came back...

yuvipanda commented 2 years ago

@jameshowison ok, for the large outputs freezing the browser, I have a partial solution: use the classic notebook. I've enabled https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/limit_output, which seems to work well.

Switch to the classic notebook interface using the interface switcher in JupyterLab (top right):

[screenshots: switching to the classic notebook interface via the JupyterLab menu]

If I then run SELECT * FROM tickets, I don't crash the browser! Instead, I see at the end of my scroll:

[screenshot: limit_output truncation notice at the end of the output]

You can also modify your nbgitpuller links to point to the classic notebook instead of JupyterLab if you would like, although then you lose a lot of the benefits of JupyterLab.
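
As a sketch of what that link change looks like (the repo/branch values are taken from the reproduction steps above; if I'm remembering the nbgitpuller URL scheme right, only the urlpath changes):

from urllib.parse import urlencode

# Build an nbgitpuller link that opens in the classic notebook UI.
params = {
    "repo": "https://github.com/howisonlab/datawrangling-exercises",
    "branch": "main",
    # "lab/tree/datawrangling-exercises/class_music_festival" opens JupyterLab;
    # dropping the "lab/" prefix opens the classic notebook instead.
    "urlpath": "tree/datawrangling-exercises/class_music_festival",
}
print("https://utexas.pilot.2i2c.cloud/hub/user-redirect/git-pull?" + urlencode(params))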

yuvipanda commented 2 years ago

There are basically three issues:

  1. Kernel dying
  2. Browser freezing
  3. pgadmin4 not working

Today, I've put in place fixes for (1) and (2), but not (3). Intermittent issues are always a bit of a mess to debug... I'm hopeful I can put in some more time for (3) tomorrow.

yuvipanda commented 2 years ago

I've also opened https://github.com/deshaw/jupyterlab-limit-output/pull/2 to add support for xeus-sql in jupyterlab-limit-output.

choldgraf commented 2 years ago

This is really excellent investigative engineering @yuvipanda :-)

Would it be helpful for us to document the jupyterlab-limit-output extension as well? I'm happy to submit a PR if we think it'll be common enough to put in docs/

(as an aside, what a strange world we live in where we can just submit a patch to a random JupyterLab extension developed by DEShaw)

jameshowison commented 2 years ago

Thanks, all. Status going into class tomorrow is that we have bumped memory on postgres and improved logging with xeus-sql. I intend to use xeus-sql, with the %%sql magics as a fallback. If we encounter issues I'll take a look at the Jupyter logs and flag the user accounts here (so we have a chance of seeing them before the container shuts down after class).

I don't read this as a definite diagnosis of what was causing it; the best assessment is some resource spike on the postgres container, right? My only input there is that I think I was always able to query the postgres sidecar, but I'll check for certain from the terminal if issues show up again. Perhaps it was just transitory: xeus-sql had already terminated, and by the time it came back up postgres was responsive again?
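
For the record, the kind of terminal check I have in mind is something like this (a rough sketch; it assumes psycopg2 is available in the image, and the connection details are placeholders):

import psycopg2

# Quick liveness check against the sidecar postgres from the user server.
try:
    conn = psycopg2.connect(host="localhost", dbname="class_music_festival", user="jovyan")
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print("postgres is responding:", cur.fetchone())
    conn.close()
except Exception as exc:
    print("postgres is NOT responding:", exc)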

One student who was having issues says she hasn't seen them in the second half of the day today, which is hopeful!

yuvipanda commented 2 years ago

@jameshowison yeah, current working diagnosis is that it's a resource limit issue on the postgres database. However, the hope is that the additional debug logging in xeus-sql can give us a better answer for why the kernel is crashing - with the previous version, that information was being lost. I'm hopeful that if the kernels crash again, we'll be investigating less blindly.

jameshowison commented 2 years ago

This morning, while prepping for class, I saw a new error: "Invalid Response: 504".

[screenshot: "Invalid Response: 504" error]

The log shows:

[E 2022-02-08 16:56:10.202 SingleUserNotebookApp mixins:538] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 536, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2022-02-08 16:56:10.975 SingleUserNotebookApp mixins:538] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 536, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2022-02-08 16:56:11.132 SingleUserNotebookApp mixins:538] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 536, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
choldgraf commented 2 years ago

Hmm - does that error persist? Do you know if others have run into it?

jameshowison commented 2 years ago

It's only me working with it at present, but I'm working in IPython rather than xeus-sql, so I have a backup if it does.

jameshowison commented 2 years ago

No sign of kernel issues during class today, and no repeat of the "Invalid Response" error. I will report back as students do their assignments over the remainder of the week and weekend.

yuvipanda commented 2 years ago

@jameshowison GREAT TO HEAR! Is that with xeus-sql or with the ipython magic?

jameshowison commented 2 years ago

xeus-sql, so that's good!

jameshowison commented 2 years ago

I've run into another issue. A student's xeus-sql ipynb file is making the server unresponsive. The file is 33.9 MB, so I suspected it contains a number of queries that returned very large results. I downloaded the file and, yes, it has some very large output cells.

https://utexas.pilot.2i2c.cloud/user/andreakxn/lab/workspaces/auto-S/tree/datawrangling-exercises/class_music_festival/Untitled.ipynb

I'm just not sure how to handle this. I have asked that they include LIMIT 100 or something as they work, but we need some guard rails here.

There is a discussion on xeus-sql of options for limiting results: https://github.com/jupyter-xeus/xeus-sql/issues/59
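
One possible guard rail on the ipython-sql side, if we end up leaning on the %%sql fallback, is its built-in result limits (a sketch; the specific numbers here are just examples):

# In an IPython kernel, cap how many rows queries return and display.
%load_ext sql
%config SqlMagic.autolimit = 100      # truncate result sets at 100 rows
%config SqlMagic.displaylimit = 100   # render at most 100 rows in the notebook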

jameshowison commented 2 years ago

btw, I removed that large file, but I have a copy if it is ever needed for debugging.

jameshowison commented 2 years ago

The "Invalid Response: 504" error occurred again this morning, immediately after logging in and creating a new xsql file. Choosing "Restart" and the new xsql kernel seemed to work.

choldgraf commented 2 years ago

hmmm - maybe it's a race condition of some kind? This only ever happens right after logging in and running for the first time?

jameshowison commented 2 years ago

Yes, so far I've only seen it on startup. Will report if I see it at other times.

jameshowison commented 2 years ago

Still having this issue of the Bad Gateway popping up. Sometimes it goes away when I tell it to restart xsql, but just now I had to choose "No Kernel" and then start an xsql kernel.

Scroll down in http://howisonlab.github.io/datawrangling/class_modules/sql_queries/01_intro/ to see what I mean

sgibson91 commented 2 years ago

@2i2c-org/tech-team What do we need to move this forward? Do we need the ~/.jupyter-server-log.txt from the next time a Bad Gateway error appears, or is there another path?

choldgraf commented 2 years ago

Maybe @yuvipanda has some ideas here, since he set this one up. I seem to remember he said this suggests a websockets issue, but I'm not sure how to debug that. I did find a Stack Overflow post with the same error message, and indeed they mention websockets there as well: https://stackoverflow.com/questions/57297987/jupyterhub-kernel-connection-returns-http504-gateway-timeout

sgibson91 commented 2 years ago

I thought the websockets issue was related to this ticket? https://discourse.pangeo.io/t/google-deployment-stuck/2280

yuvipanda commented 2 years ago

I think current state is:

  1. xeus-sql is producing some weird 504 gateway errors
  2. ipython-sql works alright, doesn't cause any issues
  3. xeus-sql produces results sometimes that are too large and freezes user browsers
  4. pgadmin4 is still broken

Is this an accurate summary of the current status, @jameshowison? If ipython-sql works fine, then perhaps the longer-term approach is to make sure ipython-sql works well, and fix pgadmin4...

choldgraf commented 2 years ago

Hey all - I am going to close out this issue, since I think the hub is in a good-enough steady state, and doesn't require active changes to it. We can spot-check improvements via support channels and new issues.