berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Stat 20 server start time #3836

Closed ryanlovett closed 1 year ago

ryanlovett commented 2 years ago

Bug description

On the evening of Wed. Oct. 12, stat20 students reported that their servers were slow to start. NFS activity showed no heavy test_stateid ops, but steady getattr and one read spike at 8:07p:

[Screenshot: NFS activity, 2022-10-12, 10:40 PM]

It looks like server startup times increased beginning at about 8:45p, with servers taking up to 2 minutes to start:

[Screenshot: server startup times, 2022-10-12, 10:46 PM]

I'll collect more info in this issue.

Environment & setup

How to reproduce

ryanlovett commented 2 years ago

It looks like the hub pod was restarted 86m ago (~9:38p), which would have affected students starting up their servers, but not those who were already running.

hub-5d497fffb-wtx8f             1/1     Running   3 (86m ago)   2d7h
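
For reference, a pod listing in that shape comes from something like the command below. This is a rough sketch assuming the usual z2jh component=hub label and the stat20-prod namespace (the namespace shows up in the scheduling events further down in the thread):

# List the hub pod; RESTARTS "3 (86m ago)" means three container restarts,
# the most recent one 86 minutes before the query.
kubectl -n stat20-prod get pods -l component=hub
# kubectl -n stat20-prod logs <hub-pod> --previous would show why the prior container exited.
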
ryanlovett commented 2 years ago

Here's a 24h snapshot:

[Screenshot: 24h snapshot, 2022-10-12, 11:05 PM]
ryanlovett commented 2 years ago

@andrewpbray shared this student's screenshot: [screenshot]

ryanlovett commented 2 years ago

This happened one hour ago.

60m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/42 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 28 Insufficient memory. preemption: 0/42 nodes are available: 16 Insufficient memory, 26 Preemption is not helpful for scheduling.
60m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/43 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 28 Insufficient memory. preemption: 0/43 nodes are available: 16 Insufficient memory, 27 Preemption is not helpful for scheduling.
59m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/43 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 29 Insufficient memory. preemption: 0/43 nodes are available: 16 Insufficient memory, 27 Preemption is not helpful for scheduling.
59m         Normal    Scheduled          pod/jupyter-[snip]                  Successfully assigned stat20-prod/jupyter-[snip] to gke-fall-2019-user-beta-2021-11-06-d520c2e1-6m77
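
Those lines look like kubectl event output. A sketch of how to pull them, assuming the stat20-prod namespace:

# Recent scheduling events for pods in the hub's namespace, oldest first;
# FailedScheduling entries explain why user pods sat Pending.
kubectl -n stat20-prod get events --sort-by=.lastTimestamp | grep -E 'FailedScheduling|Scheduled'

Reading the messages: no node could take the pod, and the nodes matching the pool's affinity/selector were out of memory, so pods stayed Pending until the autoscaler added capacity (which lines up with the Scheduled event about a minute later).
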
ryanlovett commented 2 years ago

Non-running pods: it seems that none failed, but many were pending and eventually started.

[Screenshot: non-running pods, 2022-10-12, 11:26 PM]
yuvipanda commented 2 years ago

In https://github.com/berkeley-dsep-infra/datahub/pull/3839 I'm giving stat20 its own node pool, with a 5-node minimum.
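
The real change is in the PR; as a rough sketch of what a dedicated autoscaled pool with a 5-node floor looks like on GKE (pool name, cluster name, and max size below are illustrative, not taken from the PR):

# Hypothetical gcloud invocation; the label/taint values follow the usual
# z2jh user-node convention rather than anything confirmed from the PR.
gcloud container node-pools create user-epsilon \
  --cluster=fall-2019 \
  --enable-autoscaling --min-nodes=5 --max-nodes=20 \
  --node-labels=hub.jupyter.org/node-purpose=user \
  --node-taints=hub.jupyter.org_dedicated=user:NoSchedule

The point of the 5-node floor is that stat20 pods land on warm nodes that already have the image pulled, rather than waiting on a scale-up plus an image pull.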

ryanlovett commented 2 years ago
[Four screenshots, 2022-10-13, 9:46-9:49 AM]
ryanlovett commented 2 years ago

The beta pool scaled down to 14 nodes at 8:42p, then repeatedly scaled up and down until 10:18p. Non-running pods spiked over that same time period, probably as a result.

@yuvipanda created the epsilon pool in #3839 and moved the stat20 hub into it.

@yuvipanda @balajialg If this issue was caused by image-pulling delays, should there be some heuristic for how many different hub images a node pool should support?

stat20's image is based on rocker, and we can probably consolidate all rocker-based images into one. However, I don't know if there are incompatibilities between libraries in the different rocker images.

Before tonight, the hubs with rocker-based images using the beta pool were biology, ischool, publichealth, and stat20.
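
One way to ground that heuristic is to enumerate which images each pool is actually running. A sketch (not from this repo), assuming GKE's cloud.google.com/gke-nodepool node label; the pool name below is inferred from the node name in the scheduling events above:

# Distinct container images currently running on a given node pool.
POOL=user-beta-2021-11-06
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool="$POOL" -o name); do
  kubectl get pods --all-namespaces --field-selector spec.nodeName="${node#node/}" \
    -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}'
done | tr ' ' '\n' | sort -u

The more distinct single-user images a pool serves, the longer a freshly scaled-up node takes to become useful, since each image has to be pulled before its pods can start.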

balajialg commented 2 years ago

@ryanlovett Thanks for the comprehensive report! It will be helpful if and when we choose to write an after-action report. Maybe @shaneknapp and @felder, in discussion with you, can review the existing images across the different node pools and make the final call on heuristics? I will also add this as an agenda item for our upcoming team meeting.

I see that we have been moving other images across node pools in response to issues; see https://github.com/berkeley-dsep-infra/datahub/pull/3705 and https://github.com/berkeley-dsep-infra/datahub/pull/3615. Might as well take a proactive stance going forward!

shaneknapp commented 2 years ago

yep, all sounds good.

ryanlovett commented 2 years ago

stat20 has gone from ~10 users to ~190 over the past couple of hours, and it looks like the epsilon node count is rising steadily rather than fluctuating.

[Two screenshots, 2022-10-14, ~10:20 AM]
ryanlovett commented 2 years ago

There have been a few occurrences where servers took as long as ~1m to start, though:

[Screenshot: server start times, 2022-10-14, 10:21 AM]
yuvipanda commented 2 years ago

Yeah, 1 min is fine; 10 min is when the hub starts counting them as dead. I think this would've caused a similar outage to last time if we hadn't moved them to their own pool.

ryanlovett commented 2 years ago

When I display the data for many of the -prod hubs, there are a lot of other server start times of just under 1m (not stat20-specific), and they're clustered together, roughly 15m apart (11:01a, 10:46a, 10:31a, 10:16a, ...).

[Screenshot: server start times across -prod hubs, 2022-10-14, 11:15 AM]

For comparison, I picked a day two weeks back: start times were mostly around 10s, and they weren't clustered together.

[Screenshot: server start times from two weeks earlier, 2022-10-14, 11:32 AM]

So perhaps the NFS restarts are pushing some set of server start times out to ~1m.

ryanlovett commented 1 year ago

stat20 has been moved to a Filestore instance and will also retain its own node pool going forward.