dandi / dandi-hub

Infrastructure and code for the dandihub
https://hub.dandiarchive.org

Avoid "Service unavailable" #159

Closed: yarikoptic closed this issue 3 months ago

yarikoptic commented 5 months ago

There is an intermittent condition where the front page of the hub becomes unavailable for a few minutes, e.g. as reported today by @bendichter or occasionally detected by our uptime monitoring:

[screenshot: uptime check reporting the hub as unavailable]

asmacdo commented 5 months ago

meanwhile -- is there an easy way to provide some default page/something which would instead say "Service unavailable due to X. Please try again in Y minutes. If still unavailable, check if a known issue on https://github.com/con/upptime/issues?q=is%3Aissue+hub+is%3Aopen . If not - file a new one" or alike?

When this flake is happening (IIRC) the hub pod is getting restarted. Typically, for a highly available application, we'd be running 2 hub pods: as soon as hub 1 goes down, hub 2 picks up the traffic. However, since the hub pod is controlled by the jupyterhub helm chart (and a second pod would cost extra), I don't think that's how we want to handle it.

There might be something we can do on the Amazon side; maybe we could set up a health check and a default page during an outage? From a super quick search and gpt, I think it would work like this: the health check fails, R53 changes the routing to a default page, DNS has to propagate, and then the user gets the helpful message. But when the hub comes back up, I guess it would do the same thing in reverse, also with a propagation delay, so it could possibly extend the downtime?
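For concreteness, a rough sketch of what that R53 failover setup might look like with boto3 (the zone ID, record values, and health-check path below are hypothetical placeholders, not our actual config); keeping the TTL low would at least shorten the propagation delay in both directions:

```python
import boto3

route53 = boto3.client("route53")

# All of these values are hypothetical placeholders.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
HUB_DOMAIN = "hub.dandiarchive.org"
HUB_ENDPOINT = "hub-elb.example.amazonaws.com"   # where the hub actually lives
FALLBACK_ENDPOINT = "status-page.example.com"    # static "service unavailable" page

# Health check that probes the hub's front page every 30s;
# 3 consecutive failures mark it unhealthy.
health_check = route53.create_health_check(
    CallerReference="dandihub-front-page-check",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": HUB_DOMAIN,
        "ResourcePath": "/hub/login",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# Failover record pair: PRIMARY serves the hub while healthy,
# SECONDARY serves the default page during an outage.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": HUB_DOMAIN,
                    "Type": "CNAME",
                    "SetIdentifier": "hub-primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,  # low TTL to limit DNS propagation delay
                    "HealthCheckId": health_check["HealthCheck"]["Id"],
                    "ResourceRecords": [{"Value": HUB_ENDPOINT}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": HUB_DOMAIN,
                    "Type": "CNAME",
                    "SetIdentifier": "hub-fallback",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": FALLBACK_ENDPOINT}],
                },
            },
        ]
    },
)
```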

yarikoptic commented 5 months ago

But can someone remind me why this 503 is happening at all (i.e. what stops that pod), or is that by design (non-persistent, and if so, why)?
If we do restart upon hitting a 503, could we partially mitigate it (i.e. shorten the duration and thus potentially make it less likely to be hit) by hitting the service with a "heartbeat", e.g. every second?
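i.e. something as dumb as this sketch (endpoint and interval here just for illustration):

```python
import time
import requests

HUB_URL = "https://hub.dandiarchive.org"

# Dumb heartbeat: hit the hub every second and note anything that isn't a 200.
while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        resp = requests.get(HUB_URL, timeout=5)
        if resp.status_code != 200:
            print(f"{stamp} hub returned {resp.status_code}")
    except requests.RequestException as exc:
        print(f"{stamp} request failed: {exc}")
    time.sleep(1)
```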

asmacdo commented 5 months ago

But can someone remind me why this 503 is happening at all (i.e. what stops that pod), or is that by design (non-persistent, and if so, why)? If we do restart upon hitting a 503,

I haven't looked into it since the problem will "just go away" soon, but I think satra is aware of the reason?

could we partially mitigate it (i.e. shorten the duration and thus potentially make it less likely to be hit) by hitting the service with a "heartbeat", e.g. every second?

I don't understand what you mean. As soon as the pod fails, the k8s deployment restarts the pod. (I'm assuming the 503 is the result of that pod being deleted, not whatever the underlying error is.) So I don't see how hitting the service every second could help.
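If we want to confirm that it really is pod restarts (and how often), the restart counter on the hub pod would show it. A quick sketch with the Python kubernetes client; the namespace and label selector here are my assumptions based on zero-to-jupyterhub chart defaults, not verified against our deployment:

```python
from kubernetes import client, config

# Assumes a local kubeconfig; inside the cluster use config.load_incluster_config().
config.load_kube_config()
v1 = client.CoreV1Api()

# z2jh typically labels the hub pod with component=hub; namespace is a guess.
pods = v1.list_namespaced_pod("jupyterhub", label_selector="component=hub")
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        print(f"{pod.metadata.name}/{status.name}: {status.restart_count} restarts")
```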

yarikoptic commented 5 months ago

I probably just misread your

When this flake is happening (IIRC) the hub pod is getting restarted.

as meaning that probing the service and hitting the 503 is the event which triggers a new pod to start, rather than that the pod was already "getting restarted" before that event.

kabilar commented 3 months ago

Closing, as this issue occurred on our legacy (Ansible) deployment and has not happened with the DoEKS deployment.