dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

High availability story? #148

Closed yuvipanda closed 4 years ago

yuvipanda commented 5 years ago

Hello!

I'm curious how the various components (dask-gateway-server, scheduler-proxy & web-proxy) work in a highly available configuration. More specifically, I want to change the 'replicas' values of all the deployment objects in the helm chart to 3, so I get:

  1. Resiliency against node failure
  2. Use of the default rolling update features of Kubernetes

I want to be able to do deploys anytime without worrying too much about whether they'll affect user experience, and without worrying too much about node failures either.

This is one of the big problems right now with JupyterHub - the proxy is going to be HA soon (tm), but the hub itself has a while to go. Would love to know what your thinking on this is!

jcrist commented 5 years ago

They don't really work in an HA configuration. A lot of the internal design is blatantly copied from JupyterHub, so we inherit JupyterHub's problems as well.

When database persistence is enabled, the gateway server can be killed and restarted without losing any work - the proxies stay up and the gateway server reloads its state from the database. While the gateway server is down, all existing clusters still work - users can connect to clusters and view their dashboards. They just won't be able to add/remove workers or start/stop clusters until the gateway server is back up.

Since dask-gateway is designed to be run in many different environments (not just Kubernetes), whatever HA story is developed must not degrade performance in non-HA environments and must not be Kubernetes-specific.

I can see a way to make the proxies support multiple load balanced instances, but the gateway server itself contains a lot of tasks and state (background tasks tracking cluster/worker state, etc...) that I'm not sure how to handle in a multi-instance configuration. A few ideas:

I'm not familiar with common design patterns in this space, your experience and input here would be much appreciated.

mrocklin commented 5 years ago

@jcrist if we were to change dask-gateway to handle all state through the database (both writing and reading), would we then be able to achieve HA, assuming that the database was highly available?

yuvipanda commented 5 years ago

Yup, it needs to be independent of kubernetes for sure! I was only using these as examples.

Pinging @minrk who has probably thought about this a lot!

jcrist commented 5 years ago

That's not the issue; the issue is the runtime tasks. It's not just a stateless webserver where nothing happens if no requests come in. There are a few background tasks that would need to be divided among servers - for example, there are periodic process liveness checks that you wouldn't want to run multiple times. If a worker dies, we'd need to ensure only one server cleaned up its resources. A Kubernetes-specific solution would be easier, but this isn't a Kubernetes-only tool and it's important that HA work in all backends.
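
One generic pattern for dividing such background tasks across instances is a database-backed lease: each instance periodically tries to claim a task, and only the winner runs it that round. A minimal sketch (not dask-gateway code; table, column, and task names are hypothetical, and sqlite is used only to keep the sketch self-contained - in practice this would be the shared gateway database):

import sqlite3
import time
import uuid

INSTANCE_ID = uuid.uuid4().hex
LEASE_SECONDS = 30

def setup(conn):
    # One row per coordinated task; expires_at = 0 means "unclaimed".
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_leases "
        "(name TEXT PRIMARY KEY, owner TEXT, expires_at REAL)"
    )
    conn.execute("INSERT OR IGNORE INTO task_leases VALUES ('liveness-check', '', 0)")
    conn.commit()

def try_acquire_lease(conn):
    now = time.time()
    # The UPDATE only matches if the previous lease has expired, and it is
    # atomic, so exactly one instance wins even if several race for it.
    cur = conn.execute(
        "UPDATE task_leases SET owner = ?, expires_at = ? "
        "WHERE name = 'liveness-check' AND expires_at < ?",
        (INSTANCE_ID, now + LEASE_SECONDS, now),
    )
    conn.commit()
    return cur.rowcount == 1

if __name__ == "__main__":
    conn = sqlite3.connect("leases.sqlite")
    setup(conn)
    if try_acquire_lease(conn):
        print("this instance runs the liveness check this round")
    else:
        print("another instance holds the lease; skip this round")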

minrk commented 5 years ago

If it's like JupyterHub, the HA story is tough because there's quite a bit of in-memory state that's reconstructible from the database, but reconstructing it takes a nontrivial amount of time (see issues about slow JupyterHub startup - that startup time is basically all the work required to rebuild in-memory state from the database).

The HA design discussed in JupyterHub is to make it more of a stateless server - each request gets a new db connection, makes queries, etc. The challenge here for JupyterHub is the fairly complex 'pending' state locks that prevent certain duplicate / conflicting actions in a way that's easy to reason about with coroutines ("Once we start, make sure to set this flag before the first await..."). We would need to think much more carefully about a deterministic/retry-based/eventually-consistent design like you get from Kubernetes to resolve conflicts. There would also be a pretty high performance cost (I think) for the average request, since there is a hot database cache with optimizations based specifically on the assumption that we are the only database writer.
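
For illustration, the "flag before the first await" pattern looks roughly like this (illustrative names only, not actual JupyterHub or gateway code). It's easy to reason about within one process, but it provides no protection once there are multiple server instances:

import asyncio

pending_stops = set()

async def do_stop(cluster_id):
    await asyncio.sleep(0.1)  # stand-in for the real teardown work

async def stop_cluster(cluster_id):
    if cluster_id in pending_stops:
        return  # a stop is already in flight in this process; don't duplicate it
    # Set the flag *before* the first await, so no other coroutine in this
    # process can slip in a conflicting action in between.
    pending_stops.add(cluster_id)
    try:
        await do_stop(cluster_id)
    finally:
        pending_stops.discard(cluster_id)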

We could build a failover-type model, where there is a "hot spare" ready to go that waits to load state from the database until it receives a signal. This wouldn't reduce transition downtime to zero, but it would exclude the latency associated with starting a process (pod scheduling, etc.), and it would be a much simpler modification, since it only changes process startup and doesn't touch runtime logic at all (it could even be a container-level startup hook without modifying the application).
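
A very rough sketch of what that could look like as a startup wrapper (the promotion marker and its path are hypothetical - the point is just that the spare is already scheduled and only pays the cost of loading state once promoted):

import os
import subprocess
import time

PROMOTE_MARKER = "/var/run/gateway-promote"  # hypothetical signal from a failover controller

def wait_for_promotion(poll_interval=1.0):
    # The hot spare idles here, already scheduled and with its image pulled.
    while not os.path.exists(PROMOTE_MARKER):
        time.sleep(poll_interval)

if __name__ == "__main__":
    wait_for_promotion()
    # Once promoted, start the real server; it reconstructs its state from
    # the database on startup (flags/config omitted here).
    subprocess.run(["dask-gateway-server"])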

minrk commented 5 years ago

I'll add that tools like BinderHub, deployed on top of JupyterHub, are able to be HA themselves by tolerating the downtime of a Hub restart. Making API requests as easily retry-able as possible makes it easier for client applications to behave as if JupyterHub is HA by resubmitting requests on failure. This works for BinderHub because the JupyterHub Hub service doesn't serve any human-facing pages for it (which would see real downtime during a restart) - it only uses the API behind the scenes. So while JupyterHub can't be HA yet, BinderHub can, because:

So if Gateway is principally accessed via an API client, I'd encourage the first task to be making the client tolerant of the kinds of errors associated with a restarting Gateway. This is mainly retrying/recovering from lost connections with exponential backoff, plus any changes necessary in the API to make this easier to do. For instance, one difficult pattern to retry is a request to create a new resource whose id is assigned during the request and only available in the response. In general, when a request fails to get a response, it's possible for that request to have either succeeded or failed, and this should be easy for the client to handle. For deterministic resource creation, the easiest way to resolve this is if duplicate creations are identified as 409 conflict errors (common in Kubernetes, and used in BinderHub).

So if you have deterministic transactions, it looks like:

for i in range(retries):
    try:
        perform_action()
    except HTTPError as exc:
        if exc.code == 409 and i > 0:
            # already happened on an earlier attempt, treat as success
            break
        if exc.code < 500:
            raise
        # 500s and other server-side errors are worth retrying
        log.warning("Retrying...")
        exponential_wait(i)
    except ConnectionError:
        log.warning("Retrying...")
        exponential_wait(i)
    else:
        # success
        break
else:
    raise RetriesExceeded()

If you can't do that, you need to add a did_it_really_succeed() call before attempts after the first, but that's doable, too, as long as such a request makes sense.
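
That variant would look something like this (same placeholder names as the example above; did_it_really_succeed() would typically be a GET for the resource the lost request was supposed to create):

for i in range(retries):
    if i > 0 and did_it_really_succeed():
        # the earlier attempt went through even though we never saw its response
        break
    try:
        perform_action()
    except ConnectionError:  # or 500 or certain other retryable errors
        log.warning("Retrying...")
        exponential_wait(i)
        continue
    else:
        # success
        break
else:
    raise RetriesExceeded()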

If you do that, for all practical purposes for your users, Gateway is HA, even though the HA is actually implemented in the client.

Refs:

jcrist commented 5 years ago

Thanks for the background Min, this is really helpful when weighing design decisions.

To step back a bit, there are a few reasons we may want a multi-server backend:

There may be more reasons. The downside is a more complicated implementation.

For now, we have the same HA story as JupyterHub. The gateway server can go down, and as long as the proxies stay up users can still use their dask clusters. They just won't be able to start/stop/scale any of them. When the gateway server comes back up, things will resume where they left off. This may be good enough, or we may want an improved story here - I'm not sure. Until we get more users I think it's too early to say. This is a bit tricky, because now would be the easiest time to move to an HA architecture, but we also don't yet know whether one is necessary.

Performance wise I'm not sure how many active clusters/workers we can handle at a time. A few bottlenecks that might occur:

For the immediate future I plan to make things more resilient by implementing a retry-with-backoff strategy on internal API calls. For further design decisions I think I'll need to figure out a way to benchmark things first.

(Note that I have mostly no experience with any of these things, and am learning as I go. If you know things and have better suggestions I'd love to hear them)
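
For concreteness, a retry-with-backoff wrapper around an internal API call could look roughly like this (a sketch, not the actual implementation; it assumes tornado's AsyncHTTPClient, mentioned below, and illustrative parameter choices):

import asyncio
import random

from tornado.httpclient import AsyncHTTPClient, HTTPError

async def fetch_with_retries(url, retries=5, base_delay=0.1):
    client = AsyncHTTPClient()
    for i in range(retries):
        try:
            return await client.fetch(url)
        except HTTPError as exc:
            # only retry server-side failures (5xx, or tornado's 599 timeout)
            if exc.code < 500 or i == retries - 1:
                raise
        except ConnectionError:
            if i == retries - 1:
                raise
        # exponential backoff with jitter, so a burst of failures doesn't
        # turn into a synchronized burst of retries
        await asyncio.sleep(base_delay * (2 ** i) * (1 + random.random()))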

jcrist commented 5 years ago

For further design decisions I think I'll need to figure out a way to benchmark things first.

Some rudimentary benchmarking shows that we can handle 100 concurrent cluster requests, each with 25 workers (on a mock cluster backend with jittered delays for everything, trying to simulate kubernetes without spending money). Note that this measures concurrent cluster requests - the number of users trying to create clusters at the same time - not users working with already-created clusters. It also only benchmarks the gateway server itself. The benchmark is likely optimistic/wrong, but it was still useful. I'll need to add instrumentation to get more meaningful numbers.

It did point out that we likely need to add rate limiting somewhere, as server-internal client requests start to time out around this point. We could add a semaphore around concurrent worker requests, or something lower-level at the AsyncHTTPClient level.
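
The semaphore idea is simple enough to sketch (names are illustrative, not dask-gateway internals): cap how many worker requests are in flight at once and let the rest queue:

import asyncio

class WorkerRequestLimiter:
    """Cap the number of concurrent worker requests; extras wait their turn."""

    def __init__(self, max_concurrent=20):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def submit(self, request_one_worker):
        # request_one_worker is a coroutine function that performs a single
        # worker request against the backend
        async with self._sem:
            return await request_one_worker()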

yuvipanda commented 5 years ago

From a pure kubernetes perspective, I want to try playing with https://metacontroller.app/ at some point. You then don't need state in your application - you can relegate that to Kubernetes. Not sure if an architecture like that will work with traditional HPC systems though.

jcrist commented 4 years ago

I'm working through a rearchitecture plan in #186. I'm going to close this issue to consolidate conversation there.