dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/
BSD 3-Clause "New" or "Revised" License

High availability story? #148

Closed yuvipanda closed 4 years ago

yuvipanda commented 5 years ago

Hello!

I'm curious how the various components (dask-gateway-server, scheduler-proxy & web-proxy) work in a highly available configuration. More specifically, I want to change the 'replicas' values of all the deployment objects in the helm chart to 3, so I get:

  1. Resiliency against node failure
  2. Use of the default rolling update features of Kubernetes

I want to be able to do deploys anytime without worrying too much about whether they'll affect user experience, and without worrying too much about node failures either.

This is one of the big problems right now with JupyterHub - the proxy is going to be HA soon (tm), but the hub itself has a while to go. Would love to know what your thinking on this is!

jcrist commented 5 years ago

They don't really work in an HA configuration. A lot of the internal design is blatantly copied from JupyterHub, so we inherit JupyterHub's problems as well.

When database persistence is enabled, the gateway server can be killed and restarted without losing any work - the proxies stay up and the gateway server reloads its state from the database. While the gateway server is down, all existing clusters still work - users can connect to clusters and view their dashboards. They just won't be able to add/remove workers or start/stop clusters until the gateway server is back up.

Since dask-gateway is designed to be run in many different environments (not just Kubernetes), whatever HA story is developed must not degrade performance in non-HA environments and must not be Kubernetes-specific.

I can see a way to make the proxies support multiple load balanced instances, but the gateway server itself contains a lot of tasks and state (background tasks tracking cluster/worker state, etc...) that I'm not sure how to handle in a multi-instance configuration. A few ideas:

I'm not familiar with common design patterns in this space, your experience and input here would be much appreciated.

mrocklin commented 5 years ago

@jcrist if we were to change dask-gateway to handle all state through the database (both writing and reading), would we then be able to achieve HA, assuming that the database was highly available?

yuvipanda commented 5 years ago

Yup, it needs to be independent of kubernetes for sure! I was only using these as examples.

Pinging @minrk who has probably thought about this a lot!

jcrist commented 5 years ago

That's not the issue; the issue is the runtime tasks. It's not just a stateless webserver where nothing happens if no requests come in. There are a few background tasks that would need to be divided among servers - for example, there are periodic process liveness checks that you wouldn't want to run multiple times. If a worker dies, we'd need to ensure only one server cleaned up its resources. A Kubernetes-specific solution would be easier, but this isn't a Kubernetes-only tool and it's important that HA work in all backends.
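
One generic pattern for dividing such background tasks across instances is a database-backed lease: each instance periodically tries to claim a task, and only the winner runs it that round. A minimal sketch (not dask-gateway code; table, column, and task names are hypothetical, and sqlite is used only to keep the sketch self-contained - in practice this would be the shared gateway database):

import sqlite3
import time
import uuid

INSTANCE_ID = uuid.uuid4().hex
LEASE_SECONDS = 30

def setup(conn):
    # One row per coordinated task; expires_at = 0 means "unclaimed".
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_leases "
        "(name TEXT PRIMARY KEY, owner TEXT, expires_at REAL)"
    )
    conn.execute("INSERT OR IGNORE INTO task_leases VALUES ('liveness-check', '', 0)")
    conn.commit()

def try_acquire_lease(conn):
    now = time.time()
    # The UPDATE only matches if the previous lease has expired, and it is
    # atomic, so exactly one instance wins even if several race for it.
    cur = conn.execute(
        "UPDATE task_leases SET owner = ?, expires_at = ? "
        "WHERE name = 'liveness-check' AND expires_at < ?",
        (INSTANCE_ID, now + LEASE_SECONDS, now),
    )
    conn.commit()
    return cur.rowcount == 1

if __name__ == "__main__":
    conn = sqlite3.connect("leases.sqlite")
    setup(conn)
    if try_acquire_lease(conn):
        print("this instance runs the liveness check this round")
    else:
        print("another instance holds the lease; skip this round")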

minrk commented 5 years ago

If it's like JupyterHub, the HA story is tough because there's quite a bit of in-memory state that's reconstructible from the database, but reconstructing it takes a nontrivial amount of time (see issues about slow JupyterHub startup - that startup time is basically all the work required to rebuild in-memory state from the database).

The HA design discussed in JupyterHub is to make it more of a stateless server - each request gets a new db connection, makes queries, etc. The challenge here for JupyterHub is the fairly complex 'pending' state locks that prevent certain duplicate / conflicting actions in a way that's easy to reason about with coroutines ("Once we start, make sure to set this flag before the first await..."). We would need to think much more carefully about a deterministic/retry-based/eventually-consistent design like you get from Kubernetes to resolve conflicts. There would also be a pretty high performance cost (I think) for the average request, since there is a hot database cache with optimizations based specifically on the assumption that we are the only database writer.
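
For illustration, the "flag before the first await" pattern looks roughly like this (illustrative names only, not actual JupyterHub or gateway code). It's easy to reason about within one process, but it provides no protection once there are multiple server instances:

import asyncio

pending_stops = set()

async def do_stop(cluster_id):
    await asyncio.sleep(0.1)  # stand-in for the real teardown work

async def stop_cluster(cluster_id):
    if cluster_id in pending_stops:
        return  # a stop is already in flight in this process; don't duplicate it
    # Set the flag *before* the first await, so no other coroutine in this
    # process can slip in a conflicting action in between.
    pending_stops.add(cluster_id)
    try:
        await do_stop(cluster_id)
    finally:
        pending_stops.discard(cluster_id)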

We could build a failover-type model, where there is a "hot spare" ready to go that waits to load state from the database until it receives a signal. This wouldn't reduce transition downtime to zero, but it would exclude the latency associated with starting a process (pod scheduling, etc.), and it would be a much simpler modification, since it only changes process startup and doesn't touch runtime logic at all (it could even be a container-level startup hook without modifying the application).
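
A very rough sketch of what that could look like as a startup wrapper (the promotion marker and its path are hypothetical - the point is just that the spare is already scheduled and only pays the cost of loading state once promoted):

import os
import subprocess
import time

PROMOTE_MARKER = "/var/run/gateway-promote"  # hypothetical signal from a failover controller

def wait_for_promotion(poll_interval=1.0):
    # The hot spare idles here, already scheduled and with its image pulled.
    while not os.path.exists(PROMOTE_MARKER):
        time.sleep(poll_interval)

if __name__ == "__main__":
    wait_for_promotion()
    # Once promoted, start the real server; it reconstructs its state from
    # the database on startup (flags/config omitted here).
    subprocess.run(["dask-gateway-server"])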

minrk commented 5 years ago

I'll add that tools like BinderHub, deployed on top of JupyterHub, are able to be HA themselves by tolerating the downtime of a Hub restart. Making API requests as easily retry-able as possible makes it easier for client applications to behave as if JupyterHub is HA by resubmitting requests on failure. This works for BinderHub because the JupyterHub Hub service doesn't serve any human-facing pages for it (which would see real downtime during a restart) - it only uses the API behind the scenes. So while JupyterHub can't be HA yet, BinderHub can, because:

So if Gateway is principally accessed via an API client, I'd encourage the first task to be making the client tolerant of the kinds of errors associated with a restarting Gateway. This is mainly retrying/recovering from lost connections with exponential backoff, plus any changes necessary in the API to make this easier to do. For instance, one difficult pattern to retry is a request to create a new resource whose id is assigned during the request and only available in the response. In general, when a request fails to get a response, it's possible for that request to have either succeeded or failed, and this should be easy for the client to handle. For deterministic resource creation, the easiest way to resolve this is if duplicate creations are identified as 409 conflict errors (common in Kubernetes, and used in BinderHub).

So if you have deterministic transactions, it looks like:

for i in range(retries):
    try:
        perform_action()
    except HTTPError as exc:
        if exc.code == 409 and i > 0:
            # already happened on an earlier attempt, treat as success
            break
        if exc.code < 500:
            raise
        # 500s and other server-side errors are worth retrying
        log.warning("Retrying...")
        exponential_wait(i)
    except ConnectionError:
        log.warning("Retrying...")
        exponential_wait(i)
    else:
        # success
        break
else:
    raise RetriesExceeded()

If you can't do that, you need to add a did_it_really_succeed() call before attempts after the first, but that's doable, too, as long as such a request makes sense.
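
That variant would look something like this (same placeholder names as the example above; did_it_really_succeed() would typically be a GET for the resource the lost request was supposed to create):

for i in range(retries):
    if i > 0 and did_it_really_succeed():
        # the earlier attempt went through even though we never saw its response
        break
    try:
        perform_action()
    except ConnectionError:  # or 500 or certain other retryable errors
        log.warning("Retrying...")
        exponential_wait(i)
        continue
    else:
        # success
        break
else:
    raise RetriesExceeded()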

If you do that, for all practical purposes for your users, Gateway is HA, even though the HA is actually implemented in the client.

Refs:

jcrist commented 5 years ago

Thanks for the background Min, this is really helpful when weighing design decisions.

To step back a bit, there are a few reasons we may want a multi-server backend:

There may be more reasons. The downside is a more complicated implementation.

For now, we have the same HA story as JupyterHub. The gateway server can go down, and as long as the proxies stay up users can still use their dask clusters. They just won't be able to start/stop/scale any of them. When the gateway server comes back up, things will resume where they left off. This may be good enough, or we may want an improved story here - I'm not sure. Until we get more users I think it's too early to say. This is a bit tricky, because now would be the easiest time to move to an HA architecture, but we also don't yet know whether one is necessary.

Performance wise I'm not sure how many active clusters/workers we can handle at a time. A few bottlenecks that might occur:

For the immediate future I plan to make things more resilient by implementing a retry-with-backoff strategy on internal API calls. For further design decisions I think I'll need to figure out a way to benchmark things first.

(Note that I have mostly no experience with any of these things, and am learning as I go. If you know things and have better suggestions I'd love to hear them)
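
For concreteness, a retry-with-backoff wrapper around an internal API call could look roughly like this (a sketch, not the actual implementation; it assumes tornado's AsyncHTTPClient, mentioned below, and illustrative parameter choices):

import asyncio
import random

from tornado.httpclient import AsyncHTTPClient, HTTPError

async def fetch_with_retries(url, retries=5, base_delay=0.1):
    client = AsyncHTTPClient()
    for i in range(retries):
        try:
            return await client.fetch(url)
        except HTTPError as exc:
            # only retry server-side failures (5xx, or tornado's 599 timeout)
            if exc.code < 500 or i == retries - 1:
                raise
        except ConnectionError:
            if i == retries - 1:
                raise
        # exponential backoff with jitter, so a burst of failures doesn't
        # turn into a synchronized burst of retries
        await asyncio.sleep(base_delay * (2 ** i) * (1 + random.random()))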

jcrist commented 5 years ago

For further design decisions I think I'll need to figure out a way to benchmark things first.

Some rudimentary benchmarking shows that we can handle 100 concurrent cluster requests, each with 25 workers (on a mock cluster backend with jittered delays for everything, trying to simulate kubernetes without spending money). Note that this measures concurrent cluster requests - the number of users trying to create clusters at the same time - not users working with already-created clusters. It also only benchmarks the gateway server itself. The benchmark is likely optimistic/wrong, but it was still useful. I'll need to add instrumentation to get more meaningful numbers.

It did point out that we likely need to add rate limiting somewhere, as server-internal client requests start to time out around this point. We could add a semaphore around concurrent worker requests, or something lower-level at the AsyncHTTPClient level.
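
The semaphore idea is simple enough to sketch (names are illustrative, not dask-gateway internals): cap how many worker requests are in flight at once and let the rest queue:

import asyncio

class WorkerRequestLimiter:
    """Cap the number of concurrent worker requests; extras wait their turn."""

    def __init__(self, max_concurrent=20):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def submit(self, request_one_worker):
        # request_one_worker is a coroutine function that performs a single
        # worker request against the backend
        async with self._sem:
            return await request_one_worker()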

yuvipanda commented 5 years ago

From a pure kubernetes perspective, I want to try playing with https://metacontroller.app/ at some point. You then don't need state in your application - you can relegate that to Kubernetes. Not sure if an architecture like that will work with traditional HPC systems though.

jcrist commented 4 years ago

I'm working through a rearchitecture plan in #186. I'm going to close this issue to consolidate conversation there.