dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Deploying dask on YARN in an enterprise setting #2043

Open jcrist opened 6 years ago

jcrist commented 6 years ago

Apologies in advance for the long issue; this is in response to several offline conversations with @mrocklin and @martindurant.

With skein we have a way to start up arbitrary services on YARN clusters. The intent of that project was to be generally useful, but specifically to get dask working nicely on YARN. This is now doable per-user by ssh'ing into the edge node and running things manually (described in more detail below). The next step is figuring out how this might work within a larger enterprise ecosystem.
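For illustration, a rough sketch of starting a service this way with skein's Python API (field names, the context-manager usage, and resource sizes are from memory and only approximate):

# Rough sketch only: launch a dask scheduler as a YARN service via skein.
# Field names follow my recollection of skein's Python API and may differ
# between versions; resource sizes are arbitrary.
import skein

spec = skein.ApplicationSpec(
    name="dask-example",
    services={
        "scheduler": skein.Service(
            resources=skein.Resources(memory="2 GiB", vcores=1),
            script="dask-scheduler --port 8786",
        ),
    },
)

with skein.Client() as client:        # starts/connects to the skein driver
    app_id = client.submit(spec)      # submit the application to YARN
    print("submitted:", app_id)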

This raises a few issues due to the following


Below I enumerate 4 options for deploying dask on YARN, increasing in complexity, and discuss their pros and cons.

1. Manual Remote Kernel

(diagram: dask-enterprise 002)

In this configuration, the user does the following:

This is currently doable using skein and dask. Skein/dask-yarn/whatever could make this process easier on the user by adding support for a reverse-proxy (something like https://github.com/jupyterhub/nbserverproxy) which could handle finding all the dynamic addresses automatically. This would allow users to always tunnel through the same localhost:port combo:

# On the edge node
dask-yarn proxyserver --port 8586  # port unique per user

Pros

Cons

2. Manual Local Kernel

(diagram: dask-enterprise 001)

This is similar to 1., but runs the client process locally instead of remotely.

As with 1, this could be made easier by having the user start a small proxy service on the edge node to handle finding the dynamic addresses, so they can always tunnel through the same localhost:port combo.
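For concreteness, a minimal sketch of the local side in this setup, assuming the edge-node proxy (reached through an SSH tunnel) exposes the scheduler on a fixed local port; the port number and hostname are only illustrative:

# Option 2 sketch: the client runs on the user's machine and reaches the
# remote scheduler through a tunnel to the edge-node proxy, e.g.
#     ssh -N -L 8586:localhost:8586 user@edge-node
# The port and hostname are assumptions for illustration.
from dask.distributed import Client

client = Client("tcp://localhost:8586")   # fixed local port, forwarded by the proxy
print(client.scheduler_info())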

Pros:

Cons

3. Jupyter Gateway

(diagram: dask-enterprise 003)

This would use https://github.com/jupyter-incubator/enterprise_gateway and jupyterhub to deploy remote kernels as a service. I think (but haven't attempted) that adding deployment using skein should be straightforward with their pluggable design.

In this situation, the user would login to jupyterhub, specify what environment they want to deploy, and then click "start notebook". This would start a remote kernel on the cluster, managed by a skein application master. The skein specification that started that kernel could also include other services like dask, which could be scaled up or down by the user easily from inside the notebook using the skein library (already works fine).
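For example, scaling the dask workers from inside the notebook might look roughly like this (the service name and the exact skein method signature are assumptions):

# Rough sketch: from a notebook running as a skein-managed kernel, scale the
# dask worker service up or down. "dask.worker" and the scale() signature
# are assumptions, not verified against a particular skein version.
import skein

app = skein.ApplicationClient.from_current()   # available inside a skein container
app.scale("dask.worker", 8)                    # request 8 worker instances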

Pros:

Cons:

4. Dask Gateway

(diagram: dask-enterprise 004)

In this case users would start a jupyter kernel anywhere (locally, in AE5, wherever), and then contact a dask-specific server on the edge node to spin up a cluster and manage communications for them. This is perhaps the nicest solution, but also the most complicated. It would look something like:
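Purely as a hypothetical sketch, the client-side API such a gateway might expose could look like the following (every name below is invented for illustration):

# Hypothetical only: none of these names are an existing API. The idea is
# that the user's client talks to one long-running gateway server on the
# edge node, which launches the cluster on YARN and proxies all traffic.
from dask.distributed import Client
from dask_gateway import Gateway        # hypothetical module

gateway = Gateway("https://edge-node:8787", auth="kerberos")   # invented signature
cluster = gateway.new_cluster(worker_memory="4 GiB", worker_cores=2)
cluster.scale(10)

client = Client(cluster)                # all comms routed through the gateway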

Pros

Cons:

jcrist commented 6 years ago

In the immediate future, I plan to make the manual methods (1 and 2) easier to do. I think this covers many use cases, and will let people try dask out without requiring IT/corporate support.

jcrist commented 6 years ago

Also cc @hussainsultan.

martindurant commented 6 years ago

You might also want to consider my simplistic approach, which may be useful in some cases: a port is opened on the edge node for each dask scheduler that is launched, and a proxy forwarder is set up (I experimented with https://github.com/ArashPartow/proxy , but SSH I suppose can do this). IT would allow a range of ports to be opened for the dask scheduler service, managed on the edge node directly by the skein daemon or by a very simplistic gateway service. I regard this as the simplest solution, if perhaps not very palatable.
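For concreteness, a minimal sketch of such a per-scheduler forwarder (pure asyncio, no error handling or auth; the upstream address and listen port are placeholders):

# Minimal TCP forwarder sketch: the edge node listens on a fixed, IT-approved
# port and copies bytes to the scheduler's dynamic address inside the cluster.
# Hostnames and ports below are placeholders.
import asyncio

SCHEDULER_ADDR = ("worker-node-17.cluster.internal", 42137)   # dynamic, per application
LISTEN_PORT = 8790                                            # from the approved range

async def pipe(reader, writer):
    try:
        while not reader.at_eof():
            data = await reader.read(2 ** 16)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    upstream_reader, upstream_writer = await asyncio.open_connection(*SCHEDULER_ADDR)
    await asyncio.gather(
        pipe(client_reader, upstream_writer),
        pipe(upstream_reader, client_writer),
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())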

mrocklin commented 6 years ago

For the proxy I was previously assuming that this would be some generic TCP thing. Now it seems like this might be more dask specific. Dask already has some infrastructure to handle things like this. For example we're accustomed to having the scheduler proxy messages between the client and workers without much deserialization (though this could be improved). @pitrou 's work on the comms infrastructure might also provide an opportunity to build some new proxy'ed protocol: http://distributed.readthedocs.io/en/latest/communications.html .

The inproc:// protocol is probably a good example of a protocol that includes both a top-level address and then a secondary address (pid/queue number):

In [1]: from dask.distributed import Client

In [2]: client = Client(processes=False)

In [3]: client.scheduler.address
Out[3]: 'inproc://192.168.50.100/29369/1'

In [4]: client.cluster.workers[0].address
Out[4]: 'inproc://192.168.50.100/29369/2'
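By analogy, a proxied scheme could combine a top-level gateway address with a secondary routing component, something like the following (the gateway:// scheme and the identifiers in it are invented, not an existing comm protocol):

# Invented example: a proxied address with a routing suffix, by analogy with
# inproc://host/pid/n. "gateway://" is not a real distributed comm scheme.
from urllib.parse import urlsplit

addr = "gateway://edge-node.example.com:8786/application_1234_0042/worker-3"
parts = urlsplit(addr)
print(parts.netloc)               # where the client actually connects
print(parts.path.lstrip("/"))     # how the gateway routes to the right process
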
mrocklin commented 6 years ago

For the solution proposed by @martindurant it would be useful to know how palatable this is to current IT groups.

martindurant commented 6 years ago

> it would be useful to know how palatable this is to current IT groups.

indeed, and I would be the first to admit that I don't really have a guess - probably depends on whether there is a maximum of 10 schedulers, or 1000. The dashboard can of course go through a normal HTTP proxy.

Interesting idea to use existing dask comms for the gateway.

mrocklin commented 6 years ago

cc @mariusvniekerk

mrocklin commented 6 years ago

Also cc @ericdill

mariusvniekerk commented 6 years ago

So one thing that might be feasible for some of this is seeing how much we could leverage the YARN proxy. Depending on configuration, that already bakes in a decent chunk of security for us.

This could allow situations like 3 and 4 potentially without needing something heavy on the edge node.

mrocklin commented 6 years ago

Can you say more about this @mariusvniekerk ? Is the Yarn proxy happy to route TCP connections around?

martindurant commented 6 years ago

What is "the YARN proxy"?

jcrist commented 6 years ago

The YARN proxy is a proxy that YARN manages, which applications can tie into to get easy integration with Hadoop security. I investigated using this in the early days of skein, but abandoned it for the following reasons:

mrocklin commented 6 years ago

I spoke with @ericdill a month or so ago and he was saying that having a client not on the YARN cluster isn't actually that useful, because you'll need to be able to see HDFS and all that. This probably means that the dask tcp proxy idea isn't as valuable as I had previously thought.

jcrist commented 6 years ago

It could still be useful if we switch our filesystem operations like glob to run on a worker instead of from the client. Since all actual reads happen on the workers, this would make the proxy complete. Proxying at the notebook level using something like Jupyter enterprise-gateway would be easier though, and IMO a more robust (but less flexible) option.
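A rough sketch of that idea, running the glob in a task on a worker rather than in the client process (the fsspec call, scheduler address, and HDFS path are illustrative assumptions, not how dask handles this today):

# Sketch only: run the metadata operation (glob) on a worker so the client
# process never needs direct HDFS access. The scheduler address, fsspec call,
# and HDFS path are illustrative assumptions.
from dask.distributed import Client
import fsspec

client = Client("tcp://localhost:8586")        # e.g. the proxied scheduler address

def glob_on_worker(pattern):
    fs = fsspec.filesystem("hdfs")             # resolved on the worker, inside the cluster
    return fs.glob(pattern)

paths = client.submit(glob_on_worker, "/data/events/2018-*/*.parquet").result()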

martindurant commented 6 years ago

Interesting point about "remote file-systems" in the comment above. Is this something that filesystems could help prescribe, i.e., operations on an arbitrary file-system proxied via a gateway that we understand well? Or would you want a dask-specific solution (put a list of URLs into a future from a task)?