dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Deploying dask on YARN in an enterprise setting #2043

Open jcrist opened 6 years ago

jcrist commented 6 years ago

Apologies in advance for the long issue; this is in response to several offline conversations with @mrocklin and @martindurant.

With skein we have a way to start up arbitrary services on YARN clusters. The intent of that project was to be generally useful, but specifically to get dask working nicely on YARN. This is now doable per-user by ssh'ing into the edge node and running things manually (described in more detail below). The next step is figuring out how this might work within a larger enterprise ecosystem.
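For illustration, a rough sketch of starting a service this way with skein's Python API (field names, the context-manager usage, and resource sizes are from memory and only approximate):

# Rough sketch only: launch a dask scheduler as a YARN service via skein.
# Field names follow my recollection of skein's Python API and may differ
# between versions; resource sizes are arbitrary.
import skein

spec = skein.ApplicationSpec(
    name="dask-example",
    services={
        "scheduler": skein.Service(
            resources=skein.Resources(memory="2 GiB", vcores=1),
            script="dask-scheduler --port 8786",
        ),
    },
)

with skein.Client() as client:        # starts/connects to the skein driver
    app_id = client.submit(spec)      # submit the application to YARN
    print("submitted:", app_id)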

This raises a few issues due to the following


Below I enumerate 4 options for deploying dask on YARN, increasing in complexity, and discuss their pros and cons.

1. Manual Remote Kernel

(diagram: dask-enterprise 002)

In this configuration, the user does the following:

This is currently doable using skein and dask. Skein/dask-yarn/whatever could make this process easier on the user by adding support for a reverse-proxy (something like https://github.com/jupyterhub/nbserverproxy) which could handle finding all the dynamic addresses automatically. This would allow users to always tunnel through the same localhost:port combo:

# On the edge node
dask-yarn proxyserver --port 8586  # port unique per user

Pros

Cons

2. Manual Local Kernel

(diagram: dask-enterprise 001)

This is similar to 1., but runs the client process locally instead of remotely.

As with 1, this could be made easier by having the user start a small proxy service on the edge node to handle finding the dynamic addresses, so they can always tunnel through the same localhost:port combo.
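For concreteness, a minimal sketch of the local side in this setup, assuming the edge-node proxy (reached through an SSH tunnel) exposes the scheduler on a fixed local port; the port number and hostname are only illustrative:

# Option 2 sketch: the client runs on the user's machine and reaches the
# remote scheduler through a tunnel to the edge-node proxy, e.g.
#     ssh -N -L 8586:localhost:8586 user@edge-node
# The port and hostname are assumptions for illustration.
from dask.distributed import Client

client = Client("tcp://localhost:8586")   # fixed local port, forwarded by the proxy
print(client.scheduler_info())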

Pros:

Cons

3. Jupyter Gateway

(diagram: dask-enterprise 003)

This would use https://github.com/jupyter-incubator/enterprise_gateway and jupyterhub to deploy remote kernels as a service. I think (but haven't attempted) that adding deployment using skein should be straightforward with their pluggable design.

In this situation, the user would login to jupyterhub, specify what environment they want to deploy, and then click "start notebook". This would start a remote kernel on the cluster, managed by a skein application master. The skein specification that started that kernel could also include other services like dask, which could be scaled up or down by the user easily from inside the notebook using the skein library (already works fine).
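For example, scaling the dask workers from inside the notebook might look roughly like this (the service name and the exact skein method signature are assumptions):

# Rough sketch: from a notebook running as a skein-managed kernel, scale the
# dask worker service up or down. "dask.worker" and the scale() signature
# are assumptions, not verified against a particular skein version.
import skein

app = skein.ApplicationClient.from_current()   # available inside a skein container
app.scale("dask.worker", 8)                    # request 8 worker instances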

Pros:

Cons:

4. Dask Gateway

(diagram: dask-enterprise 004)

In this case users would start a jupyter kernel anywhere (locally, in AE5, wherever), and then contact a dask-specific server on the edge node to spin up a cluster and manage communications for them. This is perhaps the nicest solution, but also the most complicated. It would look something like:
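Purely as a hypothetical sketch, the client-side API such a gateway might expose could look like the following (every name below is invented for illustration):

# Hypothetical only: none of these names are an existing API. The idea is
# that the user's client talks to one long-running gateway server on the
# edge node, which launches the cluster on YARN and proxies all traffic.
from dask.distributed import Client
from dask_gateway import Gateway        # hypothetical module

gateway = Gateway("https://edge-node:8787", auth="kerberos")   # invented signature
cluster = gateway.new_cluster(worker_memory="4 GiB", worker_cores=2)
cluster.scale(10)

client = Client(cluster)                # all comms routed through the gateway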

Pros

Cons:

jcrist commented 6 years ago

In the immediate future, I plan to make the manual methods (1 and 2) easier to do. I think this covers many use cases, and will let people try dask out without requiring IT/corporate support.

jcrist commented 6 years ago

Also cc @hussainsultan.

martindurant commented 6 years ago

You might also want to consider my simplistic approach, which may be useful in some cases: a port is opened on the edge node for each dask scheduler that is launched, and a proxy forwarder is set up (I experimented with https://github.com/ArashPartow/proxy , but SSH I suppose can do this). IT would allow a range of ports to be opened for the dask scheduler service, managed on the edge node directly by the skein daemon or by a very simplistic gateway service. I regard this as the simplest solution, if perhaps not very palatable.
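For concreteness, a minimal sketch of such a per-scheduler forwarder (pure asyncio, no error handling or auth; the upstream address and listen port are placeholders):

# Minimal TCP forwarder sketch: the edge node listens on a fixed, IT-approved
# port and copies bytes to the scheduler's dynamic address inside the cluster.
# Hostnames and ports below are placeholders.
import asyncio

SCHEDULER_ADDR = ("worker-node-17.cluster.internal", 42137)   # dynamic, per application
LISTEN_PORT = 8790                                            # from the approved range

async def pipe(reader, writer):
    try:
        while not reader.at_eof():
            data = await reader.read(2 ** 16)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    upstream_reader, upstream_writer = await asyncio.open_connection(*SCHEDULER_ADDR)
    await asyncio.gather(
        pipe(client_reader, upstream_writer),
        pipe(upstream_reader, client_writer),
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())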

mrocklin commented 6 years ago

For the proxy I was previously assuming that this would be some generic TCP thing. Now it seems like this might be more dask specific. Dask already has some infrastructure to handle things like this. For example we're accustomed to having the scheduler proxy messages between the client and workers without much deserialization (though this could be improved). @pitrou 's work on the comms infrastructure might also provide an opportunity to build some new proxy'ed protocol: http://distributed.readthedocs.io/en/latest/communications.html .

The inproc:// protocol is probably a good example of a protocol that includes both a top-level address and then a secondary address (pid/queue number):

In [1]: from dask.distributed import Client

In [2]: client = Client(processes=False)

In [3]: client.scheduler.address
Out[3]: 'inproc://192.168.50.100/29369/1'

In [4]: client.cluster.workers[0].address
Out[4]: 'inproc://192.168.50.100/29369/2'
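By analogy, a proxied scheme could combine a top-level gateway address with a secondary routing component, something like the following (the gateway:// scheme and the identifiers in it are invented, not an existing comm protocol):

# Invented example: a proxied address with a routing suffix, by analogy with
# inproc://host/pid/n. "gateway://" is not a real distributed comm scheme.
from urllib.parse import urlsplit

addr = "gateway://edge-node.example.com:8786/application_1234_0042/worker-3"
parts = urlsplit(addr)
print(parts.netloc)               # where the client actually connects
print(parts.path.lstrip("/"))     # how the gateway routes to the right process
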
mrocklin commented 6 years ago

For the solution proposed by @martindurant it would be useful to know how palatable this is to current IT groups.

martindurant commented 6 years ago

> it would be useful to know how palatable this is to current IT groups.

indeed, and I would be the first to admit that I don't really have a guess - probably depends on whether there is a maximum of 10 schedulers, or 1000. The dashboard can of course go through a normal HTTP proxy.

Interesting idea to use existing dask comms for the gateway.

mrocklin commented 6 years ago

cc @mariusvniekerk

mrocklin commented 6 years ago

Also cc @ericdill

mariusvniekerk commented 6 years ago

So one thing that might be feasible for some of this is seeing how much we could leverage the YARN proxy. Depending on configuration, that already bakes in a decent chunk of security for us.

This could allow situations like 3 and 4 potentially without needing something heavy on the edge node.

mrocklin commented 6 years ago

Can you say more about this @mariusvniekerk ? Is the Yarn proxy happy to route TCP connections around?

martindurant commented 6 years ago

What is "the YARN proxy"?

jcrist commented 6 years ago

The YARN proxy is a proxy that YARN manages, which applications can tie into to get easy integration with Hadoop security. I investigated using this in the early days of skein, but abandoned it for the following reasons:

mrocklin commented 6 years ago

I spoke with @ericdill a month or so ago and he was saying that having a client not on the YARN cluster isn't actually that useful, because you'll need to be able to see HDFS and all that. This probably means that the dask tcp proxy idea isn't as valuable as I had previously thought.

jcrist commented 6 years ago

It could still be useful if we switch our filesystem operations like glob to run on a worker instead of from the client. Since all actual reads happen on the workers, this would make the proxy complete. Proxying at the notebook level using something like Jupyter enterprise-gateway would be easier though, and IMO a more robust (but less flexible) option.
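A rough sketch of that idea, running the glob in a task on a worker rather than in the client process (the fsspec call, scheduler address, and HDFS path are illustrative assumptions, not how dask handles this today):

# Sketch only: run the metadata operation (glob) on a worker so the client
# process never needs direct HDFS access. The scheduler address, fsspec call,
# and HDFS path are illustrative assumptions.
from dask.distributed import Client
import fsspec

client = Client("tcp://localhost:8586")        # e.g. the proxied scheduler address

def glob_on_worker(pattern):
    fs = fsspec.filesystem("hdfs")             # resolved on the worker, inside the cluster
    return fs.glob(pattern)

paths = client.submit(glob_on_worker, "/data/events/2018-*/*.parquet").result()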

martindurant commented 6 years ago

Interesting point about "remote file-systems" in the comment above. Is this something that filesystems could help prescribe, i.e., operations on an arbitrary file-system proxied via a gateway that we understand well? Or would you want a dask-specific solution (put a list of URLs into a future from a task)?