dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Spawn Dask scheduler on separate Kubernetes pod #84

Closed chrish42 closed 2 years ago

chrish42 commented 6 years ago

This feature request is a followup to #82. Having at least the option to spawn the Dask scheduler on a remote Kubernetes pod would enable more use cases for Dask-kubernetes, including at least the "run Jupyter on my notebook, but do the Dask computation on a remote cluster" one.

mrocklin commented 6 years ago

I agree that this would be useful. It also seems doable to me, but would require moderate effort.

I'm a bit curious how often people have access to a Kubernetes cluster when they're not in the cluster themselves.

jacobtomlinson commented 6 years ago

This is definitely something we would use in anger.

Running Jupyter on a laptop and then connecting to a cloud based cluster for computation would be really useful.

mrocklin commented 6 years ago

@jacobtomlinson is this something that is important enough to you all to help develop? I imagine that some parts of this, like making adaptive work across a network, I'll probably have to do myself, but I'm curious if there are parts for which I can lean on others. My guess is that the first pass of just putting the scheduler in another pod is something you are at least as capable of doing as I am :)

jacobtomlinson commented 6 years ago

@mrocklin I will have to check my schedule and see if I can dedicate time to this.

chrish42 commented 6 years ago

@mrocklin From my perspective, everyone is 5 minutes (and a credit card) away from having access to a Kubernetes cluster when they outgrow their laptop for computations. Especially if #85 is also implemented, they don't need to install anything else but dask-kubernetes; they just instantiate a dask_kubernetes.Cluster object, passing it the cluster credentials given when they created the cluster in the GCP (or other) console. It's getting to a JupyterHub-like setup on Kubernetes, where people can launch notebooks, etc., that requires more work (and/or money), and that has a potentially smaller audience.

jgerardsimcock commented 6 years ago

We would really like to explore the idea of having a central scheduler that users could log into via Jupyter notebook/terminal and run large batch jobs against. I'd like to understand how to do something like what the financial modeling team describes in the Dask use cases.

What is the best place to start on getting this moving?

  1. putting the scheduler in another pod (I have no idea how to do this)
  2. ?
mturok commented 6 years ago

+1

arpit1997 commented 5 years ago

Is it still open? Can I try this?

@jgerardsimcock if we spawn dask scheduler in another pod and let scheduler manage jobs wouldn't that be enough? Am I missing something here?

@chrish42 Certainly it would have a small audience, but it would really help data scientists to just connect to a cluster and do processing on huge datasets.

rmccorm4 commented 5 years ago

@mrocklin @jacobtomlinson Am I wrong in thinking that you could create and start the KubeCluster client/scheduler on a master node in the Kubernetes cluster, serialize/send the cluster/client object to your laptop, and issue client.submit(func, x) commands from your laptop? (This would just work because of the existing dask.distributed functionality, since the client object would have the IP and port of the remote master you started it on.)

Or is this issue saying a whole new RPC setup needs to be developed, where you start some listening socket on a master in the cluster that will take RPCs to initialize dask-kubernetes cluster/client/scheduler, take RPCs to submit tasks, take RPCs to scale, etc.?

Or neither? I'm just trying to better understand the problem statement here.

mrocklin commented 5 years ago

You could do something like that, yes, but it's a bit reversed of what I suspect will be a common workflow.

The situation we care about is that someone has a laptop and they have a kubernetes cluster. From their laptop they want to ask Kubernetes to create a Dask scheduler and a bunch of Dask workers that talk to the scheduler, and then they create a client locally on their laptop and connect to that scheduler.

Currently, with the way KubeCluster is designed, the scheduler is only created locally, so the user has to be on the Kubernetes cluster network. So instead of something like this:

class KubeCluster:
    def start(...):
        self.scheduler = Scheduler(...)  # actually this happens inside of the `self.cluster = LocalCluster(...)` today

We want something like this:

class KubeCluster:
    def start(...):
        self.scheduler_pod = kubernetes.create_pod(scheduler_pod_specification)
        self.scheduler_connection = connect_to(scheduler_pod)

Or at least we want the option to do that. There will be other cases where we'll want the old behavior as well.
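The contrast between the two designs can be sketched a little more concretely. The class and attribute names below are illustrative placeholders for the discussion above, not the real dask-kubernetes API:

```python
# Illustrative sketch of the two deployment modes discussed above.
# All names here are placeholders, not the real dask-kubernetes API.

class KubeClusterSketch:
    def __init__(self, deploy_mode="local"):
        self.deploy_mode = deploy_mode
        self.scheduler_address = None

    def start(self):
        if self.deploy_mode == "local":
            # Current behaviour: the scheduler runs inside the client
            # process, so the client must already be on the cluster network.
            self.scheduler_address = "tcp://127.0.0.1:8786"
        else:
            # Proposed behaviour: ask Kubernetes for a scheduler pod,
            # expose it via a Service, and connect over the network.
            service_dns_name = "dask-scheduler.default"  # placeholder Service DNS name
            self.scheduler_address = f"tcp://{service_dns_name}:8786"
        return self.scheduler_address
```

In the remote mode the client only needs network access to the scheduler's Service address, which is what makes the laptop-to-cloud workflow possible.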

martindurant commented 5 years ago

Doesn't dask-yarn work with a scheduler-in-a-container?

mrocklin commented 5 years ago

Yes

dhirschfeld commented 5 years ago

I think this refactor would be fairly tightly coupled to whatever results from https://github.com/dask/distributed/issues/2235?

beberg commented 5 years ago

Now that the Helm chart updates are in progress in #128 we have a better handle on the use cases and variables needed to put the scheduler in cluster.

mrocklin commented 5 years ago

We currently ask the user for a worker pod specification template. We should probably expand this to also include a scheduler pod specification template. I agree that expanding everything out as keyword arguments would probably be difficult to maintain.

So I think that all we really need here from an API perspective is to add a scheduler_pod_template= keyword, and then we should be good to go (at least until we actually have to implement things).

SeguinBe commented 5 years ago

Maybe this will be of some help to others: we had the situation where we wanted to have the client on our own machines and just dispatch Dask workers/scheduler to Kubernetes. So I made a tiny library to do it; it is not very well tested, but it works in our university setting.

https://github.com/SeguinBe/dask_k8

jacobtomlinson commented 5 years ago

This has now been implemented in #162 and can be enabled with cluster = KubeCluster(deploy_mode="remote", ...).

This creates the scheduler in a pod on the cluster and also creates a service to enable traffic to be routed to the pod. This is a ClusterIP service by default but can be configured to use a LoadBalancer.

This is currently not the default as it is non-trivial to route traffic to the dashboard using this setup.

Further testing and development would be great!
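For readers trying this out, a minimal worker pod spec of the kind passed to `KubeCluster.from_yaml(...)` might look like the following (image, resource values, and worker arguments are illustrative; check the dask-kubernetes docs for your version):

```yaml
kind: Pod
spec:
  restartPolicy: Never
  containers:
    - name: dask-worker
      image: daskdev/dask:latest
      args: [dask-worker, --nthreads, "2", --memory-limit, 4GB, --death-timeout, "60"]
      resources:
        limits:
          cpu: "2"
          memory: 4G
```

Passing `deploy_mode="remote"` alongside such a spec is what places the scheduler in its own pod rather than in the client process.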

wwoods commented 5 years ago

I may not be following this thread correctly, but running dask_kubernetes.KubeCluster.from_yaml(..., deploy_mode="remote") is still resulting in a hanging client when I run it on my laptop (though the scheduler does appear to run in its own pod). What additional routing/configuration is needed to have a local laptop installation talk to a dask_kubernetes cluster spawned via from_yaml? (dask_kubernetes version 0.10.0)

jacobtomlinson commented 5 years ago

@wwoods the scheduler will be placed behind a service, by default this will be of type LoadBalancer. The client will then attempt to connect to the address of the service.

So if you leave the defaults then your kubernetes cluster must be able to assign a Load Balancer and your client must be able to connect to it. If you directly have access to the internal kubernetes network then I recommend setting the service type to ClusterIP which means the client will connect directly to the kubernetes service, but you must be able to route to it.
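Assuming the service type is exposed through Dask's configuration system (the key name below is an assumption; verify it against the configuration reference for your dask-kubernetes version), it could be set in a config file like this:

```yaml
# ~/.config/dask/kubernetes.yaml
# Key name assumed for illustration; check your version's config reference.
kubernetes:
  scheduler-service-type: ClusterIP
```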

Without knowing more about your kubernetes cluster setup I can't provide more advice.

wwoods commented 5 years ago

@jacobtomlinson Just trying to get this working in a "kind" context (local cluster), from scratch. It looks like the service is configured as ClusterIP... Here's the error:

Timed out trying to connect to 'tcp://dask-waltw-bfc87757-2.default:8786' after 10 s: [Errno -2] Name or service not known

And here's the output of kubectl get service:

dask-waltw-bfc87757-2 ClusterIP 10.111.76.46 <none> 8786/TCP,8787/TCP 69s

It looks like the service.namespace which dask-kubernetes is trying to connect to is correct, so perhaps kind hasn't configured the local DNS correctly? Not knowing more about Kubernetes, it's hard to say whether this is because I'm misusing deploy_mode="remote". From the original use case though, it seems like running a Python script locally and doing the Dask computation remotely is exactly the use case.

jacobtomlinson commented 5 years ago

It looks like your local machine can't resolve Kubernetes DNS names in your local cluster. If you were using a managed cluster with a load balancer you would be provided with an address you could resolve.

You will need to configure things correctly to enable this. What local implementation are you using? (docker desktop, k3s, minikube, etc)

wwoods commented 5 years ago

@jacobtomlinson right - I haven't used kubernetes much, but this particular bit of documentation is proving rather hard to find. I know it's not exactly a dask-kubernetes problem, but I do appreciate you pointing me in the right direction. It's somewhat necessary information to take dask-kubernetes for a test drive.

I was using kind (https://github.com/kubernetes-sigs/kind), but I'm not attached. Also tried minikube since it explicitly mentioned DNS, but no luck.

jacobtomlinson commented 5 years ago

I'm afraid I don't have experience with kind. I personally use docker desktop on my Mac for local development. We use minikube for CI and testing.

wwoods commented 5 years ago

Any idea how you'd get it working with minikube?

jacobtomlinson commented 5 years ago

If you have a look at the ci config you can see how it is done there.

detroyejr commented 4 years ago

My Kubernetes cluster is provisioned with KOPS which doesn't create a load balancer in AWS by default. I was looking for a good solution yesterday and discovered kubefwd which will add Kubernetes services to your hosts file so that urls like dask-waltw-bfc87757-2:8787 are accessible locally. My limited testing shows that it works, but maybe isn't as stable as a load balancer would be. Hope this helps someone.
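As a lighter-weight alternative to kubefwd, kubectl's built-in port forwarding can bridge a single service to the local machine; the service name below is the one from the kubectl get service output earlier in the thread:

```
kubectl port-forward service/dask-waltw-bfc87757-2 8786:8786 8787:8787
```

The client can then connect to tcp://localhost:8786 (and the dashboard on localhost:8787) for as long as the forwarding process stays running.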

epizut commented 4 years ago

Hi,

I am in the same situation: I am trying to run a remote scheduler and workers on a managed Kubernetes cluster from my laptop over the internet.

I believe managed Kubernetes users tend to be more data science people than k8s admins, so maybe the LoadBalancer solution could be documented in a bit more detail?

Also, dask-gateway seems overkill for a single user.

What do you think?

jacobtomlinson commented 4 years ago

That sounds to me like a version mismatch in Dask between the scheduler, workers or client. I'm not sure it is really related to this issue. Could you check all your Dask versions are up to date and if you're still having trouble raise a new issue.

I also agree that dask-gateway is overkill for a single user. That's not really its target audience. We have dask-kubernetes and the Dask helm-chart for single users, and dask-gateway (and soon the new dask-hub helm chart) for teams.

epizut commented 4 years ago

You are right, it's working for me now.

For the record I had to face:

Thanks again, Jacob.

jacobtomlinson commented 2 years ago

I'm going to close this out as this is now the default behaviour.