I can try to work on it though not sure I know enough yet. Is there a "development" guide that I can use to get up and running?
BTW, as a workaround, I am using the following to "manually" find and delete the pods:
import pathlib

from kubernetes import client as k8sClient, config

DASK_CLUSTER_NAME_LABEL = "dask.org/cluster-name"
DASK_COMPONENT_LABEL = "dask.org/component"


def load_k8s_config():
    """Load config - either from .kube in the home directory (for local clients) or in-cluster (for e.g. CI)."""
    if (pathlib.Path.home() / ".kube" / "config").exists():
        config.load_kube_config()
    else:
        config.load_incluster_config()


def delete_adhoc_cluster(cluster_name: str, namespace: str):
    """Delete the pods of the Dask cluster created by KubeCluster."""
    load_k8s_config()
    v1 = k8sClient.CoreV1Api()

    # Collect every pod in the namespace whose dask.org/cluster-name label matches.
    pods_to_delete = []
    pods_list = v1.list_namespaced_pod(namespace=namespace)
    for pod in pods_list.items:
        labels = pod.metadata.labels or {}  # labels can be None on unlabelled pods
        if labels.get(DASK_CLUSTER_NAME_LABEL) == cluster_name:
            pods_to_delete.append(
                dict(
                    name=pod.metadata.name,
                    component=labels.get(DASK_COMPONENT_LABEL, "n/a"),
                )
            )

    for pod in pods_to_delete:
        print(f"deleting {pod['name']} ({pod['component']})...")
        v1.delete_namespaced_pod(pod["name"], namespace)  # delete in the namespace we searched
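For reference, it is invoked with the cluster name that appears in the dask.org/cluster-name label on the pods (the values below are illustrative):

delete_adhoc_cluster(cluster_name="dask-adhoc-1234", namespace="research")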
Apologies, it looks like I had typed a response here but forgot to press the button. Nice that GitHub saves these things!
Thanks @cagantomer, I am able to reproduce this locally. My guess from looking at the traceback is that the cluster manager is unable to connect to the scheduler after things start closing out. Perhaps it's worth exploring the port forwarding code in case it has been closed.
You might also be interested in trying the operator we've been working on, which is intended to supersede the current KubeCluster implementation. One of the goals of the operator is to handle lifecycle things like this in a more k8s-native way.
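For context, a minimal sketch of what the operator-based workflow looks like from the client side (this assumes the operator controller is already installed in the cluster; the cluster name and worker count are illustrative):

from dask_kubernetes.operator import KubeCluster
from distributed import Client

# Create a Dask cluster backed by a DaskCluster custom resource.
cluster = KubeCluster(name="adhoc-example", n_workers=2)
client = Client(cluster)

# ... run your work ...

client.close()
cluster.close()  # removes the DaskCluster resource, which cleans up the pods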
I saw the operator efforts and I kind of have mixed feelings about it. I am reluctant regarding anything that requires extra installations :-)
I really liked the KubeCluster experience - just having the right permissions to k8s and everything works relatively smoothly...
What are the advantages of using the operator?
I am reluctant regarding anything that requires extra installations :-)
That's totally understandable. I'm optimistic that the couple of extra lines you need to run the first time you use it will be worth it.
just having the right permissions to k8s and everything works relatively smoothly...
We often find that this is rare and folks don't have the right permissions, particularly permission to create pods, as that can be a large security risk.
What are the advantages of using the operator?
I'll try to enumerate some of the goals:
Switching to Custom Resource Definitions and an Operator hugely simplifies the client-side code and provides much better decoupling. This will help maintainability, as the current KubeCluster has grown unwieldy.
In multi-user clusters an admin can install the operator once for the cluster and then users can just happily create Dask clusters.
With the current KubeCluster all state is tied to the instance of the object in Python, which means clean-up or reuse can be unpleasant if the Python process ends unexpectedly. In the new one state lives in k8s and clusters can be created/scaled/deleted in a k8s-native way using kubectl, in addition to the new experimental KubeCluster (see the sketch after this list).
Many folks use Dask clusters in a multi-stage pipeline, where each stage is a separate Python script or notebook. The current KubeCluster cannot persist between stages, but clusters created with the operator can.
We are also in the process of adding a DaskJob CRD which will behave much like a k8s Job resource but with a Dask cluster attached. This should feel familiar to Kubeflow users who use the various training operators it provides.
The decoupled state means we have been able to implement support for dask-ctl, so that clusters can be managed via the CLI/API it provides, along with the new CLI dashboard (once we get that finished).
We currently struggle with supporting clusters that have Istio installed due to the way our pods talk directly to each other. The operator aims to solve that by handling the complex services required in a way that is transparent to the user.
The autoscaling logic is moved from the cluster manager to the operator which will also benefit multi-stage pipeline workflows.
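To make the persistence and kubectl points above concrete, here is a rough sketch under the assumption that the operator is installed (the cluster name is illustrative, and this is not meant as a definitive API reference):

from dask_kubernetes.operator import KubeCluster

# Stage 1 of a pipeline: create a named cluster whose state lives in k8s.
cluster = KubeCluster(name="my-pipeline-cluster", n_workers=2)

# Stage 2, possibly a separate script or notebook: reconnect by name and scale.
cluster = KubeCluster.from_name("my-pipeline-cluster")
cluster.scale(5)

# Because the state is held in custom resources, the same cluster can also be
# inspected or removed outside Python, e.g.:
#   kubectl get daskclusters
#   kubectl delete daskcluster my-pipeline-cluster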
@jacobtomlinson - thanks for the information. Once we are stable with our current implementation, I'll look into trying the operator (hoping DevOps will help with it 😄)
The classic KubeCluster was removed in https://github.com/dask/dask-kubernetes/pull/890. All users will need to migrate to the Dask Operator. Closing.
What happened:
I am using KubeCluster for running an ad-hoc Dask cluster as part of my script. I am trying to gracefully shut down the cluster in case there is an interrupt (SIGINT / Ctrl-C, SIGKILL) with the example below.
When the script runs without interrupt, it correctly cleans up the cluster it created. When I interrupt it with Ctrl-C, the handler catches the exception, but calling client.cluster.close in the finally clause raises a timeout error (see below). The worker and scheduler pods remain active and are not terminated.
What you expected to happen: The pods of the ad-hoc cluster should terminate and there should not be any exception.
Minimal Complete Verifiable Example:
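The original snippet is not reproduced here; the pattern described above roughly corresponds to the following sketch (an illustrative reconstruction using the classic dask_kubernetes API, not the reporter's actual code; the image, namespace and task are placeholders):

import time

from dask_kubernetes import KubeCluster, make_pod_spec
from distributed import Client


def slow_square(x):
    time.sleep(1)
    return x * x


pod_spec = make_pod_spec(image="daskdev/dask:latest")
cluster = KubeCluster(pod_spec, namespace="research")
cluster.scale(2)
client = Client(cluster)

try:
    futures = client.map(slow_square, range(100))
    print("waiting for tasks to finish...")
    client.gather(futures)
except KeyboardInterrupt:
    print("interrupted, cleaning up...")
finally:
    # This is where the reported problem shows up: after Ctrl-C, close()
    # raises a timeout error and the scheduler/worker pods are left running.
    client.close()
    client.cluster.close()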
The output from execution:
Anything else we need to know?:
Press Ctrl-C once "waiting for tasks to finish..." appears on the screen.
Also, not related, but the dashboard_link always shows localhost instead of something more useful (I recall reading a discussion about it - but it was a while ago).
Environment: