dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

KubeCluster with service type LoadBalancer on AKS stuck in "Pending" state on startup #906

Closed catzc closed 3 weeks ago

catzc commented 3 weeks ago

Describe the issue: I'm trying to create a KubeCluster with AKS LoadBalancer, but the DaskCluster is stuck in a pending state at startup.

Minimal Complete Verifiable Example:

Trying to create a cluster like this:

from dask_kubernetes.operator import KubeCluster, make_cluster_spec


def make_sim_cluster(
        name: str,
        cpu_cores: int,
        memory_in_gb: int,
        image: str = '<image>',
        tag: str = 'x.x.x',
        namespace: str = 'namespace',
):
    base_cluster_spec = make_cluster_spec(
        name=name,
        image=f'{image}:{tag}',
        n_workers=cpu_cores,
        scheduler_service_type="LoadBalancer",
    )
    # Request an internal Azure load balancer via the service annotation
    base_cluster_spec['metadata']['annotations'] = {
        "service.beta.kubernetes.io/azure-load-balancer-internal": "true",
    }

    return KubeCluster(custom_cluster_spec=base_cluster_spec, namespace=namespace)

but the DaskCluster is stuck in the Pending state until it times out (screenshots omitted).

Anything else we need to know?:

I looked at our service spec, and when the scheduler service comes up the resource definition puts our load balancer under the key "loadBalancer" (as documented in https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer), while dask-kubernetes looks for "load_balancer" in the status: https://github.com/dask/dask-kubernetes/blob/2ecfdcd41b0e09665d4ff98cb40528218fea10fe/dask_kubernetes/operator/controller/controller.py#L403-L406
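The mismatch can be shown with a plain dict shaped like the Service status Kubernetes actually returns (the IP below is made up for illustration):

```python
# The Service status Kubernetes returns uses camelCase keys:
service_status = {
    "loadBalancer": {
        "ingress": [{"ip": "10.240.0.4"}],
    }
}

# A snake_case lookup like the controller's misses the key entirely,
assert "load_balancer" not in service_status
# while the camelCase key is the one actually present.
assert "loadBalancer" in service_status
```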

For what it's worth, the kr8s library also seems to look for "load_balancer".

When I switched these over to "loadBalancer", the DaskCluster was able to come up without any issues. In fact, even without the code change, the service, pods, deployments, etc. were all set up and in a running state; only the DaskCluster was pending.

Any help is great, thanks!

Environment:

jacobtomlinson commented 3 weeks ago

Thanks for raising this. Internally kr8s uses box to make accessing nested data structures more pleasant and pythonic. At one point early on we were using the camel killer box to make keys like loadBalancer more pythonic by converting them to snake case like load_balancer.

>>> import box
>>> data_in = {"loadBalancer": True}
>>> data = box.Box(data_in, camel_killer_box=True)

You can access the loadBalancer key using either .loadBalancer or .load_balancer. Which is very pleasant.

>>> data.load_balancer
True
>>> data.loadBalancer
True

Unfortunately this operation in box is destructive and modifies the underlying data structure to store data in the snake case form, which means that when converting back to a dictionary you don't get the same structure you put in.

>>> data.to_dict()
{'load_balancer': True}

Given that kr8s needs to be able to update data back to the Kubernetes API we can't be mangling the data structures like this. So at some point we dropped the camel killer box.

It looks like this bit of code where we look up the load balancer info is still using the snake case and needs updating to camel case.
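As a minimal sketch (mirroring the intent of the fix rather than the controller's exact code, with a made-up helper name and IP), the lookup just needs to use the camelCase key the API returns:

```python
# Hypothetical helper: read the load-balancer ingress list from a raw
# Service status dict using the camelCase "loadBalancer" key that
# Kubernetes actually returns, not the snake_case "load_balancer".
def get_lb_ingress(service_status: dict) -> list:
    return service_status.get("loadBalancer", {}).get("ingress", [])

status = {"loadBalancer": {"ingress": [{"ip": "10.240.0.4"}]}}
assert get_lb_ingress(status) == [{"ip": "10.240.0.4"}]
```

Working on the raw dict also round-trips cleanly when writing back to the Kubernetes API, which was the reason for dropping the camel killer box in the first place.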

jacobtomlinson commented 3 weeks ago

I've opened up #907 to resolve this here. @blankhold could you confirm that these are the changes you made that fixed things for you? It's a little hard to test this fix as we can't provision loadbalancers easily in our test suite.

catzc commented 3 weeks ago

Awesome, thanks for the explanation and PR @jacobtomlinson!

Seems like there are a few more patches I had to make. I pulled in the changes from your PR and also the changes you recently made in kr8s https://github.com/kr8s-org/kr8s/commit/665ff53e95c1808e80fb038e3f9ed2e7858275de.

Some other things:

  1. Startup still gets stuck in _wait_for_controller.

    • I got around this by having it asyncio.sleep(5) and then immediately return, skipping the check. (For some reason the status in the DaskCluster is never updated to phase: "Running".) Doing an immediate return gives us a BoxKeyError.
  2. lb.hostname also gives a BoxKeyError.

    • I removed that entirely, but it seems like we might need default_box_none_transform for Box, or to refactor that section.

With those changes above, I was able to get a daskcluster with LoadBalancer service and access the dashboard externally
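The _wait_for_controller hang in item 1 is, conceptually, a poll loop like the following (a hypothetical sketch, not the actual dask-kubernetes code): if the controller never patches .status.phase to "Running", the loop simply spins until the deadline.

```python
import asyncio
import time

# Hypothetical sketch of a wait-for-phase loop: poll a status-fetching
# coroutine until the controller reports the desired phase, or time out.
async def wait_for_phase(get_status, desired="Running",
                         timeout=30.0, interval=0.5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = await get_status()
        # If the controller never updates .status.phase (the symptom
        # described in item 1 above), this check never succeeds.
        if status.get("phase") == desired:
            return True
        await asyncio.sleep(interval)
    return False
```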

jacobtomlinson commented 3 weeks ago

Ok great! Perhaps it's easier if you put up a PR that supersedes mine that makes the changes you need and we can discuss specifics there.

catzc commented 3 weeks ago

Thanks! Created https://github.com/dask/dask-kubernetes/pull/908, which I think should fix the issues. Also, do you know when kr8s will make a new release so we can pick up your recent changes in a new version of dask-kubernetes? Or how will the release go so that we can finally pick up these changes once merged?

catzc commented 3 weeks ago

Didn't realize how to test the controller before, so my comment about getting stuck in _wait_for_controller might not be applicable; it might just be fixed when rebuilding the controller to test. Will let you know if that worked.

jacobtomlinson commented 3 weeks ago

I've just released kr8s version 0.17.4 and dask-kubernetes version 2024.9.0 with these fixes in.