coiled / feedback

A place to provide Coiled feedback

Coiled cluster with adaptive scaling : trying to match Nebari demo #285

Closed rsignell closed 1 week ago

rsignell commented 1 week ago

I have a standard live demo I usually run on Nebari (JupyterHub with Dask Gateway on Kubernetes): I start a cluster with a small number of workers, load a time series with a lot of chunks from object storage, and show in the Dask dashboard how the computation starts off slowly, then speeds up dramatically after about a minute as the cluster scales from 2 to 20 workers (with 1 CPU each).

I wanted to do the same demo with Coiled, so I fired up a cluster like this:

if cluster_type == 'Coiled':
    import coiled

    cluster = coiled.Cluster(
        region="us-west-2",
        compute_purchase_option="spot_with_fallback",
        arm=True,
        n_workers=1,
        software="pangeo_arm",
        worker_vm_types=['c7g.large'],   # ARM instance with 2 vCPU, 4GB RAM
        workspace='esip-lab'
    )

    client = cluster.get_client()

    # Scale the cluster
    cluster.adapt(minimum=1, maximum=10)

But on Coiled, when the cluster spun up (again after about a minute), I didn't get 9 more workers; I got only 1 more:

Screenshot 2024-06-17 105750

I'm guessing that if my workflow took longer, Coiled would eventually scale up to the maximum of 10 workers. But is there anything I can do to tell Coiled to spin up more of the remaining workers when it scales up?

In case it's useful, this was cluster: esip-lab-9a779b10

ntabris commented 1 week ago

Took a quick look at the staff-only metrics and I can see that adaptive target never went above 2:

image

https://cloud.coiled.io/clusters/498966/account/esip-lab/information?tab=Metrics&sinceMs=1718636161395&untilMs=1718636311395

Coiled is relying on adaptive target. We don't have any special logic to let you say things like "go above the adaptive target by X%" or "scale up by N workers at a time".

Side note/FYI: you can now use cluster = coiled.Cluster(n_workers=[1, 10], ...) to set adaptive range when making the cluster.

rsignell commented 1 week ago

Nat, thanks but I don't quite understand why it didn't scale beyond 2. When you say "Coiled is relying on adaptive target", isn't Dask Gateway using the same reliance?

mrocklin commented 1 week ago

Adaptive has various parameters; one of them is "scale up the computation to finish in X time" for some value of X. Typically we set X to be some factor greater than the worker spinup time. If it takes 5s to spin up a worker, well then let's spin up enough to finish the job in 20s. That way, most of the time you have a worker, it's doing something.

Because Coiled has a ~1 minute spinup time, we set X to be 5 minutes, I think.

If this is what's going on, then there are two possibilities to make your demo run more smoothly:

  1. Give Dask more work
  2. Change the value of X (actually named target_duration)

https://docs.dask.org/en/stable/adaptive.html#scaling-heuristics
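
To make the arithmetic concrete, here is a rough sketch of the CPU side of that heuristic. It is a simplification, not Dask's actual implementation: it ignores the memory-based rule and the scheduler's own bookkeeping, the function name `estimated_worker_target` is made up for illustration, and the 1200 seconds of queued work is an invented number chosen only to show the effect of `target_duration`.

```python
import math


def estimated_worker_target(queued_work_seconds, target_duration_seconds,
                            n_ready_tasks, threads_per_worker=1,
                            minimum=1, maximum=10):
    """Roughly how many workers adaptive scaling would ask for.

    Hypothetical helper, not part of Dask or Coiled.
    """
    # Threads needed to finish the queued work within the target duration.
    threads = math.ceil(queued_work_seconds / target_duration_seconds)
    # More threads than runnable tasks would just sit idle.
    threads = min(threads, n_ready_tasks)
    # Convert threads to whole workers, then clamp to the adaptive range.
    workers = math.ceil(threads / threads_per_worker)
    return max(minimum, min(maximum, workers))


# Invented workload: 1200 s of estimated task time spread over 1000 chunks,
# assuming 2 threads per worker (one per vCPU).
slow = estimated_worker_target(1200, 300, 1000, threads_per_worker=2)  # 5 min target
fast = estimated_worker_target(1200, 30, 1000, threads_per_worker=2)   # 30 s target
print(slow, fast)
```

With these made-up numbers the sketch asks for 2 workers at a 5-minute target and reaches the maximum of 10 at a 30-second target. Giving Dask more work raises the numerator; lowering `target_duration` shrinks the denominator. Either one pushes the target up.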

rsignell commented 1 week ago

I gave this a shot with cluster esip-lab-ba29b751, but I didn't get more than 2 workers, even after more than 3 minutes. Perhaps I didn't specify the env vars correctly? I thought it was correct because I followed this post by @dchudz: https://github.com/coiled/feedback/issues/194#issue-1354483846

Screenshot 2024-06-17 163039

Screenshot 2024-06-17 162855

mrocklin commented 1 week ago

You should also be able to call cluster.adapt(..., target_duration="30s")

How long does the computation run when you have only two workers?

rsignell commented 1 week ago

Woohoo @mrocklin ! That worked (and is much cleaner):

Screenshot 2024-06-17 164543

Screenshot 2024-06-17 164450
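
For anyone landing here later: the exact code behind those screenshots isn't shown in the thread, but based on the suggestion above it was presumably something like the sketch below. The wrapper name `enable_fast_adaptive` and its defaults are hypothetical; the only call it relies on is `cluster.adapt(minimum=..., maximum=..., target_duration=...)`.

```python
def enable_fast_adaptive(cluster, minimum=1, maximum=10, target_duration="30s"):
    """Turn on adaptive scaling with a short target duration.

    Hypothetical wrapper: `cluster` is any cluster object with an
    `adapt` method, for example the coiled.Cluster created earlier.
    """
    # A shorter target_duration makes adaptive ask for more workers
    # for the same amount of queued work.
    return cluster.adapt(
        minimum=minimum,
        maximum=maximum,
        target_duration=target_duration,
    )


# Usage, assuming `cluster` already exists (not run here):
# enable_fast_adaptive(cluster)
```

A 30-second target is shorter than the roughly 1-minute worker spinup time, so this setting suits a short demo better than a long-running job, where workers may be requested that arrive after the work is done.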

mrocklin commented 1 week ago

Oh cool, you can even see the transfer to each machine of some piece of data when it comes online

mrocklin commented 1 week ago

@rsignell if you want to extend that demo with Coiled dashboard stuff you might appreciate the Infrastructure view

https://cloud.coiled.io/clusters/499166/account/esip-lab/infrastructure?scopes=%7B%22type%22%3A%22account%22%2C%22id%22%3A6979%2C%22name%22%3A%22esip-lab%22%2C%22organizationId%22%3A7549%2C%22slug%22%3A%22esip-lab%22%7D

Screenshot 2024-06-17 at 5 33 56 PM

mrocklin commented 1 week ago

It gives a good understanding of what's going on behind the scenes.