rsignell closed this issue 1 week ago.
Took a quick look at the staff-only metrics and I can see that adaptive target never went above 2:
Coiled is relying on adaptive target. We don't have any special logic to let you say things like "go above the adaptive target by X%" or "scale up by N workers at a time".
Side note/FYI: you can now use `cluster = coiled.Cluster(n_workers=[1, 10], ...)` to set the adaptive range when creating the cluster.
Nat, thanks, but I don't quite understand why it didn't scale beyond 2. When you say "Coiled is relying on adaptive target", isn't Dask Gateway relying on the adaptive target in the same way?
Adaptive has various parameters, one of them is "scale up the computation to finish in X time" for some value of X. Typically we set X to be some factor greater than the worker spinup time. If it takes 5s to spin up a worker, well then let's spin up enough to finish the job in 20s. That way most of the time you have a worker it's doing something.
Because Coiled has a ~1 minute spinup time, we set X to 5 minutes, I think.
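That heuristic can be sketched in plain Python. This is an illustrative simplification, not Coiled's or Dask's actual implementation; the function name and the 10-minute workload below are made up for the example:

```python
import math

def adaptive_target(total_work_seconds: float, target_duration: float,
                    minimum: int = 1, maximum: int = 10) -> int:
    """Rough sketch of the adaptive heuristic: request enough workers
    to finish the outstanding work within target_duration, clamped to
    the [minimum, maximum] range."""
    if total_work_seconds <= 0:
        return minimum
    workers = math.ceil(total_work_seconds / target_duration)
    return max(minimum, min(maximum, workers))

# With ~10 minutes of queued work and a 5-minute target_duration,
# the target is only 2 workers -- consistent with the metrics above.
print(adaptive_target(600, 300))  # -> 2

# Shrinking target_duration to 30s asks for far more workers,
# capped here by the maximum of 10.
print(adaptive_target(600, 30))   # -> 10
```

This is why a short computation can finish before the cluster ever asks for more than a couple of workers: the target itself stays small.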
If this is what's going on, then there are two possibilities to make your demo run more smoothly:
- `target_duration=` (see the scaling heuristics docs: https://docs.dask.org/en/stable/adaptive.html#scaling-heuristics)
I gave this a shot with cluster esip-lab-ba29b751, but I didn't get more than 2 workers, even after more than 3 minutes. Perhaps I didn't specify the env vars correctly? I thought it was correct because I followed this post by @dchudz: https://github.com/coiled/feedback/issues/194#issue-1354483846
You should also be able to call `cluster.adapt(..., target_duration="30s")`
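Spelled out, that might look like the sketch below. The minimum/maximum values are illustrative, and it assumes Coiled forwards `target_duration` through to Dask's adaptive logic, as suggested above; this is a configuration sketch, not verified code:

```python
import coiled

# Illustrative adaptive range; pick values matching your workload.
cluster = coiled.Cluster(n_workers=[2, 10])

# Ask adaptive scaling to aim to finish outstanding work in ~30s,
# which raises the worker target far more aggressively than the
# default (reported upthread as ~5 minutes).
cluster.adapt(minimum=2, maximum=10, target_duration="30s")

client = cluster.get_client()
```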
How long does the computation run when you have only two workers?
Woohoo @mrocklin ! That worked (and is much cleaner):
Oh cool, you can even see the transfer to each machine of some piece of data when it comes online
@rsignell if you want to extend that demo with Coiled dashboard stuff you might appreciate the Infrastructure view
It gives a good understanding of what's going on behind the scenes.
I have a standard live demo I usually run on Nebari (JupyterHub with Dask Gateway on Kubernetes): I start a cluster with a small number of workers, load a time series with many chunks from object storage, and show in the Dask Dashboard how the computation starts off slowly, then speeds up dramatically after a minute as the cluster scales from 2 to 20 workers (1 CPU each).
I wanted to do the same demo with Coiled, so I fired up a cluster like this:
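(The original snippet isn't preserved in this thread; a hypothetical reconstruction of such a setup is below. The dataset path, variable name, and worker options are placeholders, and the `n_workers=[1, 10]` range is inferred from the "maximum of 10 workers" mentioned later.)

```python
import coiled
import xarray as xr

# Hypothetical reconstruction -- placeholder values throughout.
cluster = coiled.Cluster(
    n_workers=[1, 10],  # adaptive range, inferred from the thread
    worker_cpu=1,       # matches the 1-CPU workers in the Nebari demo
)
client = cluster.get_client()

# Load a time series with many small chunks from object storage,
# then run a reduction so the dashboard shows lots of tasks.
ds = xr.open_dataset("s3://example-bucket/timeseries.zarr", engine="zarr")
ds["var"].mean(dim="time").compute()
```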
But on Coiled, when the cluster spun up (again after about a minute), I didn't get 9 more workers; I got only 1 more worker:
![Screenshot 2024-06-17 105750](https://github.com/coiled/feedback/assets/125569335/9a2591da-5370-44dc-8a52-48f73b212186)
I'm guessing that if my workflow took longer, Coiled would eventually scale up to the maximum of 10 workers. But is there anything I can do to tell Coiled to spin up more of the remaining workers when it scales up?
In case it's useful, this was cluster: esip-lab-9a779b10