dchudz closed 3 months ago
@fjetter Those two settings should be enough to make it reasonable for a user to try adaptive, right?
@hendrikmakait @gjoseph92 I've been told you might know things about the status of work stealing...?
Will adaptive actually work with these improved settings, or is "work stealing" not in a place where work will actually succeed in getting transferred to the new workers?
Oh, easier than what I said: cluster.adapt(interval="10s", target_duration="210s")
@scharlottej13 maybe one for the docs?
We should make this a platform default instead, right @shughes-uk ?
Will adaptive actually work with these improved settings
It will be better but I doubt it will be a pleasant experience. See also https://github.com/coiled/engineering/issues/23#issuecomment-1228269280
, or is "work stealing" not in a place where work will actually succeed in getting transferred to the new workers?
There are still many problems around work stealing and adaptive itself
https://github.com/dask/distributed/labels/adaptive
We should make this a platform default instead, right @shughes-uk ?
Yes, if anything we should set this as the default. Especially as long as the following tickets are not addressed, there is not even a convenient way to set any configs (which by itself should have high prio)
Coiled sets these values when turning on adaptive for a cluster:
target_duration="3m",
wait_count=24,
interval="5s",
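A rough way to read these three values together: as I understand distributed's adaptive logic, a worker has to be suggested for removal on wait_count consecutive checks before it is retired, and checks run once per interval. The sketch below only does that arithmetic. The helper names (parse_seconds, min_scale_down_delay) are made up for illustration and are not part of Coiled or distributed.

```python
import re

# Hypothetical helper: parse simple duration strings such as "5s" or "3m".
# This is a tiny stand-in for illustration, not dask's own parser.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_seconds(text: str) -> float:
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def min_scale_down_delay(interval: str, wait_count: int) -> float:
    """Lower bound, in seconds, before an idle worker can be removed.

    Assumes removal requires `wait_count` consecutive recommendations,
    one per adaptive check, with checks spaced `interval` apart.
    """
    return parse_seconds(interval) * wait_count


# Values quoted above for Coiled: interval="5s", wait_count=24
print(min_scale_down_delay("5s", 24))  # 120.0 seconds, i.e. two minutes
```

Under that assumption, a worker needs to look removable for about two minutes before Coiled lets it go, which damps scale-down flapping.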
If anyone wants to use different values, the way to do this is by calling the adapt method on Cluster, for example:
cluster = coiled.Cluster(n_workers=[1, 10])
cluster.adapt(target_duration="1m")
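To get a feel for what changing target_duration does, here is a minimal sketch of the core sizing idea as I understand it: the scheduler aims for roughly enough cores to finish the currently known work within the target duration. The real logic in distributed also caps this by the number of runnable tasks and considers memory, so treat the function below (desired_cores, a hypothetical name) as a simplification and the one-hour backlog as a made-up number.

```python
import math


def desired_cores(total_occupancy_s: float, target_duration_s: float) -> int:
    """Simplified estimate: cores needed to finish the backlog in time.

    total_occupancy_s: estimated seconds of work currently known (assumed input).
    target_duration_s: the adaptive target duration, in seconds.
    """
    return math.ceil(total_occupancy_s / target_duration_s)


backlog = 3600.0  # hypothetical: one core-hour of pending work

# A target of a few seconds asks for far more cores than a target of minutes.
for label, target in [("5s", 5.0), ("1m", 60.0), ("3m", 180.0), ("210s", 210.0)]:
    print(label, desired_cores(backlog, target))
```

With the same backlog, a 5s target asks for 720 cores while a 3m target asks for 20, which is why short targets over-request when instances take minutes to arrive.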
These are some default settings for adaptive in distributed:
For Coiled (which launches instances in minutes...) these target durations are way too short.
Something like this would be better:
Coiled ought to implement these better defaults in our clusters, but in the meantime a user can set them via:
cluster = coiled.Cluster(..., environ={"DASK_DISTRIBUTED__SCHEDULER__ADAPTIVE__INTERVAL": "10s", "DASK_DISTRIBUTED__SCHEDULER__ADAPTIVE__TARGET_DURATION": "210s"})
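For anyone unsure how those variable names relate to Dask config keys, here is a small sketch of the naming convention as documented for Dask: the DASK_ prefix is dropped, the name is lowercased, and double underscores mark nesting. Dask treats single underscores and hyphens as interchangeable; the sketch simply writes hyphens. The function name env_var_to_config_key is hypothetical, and this is not Dask's own implementation.

```python
def env_var_to_config_key(name: str) -> str:
    """Map a DASK_* environment variable name to a dotted config key.

    Illustrative only: mirrors the documented convention
    (strip "DASK_", lowercase, "__" becomes "."), and renders the
    remaining single underscores as hyphens.
    """
    prefix = "DASK_"
    if not name.startswith(prefix):
        raise ValueError(f"not a Dask config variable: {name!r}")
    dotted = name[len(prefix):].lower().replace("__", ".")
    return dotted.replace("_", "-")


for var in (
    "DASK_DISTRIBUTED__SCHEDULER__ADAPTIVE__INTERVAL",
    "DASK_DISTRIBUTED__SCHEDULER__ADAPTIVE__TARGET_DURATION",
):
    print(var, "->", env_var_to_config_key(var))
```

This only shows the naming convention. It does not check that the resulting key is one that distributed actually reads, so it is worth comparing the printed keys against the configuration reference for your installed version.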
Note that most users haven't used adaptive with Coiled, so this is a little bit experimental.