coiled / feedback

A place to provide Coiled feedback

advice for using adaptive #194

Closed dchudz closed 3 months ago

dchudz commented 2 years ago

These are some default settings for adaptive in distributed:

  adaptive:
    interval: 1s         # Interval between scaling evaluations
    target-duration: 5s  # Time an entire graph calculation is desired to take ("1m", "30m")
    minimum: 0           # Minimum number of workers
    maximum: .inf        # Maximum number of workers
    wait-count: 3        # Number of times a worker should be suggested for removal before removing it

For Coiled (which takes minutes to launch instances...), the default interval and target duration are way too short.

Something like this would be better:

  adaptive:
    interval: 10s
    target-duration: 210s
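
To make the effect of target-duration concrete, here is a rough sketch of how the worker target scales with it. It assumes the CPU-based part of the scheduler's estimate is roughly "expected remaining work divided by target duration" and ignores the memory-based and task-count limits the real scheduler also applies. The function name and the workload numbers are hypothetical.

```python
import math


def estimate_worker_target(total_work_s, target_duration_s, threads_per_worker=1):
    """Rough CPU-only estimate of how many workers adaptive would ask for.

    Simplification: threads needed = remaining work / target duration,
    then divided by the number of threads each worker provides.
    """
    threads = math.ceil(total_work_s / target_duration_s)
    return math.ceil(threads / threads_per_worker)


# Hypothetical workload: 70 minutes of single-threaded work, 4 threads per worker.
work_s = 70 * 60
for target_s in (5, 210):
    workers = estimate_worker_target(work_s, target_s, threads_per_worker=4)
    print(f"target-duration={target_s}s -> about {workers} workers")
```

Under these assumptions, a 5s target asks for far more workers than a 210s target for the same workload, and on a platform where instances take minutes to start, much of that extra capacity would arrive too late to help.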

Coiled ought to implement these better defaults in our clusters, but in the meantime a user can set them via:

cluster = coiled.Cluster(..., environ={"DASK_DISTRIBUTED__ADAPTIVE__INTERVAL": "10s", "DASK_DISTRIBUTED__ADAPTIVE__TARGET_DURATION": "210s"})

Note that most users haven't used adaptive with Coiled, so this is a little bit experimental.
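
For reference, Dask maps environment variables onto config keys by stripping the DASK_ prefix, lowercasing, and treating double underscores as nesting. The adaptive block shown earlier sits directly under distributed in the config file, so its keys are distributed.adaptive.interval and distributed.adaptive.target-duration. The sketch below is a simplified stand-in for that convention, not Dask's own code; the function name is hypothetical, and the real implementation also tries to parse values into Python literals.

```python
def env_to_config(environ):
    """Simplified sketch: turn DASK_* environment variables into nested config."""
    config = {}
    for name, value in environ.items():
        if not name.startswith("DASK_"):
            continue  # only DASK_-prefixed variables are considered
        # Double underscores separate nesting levels; names are lowercased.
        keys = name[len("DASK_"):].lower().split("__")
        # Dask treats "_" and "-" in key names as interchangeable;
        # use "-" here to match the YAML spelling.
        keys = [key.replace("_", "-") for key in keys]
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config


settings = env_to_config(
    {
        "DASK_DISTRIBUTED__ADAPTIVE__INTERVAL": "10s",
        "DASK_DISTRIBUTED__ADAPTIVE__TARGET_DURATION": "210s",
    }
)
print(settings)
```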

dchudz commented 2 years ago

@fjetter Those two settings should be enough to make it reasonable for a user to try adaptive, right?

dchudz commented 2 years ago

@hendrikmakait @gjoseph92 I've been told you might know things about the status of work stealing...?

Will adaptive actually work with these improved settings, or is "work stealing" not in a place where work will actually succeed in getting transferred to the new workers?

dchudz commented 2 years ago

Oh, easier than what I said: cluster.adapt(interval="10s", target_duration="210s")

shughes-uk commented 2 years ago

@scharlottej13 maybe one for the docs?

dchudz commented 2 years ago

We should make this a platform default instead, right @shughes-uk ?

fjetter commented 2 years ago

> Will adaptive actually work with these improved settings

It will be better but I doubt it will be a pleasant experience. See also https://github.com/coiled/engineering/issues/23#issuecomment-1228269280

> , or is "work stealing" not in a place where work will actually succeed in getting transferred to the new workers?

There are still many problems around work stealing and adaptive itself

https://github.com/dask/distributed/labels/adaptive

> We should make this a platform default instead, right @shughes-uk ?

Yes, if anything we should set this as default. Especially as long as the following tickets are not addressed, there is not even a convenient way to set any configs (which by itself should have high prio).

ntabris commented 3 months ago

Coiled sets these values when turning on adaptive for a cluster:

target_duration="3m",
wait_count=24,
interval="5s",

If anyone wants to use different values, the way to do this is to pass them to the adapt method on Cluster, for example:

cluster = coiled.Cluster(n_workers=[1, 10])
cluster.adapt(target_duration="1m")
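
As a rough way to compare the interval and wait-count values mentioned in this thread: if a worker has to be suggested for removal on wait-count consecutive evaluations, and evaluations happen once per interval, then the minimum time before an idle worker is removed is about interval times wait-count. The sketch below only does that arithmetic. The helper names are hypothetical, and the duration parser handles just the "s" and "m" suffixes used here.

```python
UNIT_SECONDS = {"s": 1, "m": 60}


def to_seconds(duration):
    """Parse a duration such as "5s" or "3m" (only s and m are handled)."""
    return float(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def min_removal_delay_s(interval, wait_count):
    """Approximate minimum time a worker stays idle before being removed."""
    return to_seconds(interval) * wait_count


# distributed defaults: interval=1s, wait-count=3
print(min_removal_delay_s("1s", 3))
# Coiled values: interval=5s, wait_count=24
print(min_removal_delay_s("5s", 24))
```

Under that assumption, the distributed defaults allow a worker to be removed after about 3 seconds of being idle, while the Coiled values wait about 2 minutes, which is closer to the time it takes to launch a replacement instance.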