dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Adaptive goes into an endless loop when used on SpecCluster that starts several worker processes by jobs #7019

Open guillaumeeb opened 2 years ago

guillaumeeb commented 2 years ago

Hi there, we ran into an issue in dask-jobqueue that is explained here: https://github.com/dask/dask-jobqueue/issues/498.

To sum up: when we use adaptive mode with a Cluster object that starts several worker processes in each job, and with the minimum number of workers set to 0, then once tasks are submitted adaptive goes into an endless loop of starting and stopping jobs before the workers ever connect to the Scheduler.

I think this bug is also present in other SpecCluster implementations that allow "grouped" workers in one ProcessInterface.

This can be reproduced (without any job queuing system, but with dask-jobqueue) with the following snippet:

import time

from dask import delayed
from dask.distributed import Client, progress
# Use dask-jobqueue's LocalCluster, not the one from dask.distributed
from dask_jobqueue.local import LocalCluster

@delayed
def job(x):
    time.sleep(1)
    return x+1

cluster = LocalCluster(
        cores=2,
        processes=2,
        name='multi-worker',
        memory="2GiB",
        walltime='1:00:00'
        )
client = Client(cluster)

# Small interval and wait_count to simulate some queuing system startup time
cluster.adapt(maximum_jobs=6, interval='100ms', wait_count=1)

njobs = 1000
outputs = []
for i in range(njobs):
    output = job(i)
    outputs.append(output)

results = client.persist(outputs)
print("Running test...")
progress(results)

I've narrowed the problem to two places:

For a simple fix, we could modify the code in one or both places. We could remove the rule that sets the target to 1 when there are no workers (but it is probably there for a reason), or we could avoid stopping workers that have not connected yet as long as no workers have arrived. These are just suggestions; I'm curious to hear any other proposals you might have.
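
To make the loop easier to picture, here is a toy model of it. This is only a sketch under stated assumptions and not the actual code of Adaptive or SpecCluster: the function `simulate`, its parameters, and the "up"/"down"/"hold" labels are hypothetical, the target is assumed to stay at 1 worker while tasks are pending and no worker is connected, and workers are assumed never to connect within the simulated window.

```python
def simulate(processes_per_job, protect_unconnected=False, ticks=6):
    """Toy model of the adaptive loop (hypothetical, not distributed's code).

    Returns the decision taken at each adaptive interval:
    "up" (submit a job), "down" (stop a job) or "hold" (do nothing).
    """
    jobs = 0  # jobs submitted whose workers have not connected yet
    history = []
    for _ in range(ticks):
        connected = 0  # assumption: no worker connects during the window
        target = 1     # assumption: tasks pending and no workers gives a target of 1
        planned = jobs * processes_per_job  # workers the cluster expects to get

        if planned < target:
            # Not enough planned workers: submit one more job.
            jobs += 1
            history.append("up")
        elif planned > target and not (protect_unconnected and connected == 0):
            # Too many planned workers: grouped workers can only be
            # removed by stopping the whole job.
            jobs -= 1
            history.append("down")
        else:
            history.append("hold")
    return history
```

In this model, `simulate(2)` alternates between "up" and "down" forever, because one job brings 2 planned workers for a target of 1. `simulate(1)` settles after the first job, and `simulate(2, protect_unconnected=True)` mimics the second suggestion above: the job is kept until at least one worker has arrived.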

Environment:

guillaumeeb commented 1 year ago

Hi there, any updates on this issue? Would it be easier for maintainers if I tried to propose something and we discuss the changes?

I hoped to have some feedback first, but I understand this is not of high priority and people here have many things to do, so I'm happy to give it a try in the coming weeks.

ricardobarroslourenco commented 1 week ago

Hello, any updates on this issue (@guillaumeeb did you receive any answer on this)?