dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Handling workers with expiring allocation requests #122

Closed: wgustafson closed this issue 3 years ago

wgustafson commented 5 years ago

I am trying to figure out how to handle the case of dask workers getting bumped from a cluster due to their requested allocation time expiring. From the intro YouTube video at https://www.youtube.com/watch?v=FXsgmwpRExM, it sounds like dask-jobqueue should detect when a worker expires and automatically start a replacement, which is what I want. However, my testing on DOE's edison computer at NERSC is not getting that behavior. If it matters, edison uses SLURM.

I have tried setting up my cluster two ways and both behave the same. I start a script that uses dask.delayed to do a bunch of embarrassingly parallel tasks, the scheduler spawns one worker, that worker does the first task or two, the worker expires, the scheduler seems to hang, and nothing else happens.

The first approach I used to set up the cluster was with "scale":

    cluster = SLURMCluster(cores=1, processes=1)  # need all the memory for one task
    cluster.scale(1)  # testing with as simple as I can get, cycling 1 worker
    client = Client(cluster, timeout='45s')

@josephhardinee suggested a 2nd approach using "adapt" instead:

    cluster = SLURMCluster(cores=1, processes=1)  # need all the memory for one task
    cluster.adapt(minimum=1, maximum=1)  # trying adapt instead of scale
    client = Client(cluster, timeout='45s')

The dask-worker.err log concludes with:

slurmstepd: error: *** JOB 10234215 ON nid01242 CANCELLED AT 2018-08-10T13:25:30 DUE TO TIME LIMIT ***
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.4.227:35634'
distributed.dask_worker - INFO - Exiting on signal 15
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>

Am I expecting more from dask-jobqueue than I should? Or, is this a bug in my implementation, or in dask.distributed or dask-jobqueue?

Thanks, Bill

guillaumeeb commented 5 years ago

Thanks for this really interesting test! I haven't encountered this yet, as I am currently only using dask for interactive analysis, and the computations I launch usually have enough workers to finish in a few minutes at most.

However, I believe that, at least with an adaptive cluster, workers should be restarted for long-running processes when the allocation time expires. I will certainly need this in the near future. Currently, this is clearly not the case for a non-adaptive cluster: I cannot see anything in the code that would restart workers when a job ends. For adaptive, I don't know the dask distributed code base well enough yet, but the answer is probably there. Hopefully @mrocklin can answer this more easily.

I will try to reproduce your use case on my system next week.

In any case, if this is not a bug in adaptive or in our use of it, I think we should propose a solution here for having workers automatically restarted. Any advice @jhamman @mrocklin @lesteve?

mrocklin commented 5 years ago

Am I expecting more from dask-jobqueue than I should?

What you're describing seems like it's in scope for dask-jobqueue, but it's not currently implemented. This could be resolved by recording a desired number of workers and re-scaling whenever a worker goes away, which could probably be handled in the remove_worker method of the JobQueuePlugin(SchedulerPlugin) class.
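As a rough illustration only (not the actual dask-jobqueue implementation), a scheduler plugin along these lines could record a target count and re-scale when workers drop; the class name and wiring here are hypothetical:

    from distributed.diagnostics.plugin import SchedulerPlugin

    class RescalingPlugin(SchedulerPlugin):
        """Hypothetical sketch: keep the cluster at a recorded worker count."""

        def __init__(self, cluster, target_workers):
            self.cluster = cluster
            self.target_workers = target_workers  # the recorded desired count

        def remove_worker(self, scheduler=None, worker=None, **kwargs):
            # Called by the scheduler whenever a worker goes away, e.g. when
            # its job hits the walltime; ask the cluster for replacements.
            if len(scheduler.workers) < self.target_workers:
                self.cluster.scale(self.target_workers)

    # cluster.scheduler.add_plugin(RescalingPlugin(cluster, target_workers=4))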

guillaumeeb commented 5 years ago

but it's not currently implemented

So I take it adaptive clusters in distributed don't handle this either? Shouldn't they?

which could probably be handled by the remove_worker method of the JobQueuePlugin(SchedulerPlugin)

I had that in mind too, I'd be happy to work on it.

mrocklin commented 5 years ago

The adaptive class checks the scheduler state frequently, so I would expect it to handle this case decently well.

mrocklin commented 5 years ago

I had that in mind too, I'd be happy to work on it.

Hooray!

guillaumeeb commented 5 years ago

So I've tried this with and without adaptive. As expected, without adaptive, once the worker is killed, nothing happens anymore.

With adaptive scaling, in my setup, a worker is restarted every time the previous one dies. Cool, that's what I want.

However, I'm still running into a difficult issue. Here is the code I use for some background:

cluster.adapt(minimum=1, maximum=1)
client = Client(cluster)
import time
def my_slow_task(i):
    time.sleep(10)
    return i**2
results = client.gather(client.map(my_slow_task, range(10)))

The problem is that the worker keeps all results in its memory to send them all at once, but it gets killed halfway through every time. So after several tries we end up with the following exception:

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-6-69e1e4e9d452> in <module>()
      1 futures = client.map(my_slow_task, range(10))
----> 2 results = client.gather(futures)
      3 results

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1560             return self.sync(self._gather, futures, errors=errors,
   1561                              direct=direct, local_worker=local_worker,
-> 1562                              asynchronous=asynchronous)
   1563 
   1564     @gen.coroutine

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    650             return future
    651         else:
--> 652             return sync(self.loop, func, *args, **kwargs)
    653 
    654     def __repr__(self):

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    273             e.wait(10)
    274     if error[0]:
--> 275         six.reraise(*error[0])
    276     else:
    277         return result[0]

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/distributed/utils.py in f()
    258             yield gen.moment
    259             thread_state.asynchronous = True
--> 260             result[0] = yield make_coro()
    261         except Exception as exc:
    262             error[0] = sys.exc_info()

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/asyncio/futures.py in result(self)
    292             self._tb_logger = None
    293         if self._exception is not None:
--> 294             raise self._exception
    295         return self._result
    296 

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/tornado/gen.py in run(self)
   1139                     if exc_info is not None:
   1140                         try:
-> 1141                             yielded = self.gen.throw(*exc_info)
   1142                         finally:
   1143                             # Break up a reference to itself

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1437                             six.reraise(type(exception),
   1438                                         exception,
-> 1439                                         traceback)
   1440                     if errors == 'skip':
   1441                         bad_keys.add(key)

/work/ADM/hpc/eynardbg/Outils/miniconda3/envs/dask_dev/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ('my_slow_task-f7d3b870b0218db300f66ed70fe3fba2', 'tcp://10.135.36.29:44953')

mrocklin commented 5 years ago

Ideally you would have a few workers that died at different times. In this way a "herd" of workers could keep the data alive, rather than having them all die at the same time. You might also consider gathering results as they complete with as_completed.
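For reference, gathering with as_completed looks roughly like this, reusing the client and my_slow_task from the snippet above:

    from dask.distributed import as_completed

    # Collect each result as soon as its future finishes, instead of waiting
    # for a single final gather that needs every result alive at once.
    results = []
    for future in as_completed(client.map(my_slow_task, range(10))):
        results.append(future.result())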

josephhardinee commented 5 years ago

Out of curiosity, how does that work? I assume when a new worker needs to be started, a new jobscript has to be submitted. I assume that would need to be handled by jobqueue and not distributed?

guillaumeeb commented 5 years ago

I don't know exactly how the adaptive cluster is implemented in distributed, but the idea is that something watches the dask cluster load and calls the scale method accordingly, for example if there are a lot more pending tasks than available workers.

Considering the issue here:

adaptive is already doing the first identified need: ensuring there is always a minimum number of workers alive. Is it still worth implementing that as a dask-jobqueue default?

Second thing is that users should not submit graphs that cannot finish within the job allocated time and the requested number of workers. Submitting several graphs should be fine though (so doing batches if needed). Is this correct? Should we add this in the docs somehow?

mrocklin commented 5 years ago

FWIW the adaptive implementation is only 300 lines of code. It might be a good evening read if anyone is interested:

https://github.com/dask/distributed/blob/master/distributed/deploy/adaptive.py

adaptive is already doing the first identified need: ensuring there is always a minimum number of workers alive. Is it still worth implementing that as a dask-jobqueue default?

I don't know. @jhamman might have thoughts. Alternatively maybe there is some way to change Adaptive to maintain a set target rather than move that target around.

Second thing is that users should not submit graphs that cannot finish within the job allocated time and the requested number of workers. Submitting several graphs should be fine though (so doing batches if needed). Is this correct? Should we add this in the docs somehow?

It should be possible to run a single large graph with an evolving herd of workers, even if the workers don't live very long.

jhamman commented 5 years ago

Alternatively maybe there is some way to change Adaptive to maintain a set target rather than move that target around.

I think this is necessary to do this correctly. Ideally we could have a Cluster.target_workers property. This property would be updated by both Cluster.scale() and Adaptive. My recent attempts to improve the adaptive scaling in jobqueue in #97 have highlighted the need for this.
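Purely as a hypothetical sketch of that idea (none of these names exist in dask-jobqueue today), scale() and the adaptive logic would both funnel through one recorded target:

    class Cluster:
        """Hypothetical: one recorded target shared by scale() and Adaptive."""

        def __init__(self):
            self._target_workers = 0

        @property
        def target_workers(self):
            return self._target_workers

        def scale(self, n):
            self._target_workers = n   # record the target...
            self._submit_jobs(n)       # ...then act on it (hypothetical helper)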

Another thing that would be worth discussing is the ability for Adaptive to query the Cluster for pending workers. This obviously comes up in jobqueue when there are jobs in a queue, but I imagine this will also be useful to other dask deploy managers (dask-kubernetes, etc.).

wgustafson commented 5 years ago

Sounds like you have a pretty good idea of my workflow scenario and some initial ideas on what might be needed to enable it to work predictably. Thanks for looking into this.

From @guillaumeeb's report, it looks like there might be something different going on with NERSC's edison computer compared to his cluster, since I was not seeing any restarts, whereas he was getting restarts but with data-passing issues. If you would like me to run any tests on edison, let me know.

guillaumeeb commented 5 years ago

@wgustafson, it could still be useful if you could give us some simplified version of the code you're trying to run, just to be sure.

For information, I've performed my tests with dask-jobqueue installed from one of the latest commits on the master branch. Were you using this too, or did you just pip (or conda) install the latest released package?

@jhamman just to be sure, apart from fixing some edge-case issues detected with #97, do you believe it would be a good idea to have automatic restart of finished jobs by default?

It should be possible to run a single large graph with an evolving herd of workers, even if the workers don't live very long.

@mrocklin from my recent experiment here, I cannot see how this could be possible for now. Are in-memory data backed up or transmitted in some way between workers? The simple workflow of submitting a group of independent tasks seems to show that each worker needs to last long enough to process nb_tasks / nb_workers tasks. But I did not try with several workers, so this is just a guess.

jhamman commented 5 years ago

@jhamman just to be sure, apart from fixing some edge-case issues detected with #97, do you believe it would be a good idea to have automatic restart of finished jobs by default?

I was thinking this would only work when using Adaptive. I think a non-adaptive job that dies due to a wall-clock expiration should stay dead.

guillaumeeb commented 5 years ago

Yep, I agree with you @jhamman: there is no reason to implement a more complex default behavior when most of the time it is not needed (and probably not what we want), and when there is already a mechanism that answers the need.

So this works right now with Adaptive, and you've also opened #112 to fix the edge cases.

Another thing that would be worth discussing is the ability for Adaptive to query the Cluster for pending workers.

I'll let you open another issue for that part, describing the advantages it would have.

Now @wgustafson, let's try to solve your adaptive problem, and also see if your workflow is compatible with dying and restarting workers.

josephhardinee commented 5 years ago

I assume starting a new worker would require a job script to be submitted to the scheduler (on systems that are not exclusive allocations). This probably has to be handled by jobqueue, and I'm wondering what the mechanism for transferring that information between distributed's Adaptive and jobqueue is? I didn't notice an explicit mechanism for this to happen.

wgustafson commented 5 years ago

I obtained dask-jobqueue via "conda install dask-jobqueue -c conda-forge" following the docs. The versions I have for my software stack are:

I have started tinkering with a simplified script based on the my_slow_task script above to see how that behaves on edison. Hopefully, later today I will get time to verify it acts the same.

mrocklin commented 5 years ago

@josephhardinee

https://github.com/dask/dask-jobqueue/blob/4ea3c984be11984d047c9d38fdb81fb299d7530c/dask_jobqueue/core.py#L306-L338
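In rough outline (a simplified, hypothetical sketch rather than the linked code), the contract at the time was that Adaptive called scale_up/scale_down on the cluster object, and the jobqueue cluster fulfilled those calls by submitting or cancelling batch jobs:

    class SketchJobQueueCluster:
        """Hypothetical outline of the Adaptive <-> jobqueue interaction."""

        def scale_up(self, n, **kwargs):
            # Adaptive asks for n workers in total: submit enough batch jobs
            # (sbatch/qsub/...) so that n workers will eventually connect.
            ...

        def scale_down(self, workers):
            # Adaptive hands back the addresses of workers to retire: map them
            # to their job IDs and cancel those jobs (scancel/qdel/...).
            ...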

wgustafson commented 5 years ago

OK, I've tested code similar to that from @guillaumeeb where he reported seeing the restarts happening. On edison, I do not get any restarted workers after the worker allocation times out. I tried using both the original client.gather approach as well as the as_completed approach suggested by @mrocklin.

As a check, I also tried using workers that lasted 10 minutes to make sure everything completes correctly when given enough time, and the long queue request succeeded. For the restart testing, I used 1 min. queue requests for the workers, and I tried with both 1 and 2 workers, but to no avail. Watching the SLURM queue, I could see the controlling job start, the worker(s) start, the workers complete, and then the controller just waiting until it times out.

Here is the python code for running dask:

from dask.distributed import Client, as_completed
from dask_jobqueue import SLURMCluster
import numpy as np
import time

def my_slow_task(i):
    print("i=", i)
    time.sleep(10)
    return i**2

# Start the cluster...
cluster = SLURMCluster(cores=1, processes=1, walltime="00:01:00")
print(cluster.job_script())
cluster.adapt(minimum=1, maximum=1)
client = Client(cluster)

# Do some work...
# Method 1 using client.gather
# results = client.gather(client.map(my_slow_task, range(20)))

# Method 2 using as_completed
ntasks = 20
feedback = np.empty(ntasks)
results = []
feedback = client.map(my_slow_task, range(ntasks))
for future in as_completed(feedback):
    results.append(future.result())

print(results)

And, here is the job submission script I used to submit the above code:

#!/bin/csh
#SBATCH --job-name=dasktest
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24  #edison full core count
#SBATCH --output=slurm.out
#SBATCH -p debug
cd /global/u1/d/d3m088/cmdv/code/testing_dask
ipython test_dask_restarts.py

The versions of my software are given in my earlier post today. If it helps, here is the full dump from dask-worker.err. The dask-worker.out file is empty and no output gets to slurm.out. Is the lack of file locking on this system causing the difference for edison?

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.128.14.25:40990'
distributed.diskutils - ERROR - Failed to clean up lingering worker directories in path: %s
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/diskutils.py", line 232, in new_work_dir
    self._purge_leftovers()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/diskutils.py", line 137, in _purge_leftovers
    lock.acquire()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 188, in acquire
    self._lock.acquire(self._timeout, self._retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 118, in acquire
    lock.acquire(timeout, retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 168, in acquire
    path=self._path,
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 100, in _acquire_non_blocking
    success = acquire()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 165, in <lambda>
    acquire=lambda: _lock_file_non_blocking(self._file),
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 56, in _lock_file_non_blocking
    fcntl.flock(file_.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 524] Unknown error 524
distributed.diskutils - ERROR - Could not acquire workspace lock on path: /global/u1/d/d3m088/cmdv/code/testing_dask/dask-worker-space/worker-sfxupu98.dirlock .Continuing without lock. This may result in workspaces not being cleaned up
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/diskutils.py", line 57, in __init__
    with workspace._global_lock():
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 194, in __enter__
    self.acquire()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 188, in acquire
    self._lock.acquire(self._timeout, self._retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 118, in acquire
    lock.acquire(timeout, retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 162, in acquire
    _lock_file_blocking(self._file)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 52, in _lock_file_blocking
    fcntl.flock(file_.fileno(), fcntl.LOCK_EX)
OSError: [Errno 524] Unknown error 524
distributed.worker - INFO -       Start worker at:   tcp://10.128.14.25:45709
distributed.worker - INFO -          Listening to:   tcp://10.128.14.25:45709
distributed.worker - INFO -              nanny at:         10.128.14.25:40990
distributed.worker - INFO -              bokeh at:          10.128.14.25:8789
distributed.worker - INFO - Waiting to connect to:   tcp://10.128.0.130:45360
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /global/u1/d/d3m088/cmdv/code/testing_dask/dask-worker-space/worker-sfxupu98
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://10.128.0.130:45360
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.14.25:40990'
distributed.dask_worker - INFO - Exiting on signal 15
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>

guillaumeeb commented 5 years ago

As mentioned in #126, I fear that adaptive mode is broken in release 0.3.0 of dask-jobqueue. It was later fixed by #63.

I would recommend trying the master branch and seeing if that fixes this wrong behaviour.

guillaumeeb commented 5 years ago

I confirm that adaptive is broken with 0.3.0; I see the same behaviour as @wgustafson: no worker is restarted after the first one is stopped.

wgustafson commented 5 years ago

I did some tests and don't necessarily have good news to report.

I uninstalled dask-jobqueue v0.3.0 and installed the current master from this afternoon, reported as v0.3.0+31.g4ea3c98. I can confirm that new workers now start when old ones time out in the SLURM queue, which is what we wanted. However, I am now unable to get any results using the code above. The workers keep cycling beyond when the work should be completed and the scheduler task is not properly collecting the results and reporting them. Even when I request a short amount of work that should complete in a single worker's allocation, the workers keep cycling until the scheduler's allocation expires. Also, if I switch from cluster.adapt to cluster.scale I do not get a result.

My slurm.out file is reporting issues with workers already existing, as shown by the following dump. So, I'm not sure if the newest dask-jobqueue version is a step forward or backward for this particular workflow.

distributed.scheduler - ERROR - '10314564'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 61, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '10314564'
distributed.utils - ERROR - Worker already exists tcp://128.55.203.205:19379
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/utils.py", line 646, in log_errors
    yield
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1226, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://128.55.203.205:19379
distributed.core - ERROR - Worker already exists tcp://128.55.203.205:19379
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/core.py", line 320, in handle_comm
    result = yield result
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 315, in wrapper
    yielded = next(result)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1226, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://128.55.203.205:19379
distributed.scheduler - ERROR - '10314564'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1662, in remove_worker
    plugin.remove_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 73, in remove_worker
    del self.running_jobs[job_id][name]
KeyError: '10314564'

guillaumeeb commented 5 years ago

We've seen these KeyError exceptions before. This probably means that inside one job, your worker is terminated with an error and automatically restarted over and over.

We have to find the root cause of the worker failure; sometimes it is an out-of-memory error, but I suspect that is not the case here...

guillaumeeb commented 5 years ago

In order to move forward here, @wgustafson could you provide the first error you see in the slurm.out file? We need to find the root cause of the worker dying prematurely.

wgustafson commented 5 years ago

I re-ran the code to get the STDOUT and STDERR in slurm.out. This caused the IP addresses to change, so below I include both the slurm.out and dask-worker.err log output. The code and slurm submission script are from my previous comment on August 14, 2018 (above). Note that I am getting the print output from print(cluster.job_script()) but I never get the final results printed from the actual calculation, which should be transmitted back to the controlling process.

First, the contents of slurm.out. Sorry for the escape codes; I haven't figured out how to get the output to display correctly yet.

Mon Aug 27 09:24:31 PDT 2018
distributed.scheduler - ERROR - '10512912'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 61, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '10512912'
distributed.utils - ERROR - Worker already exists tcp://128.55.203.68:46211
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/utils.py", line 646, in log_errors
    yield
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1226, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://128.55.203.68:46211
distributed.core - ERROR - Worker already exists tcp://128.55.203.68:46211
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/core.py", line 320, in handle_comm
    result = yield result
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 315, in wrapper
    yielded = next(result)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1226, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://128.55.203.68:46211
distributed.scheduler - ERROR - '10512912'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1662, in remove_worker
    plugin.remove_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 73, in remove_worker
    del self.running_jobs[job_id][name]
KeyError: '10512912'
distributed.scheduler - ERROR - '10512920'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 61, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '10512920'
distributed.scheduler - ERROR - '10512920'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1662, in remove_worker
    plugin.remove_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 73, in remove_worker
    del self.running_jobs[job_id][name]
KeyError: '10512920'
#!/bin/bash

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -e dask-worker.err
#SBATCH -o dask-worker.out
#SBATCH -p debug
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=60G
#SBATCH -t 00:01:00
JOB_ID=${SLURM_JOB_ID%;*}

/global/homes/d/d3m088/.conda/envs/bill/bin/python -m distributed.cli.dask_worker tcp://128.55.203.147:28265 --nthreads 1 --memory-limit 64.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
/global/u1/d/d3m088/cmdv/code/testing_dask/test_dask_restarts.py in <module>()
     32 feedback = client.map(my_slow_task, range(ntasks))
     33 for future in as_completed(feedback):
---> 34     print("result=",future.result())
     35     results.append(future.result())
     36 

~/.conda/envs/bill/lib/python3.6/site-packages/distributed/client.py in result(self, timeout)
    191                                   raiseit=False)
    192         if self.status == 'error':
--> 193             six.reraise(*result)
    194         elif self.status == 'cancelled':
    195             raise result

~/.conda/envs/bill/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ('my_slow_task-e10b2ccc4ae480dcfa7052ff39639fcd', 'tcp://128.55.203.68:46284')
Exception ignored in: <generator object add_client at 0x2aaacafe0360>
RuntimeError: generator ignored GeneratorExit
Future exception was never retrieved
future: <Future finished exception=CommClosedError('in <closed TCP>: Stream is closed',)>
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/comm/tcp.py", line 177, in read
    n_frames = yield stream.read_bytes(8)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/comm/tcp.py", line 198, in read
    convert_stream_closed_error(self, e)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
Future exception was never retrieved
future: <Future finished exception=CommClosedError('in <closed TCP>: Stream is closed',)>
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/comm/tcp.py", line 177, in read
    n_frames = yield stream.read_bytes(8)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/comm/tcp.py", line 198, in read
    convert_stream_closed_error(self, e)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed
Mon Aug 27 09:27:54 PDT 2018

And, dask-worker.err:

distributed.nanny - INFO -         Start Nanny at: 'tcp://128.55.203.68:45895'
distributed.diskutils - ERROR - Failed to clean up lingering worker directories in path: %s 
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/diskutils.py", line 232, in new_work_dir
    self._purge_leftovers()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/diskutils.py", line 137, in _purge_leftovers
    lock.acquire()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 188, in acquire
    self._lock.acquire(self._timeout, self._retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 118, in acquire
    lock.acquire(timeout, retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 168, in acquire
    path=self._path,
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 100, in _acquire_non_blocking
    success = acquire()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 165, in <lambda>
    acquire=lambda: _lock_file_non_blocking(self._file),
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 56, in _lock_file_non_blocking
    fcntl.flock(file_.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 524] Unknown error 524
distributed.diskutils - ERROR - Could not acquire workspace lock on path: /global/u1/d/d3m088/cmdv/code/testing_dask/dask-worker-space/worker-ca6aslbu.dirlock .Continuing without lock. This may result in workspaces not being cleaned up
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/diskutils.py", line 57, in __init__
    with workspace._global_lock():
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 194, in __enter__
    self.acquire()
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 188, in acquire
    self._lock.acquire(self._timeout, self._retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 118, in acquire
    lock.acquire(timeout, retry_period)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 162, in acquire
    _lock_file_blocking(self._file)
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/locket.py", line 52, in _lock_file_blocking
    fcntl.flock(file_.fileno(), fcntl.LOCK_EX)
OSError: [Errno 524] Unknown error 524
distributed.worker - INFO -       Start worker at:  tcp://128.55.203.68:46284
distributed.worker - INFO -          Listening to:  tcp://128.55.203.68:46284
distributed.worker - INFO -              nanny at:        128.55.203.68:45895
distributed.worker - INFO -              bokeh at:         128.55.203.68:8789
distributed.worker - INFO - Waiting to connect to: tcp://128.55.203.147:28265
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /global/u1/d/d3m088/cmdv/code/testing_dask/dask-worker-space/worker-ca6aslbu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://128.55.203.147:28265
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://128.55.203.147:28265
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://128.55.203.68:45895'
distributed.dask_worker - INFO - Exiting on signal 15
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>

guillaumeeb commented 5 years ago

The problem looks a bit like #139. Do you define a local-directory to use for your workers? They seem to try to lock shared folders. Could you try to set it when constructing your cluster? Common values are /tmp or /scratch.

wgustafson commented 5 years ago

I just tried manually stepping through the code and watching files get created on the disk system as I go. For my setup, I was using the default system temp drive, which is /tmp via $TMPDIR. I did not see anything get created in it. The directory containing the code was in my home directory, which is cross-mounted to the compute nodes on edison. So, it may not be possible to do file locking for it. The shared lustre system also does not have locking (I think). The home filesystem is where the dask-worker.err, dask-worker.out, and dask-worker-space/* get created.

If it matters, I also have my defaults in jobqueue.yaml set to not permit memory spilling. I am using

distributed:
  worker:
    memory:
      target: False  # Avoid spilling to disk
      spill: False  # Avoid spilling to disk
      pause: 0.80  # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker

I have noticed that the .err and .out files get into a race condition when I use multiple processes. I've watched them get overwritten as processes spin up, effectively wiping out output from previous processes. In the current tests I did not see this since I am only using one process (unless it happened when the adapt methodology spun up a new process and I missed it).

guillaumeeb commented 5 years ago

I was not clear in my comment above: it looks like you're not defining local-directory in the jobqueue.yaml conf file or using the local_directory parameter in the SLURMCluster constructor.

Your workers are launched with

/global/homes/d/d3m088/.conda/envs/bill/bin/python -m distributed.cli.dask_worker tcp://128.55.203.147:28265 --nthreads 1 --memory-limit 64.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60

where there is no --local-directory option.

Use

cluster = SLURMCluster(cores=1, processes=1, walltime="00:01:00", local_directory='$TMPDIR')

if the environment variable TMPDIR is set. Else something like '/tmp'.

This may not be the root cause of your problem, but it will at least remove the lock-failure and clean-up errors from the log files.

guillaumeeb commented 5 years ago

I have noticed that the .err and .out files get into a race condition when I use multiple processes. I've watched them get overwritten as processes spin up, effectively wiping out output from previous processes

I've just noticed with #139 that in the LSF and SLURMCluster template scripts, standard error and output files are configured to be written to job_name.err and job_name.out, so dask-worker.err and dask-worker.out with the default settings.

This is probably not a good idea: with PBS, for example, this would lead to the file being overwritten every time an individual job ends, and it is probably the same with LSF or Slurm. With such a default it could prove difficult to analyse the full error logs.

You could probably try to override defaults using job_extra keyword:

cluster = SLURMCluster(..., job_extra=['-e dask-worker-%j.err', '-o dask-worker-%j.out'])

wgustafson commented 5 years ago

Today, I have been testing the code with the two changes based on your suggestions plus adding an extra timeout setting. This often works, and seems to be reliable for small requests that do not spill into a second process. However, when I scale up the problem to more calls to my_slow_task I start getting unreliable behavior. Often, the job never completes properly and I get no output. Sometimes this is accompanied by errors and other times not. I'm still tinkering to see if I can make sense of what is happening. Slow queue turnaround is making this a bit difficult.

Specifically, I changed

cluster = SLURMCluster(cores=1, processes=1, walltime="00:01:00")
client = Client(cluster)

to

cluster = SLURMCluster(cores=1, processes=1, walltime="00:01:00", local_directory="$TMPDIR", job_extra=['-e dask-worker-%j.err', '-o dask-worker-%j.out'])
client = Client(cluster, timeout='60s')

I tested on edison with both the scale and adapt methodologies. The good news is that the file locking issue no longer comes up, which is nice. Also, I can now see output from all the spawned processes.

guillaumeeb commented 5 years ago

Glad to hear we were able to address some of the issues here.

However, when I scale up the problem to more calls to my_slow_task I start getting unreliable behavior

Does this work sometimes? I would think not. Considering the use case you've given above:

def my_slow_task(i):
    print("i=", i)
    time.sleep(10)
    return i**2

# Start the cluster...
cluster = SLURMCluster(cores=1, processes=1, walltime="00:01:00", local_directory="$TMPDIR", job_extra=['-e dask-worker-%j.err', '-o dask-worker-%j.out'])
cluster.adapt(minimum=1, maximum=1)
client = Client(cluster, timeout='60s')

# Do some work...
results = client.gather(client.map(my_slow_task, range(20)))

My personal experience is that you won't be able to submit more than about 5 tasks at once to Dask with these short-lived workers. Maybe batching and doing multiple calls to client.map would work, not sure (see the sketch at the end of this comment). See the comment above, https://github.com/dask/dask-jobqueue/issues/122#issuecomment-412501959, and my conclusion:

users should not submit graphs that cannot finish within the job allocated time and the requested number of workers. Submitting several graphs should be fine though (so doing batches if needed).

@mrocklin it would be nice if you could confirm this, and maybe give some advice here?
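As a rough sketch of the batching idea (assuming the client and my_slow_task from the example above, and a batch size chosen so one batch fits within a single one-minute job):

    # Submit the work in batches that can finish within a single worker
    # allocation, gathering each batch before submitting the next one.
    batch_size = 5   # assumption: roughly five 10-second tasks per 1-minute job
    results = []
    for start in range(0, 20, batch_size):
        futures = client.map(my_slow_task, range(start, start + batch_size))
        results.extend(client.gather(futures))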

wgustafson commented 5 years ago

Most of the testing I did that I remember working for my_slow_task was with 5 calls to the subroutine via client.map. The testing yesterday with 20 and 80 was failing. Over the day today, I'll try to submit some different sizes and be more systematic about documenting the behavior.

I have more complicated code that makes matplotlib plots in each call to the subroutine, and then a different version that subsets data files with each call. In those codes, dask has worked up to 109 calls. However, I used a slightly different calling method. Instead of client.map I used something akin to

frames = np.arange(0, 109, dtype=np.int)
results = [dask.delayed(do_plotting)(iframe) for iframe in frames]

I think I was also using scale instead of adapt. But again, the method that worked last week was failing yesterday (albeit with modified code for new purposes, so I might have changed something else too that caused issues). My impression is that the behavior might depend on how busy the network is, or something like that.

guillaumeeb commented 5 years ago

I think this is not related to dask-jobqueue anymore, and that this should be handled upstream.

wgustafson commented 5 years ago

Following up on yesterday's post, I tested the code and was able to get it to work with 5, 10, and 15 calls to the subroutine. But, it would always fail with 20 calls. So, I can corroborate your experience. I tried 3 times throughout the day to see if the machine state matters. This feels like some sort of timeout setting that needs to be adjusted, but I am not familiar enough with dask to know what to tinker with.

For the record, instead of using client.gather(client.map()), I used the following calling sequence:

results = []
feedback = client.map(my_slow_task, range(ntasks))
for future in as_completed(feedback):
    results.append(future.result())

If this is no longer the right place for this issue, we can close this thread.

guillaumeeb commented 5 years ago

This is very interesting. I have not used as_completed yet, but it could allow you to gather results as soon as they are computed, so you are not forced to have the whole computation finish within one worker's allocation.

We could wait for an external perspective or open an issue upstream in the distributed tracker, as you wish, but I cannot help any further here.

guillaumeeb commented 5 years ago

@mrocklin it would be good if you could take a look at this one, more precisely the last 3 or 4 comments.

mrocklin commented 5 years ago

results = []
feedback = client.map(my_slow_task, range(ntasks))
for future in as_completed(feedback):
    results.append(future.result())

You might consider avoiding having a reference to the futures in feedback

results = []
for future in as_completed(client.map(my_slow_task, range(ntasks))):
    results.append(future.result())

This means that as futures are completed you collect their result, and then they go out of scope. Once they're out of scope there is no need to recompute them if the worker that they were computed on dies.

For as long as you have the futures in the feedback list in your local scope, Dask will try to make sure that they are in memory. If the worker that they are on dies unexpectedly then Dask will try to recompute them elsewhere.

Another solution to this problem would be to anticipate the worker's timeout and actively retire the worker just before it gets killed. You can do this with Scheduler.retire_workers(workers=[worker_address]). If you politely retire a worker then it will move its result to another worker so that that task does not have to be recomputed.
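A rough sketch of that second idea, using the client-side retire_workers call (the helper name and its timing margin are made up for illustration):

    import threading
    import time

    def retire_before_walltime(client, worker_address, walltime_s, margin_s=60):
        """Hypothetical helper: politely retire a worker shortly before its
        allocation runs out, so its results migrate to surviving workers."""
        def _retire():
            time.sleep(max(walltime_s - margin_s, 0))
            client.retire_workers(workers=[worker_address])
        threading.Thread(target=_retire, daemon=True).start()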

I haven't read this entire thread (only the last few comments). @guillaumeeb please let me know if there are other things that I've missed that are important, or if the answer above can be clarified.

guillaumeeb commented 5 years ago

Thanks @mrocklin, this is perfect. @wgustafson, if you have time to try the solution mentioned above, that would be great.

wgustafson commented 5 years ago

I just tried swapping the lines of code so that the futures are no longer held in the feedback list. Like before, it worked with 10 tasks but failed with 20 tasks. I am still getting KeyError messages for the worker's job ID number.

The suggestion from @mrocklin is related to workers dying and then having to resurrect work elsewhere. I do not think the problem with the KeyError is related to this since, in this case, the worker is actually still running after the controlling process fails and ends. With 20 tasks, I can watch the controller spawn the worker job in the queue, then the controller ends, and the worker is still present. If I only do 10 tasks, the controller and worker end essentially simultaneously since it is a clean shutdown.

For the record, the errors in my SLURM output are:

distributed.scheduler - ERROR - '10671256'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 61, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '10671256'
distributed.scheduler - ERROR - '10671256'
Traceback (most recent call last):
  File "/global/homes/d/d3m088/.conda/envs/bill/lib/python3.6/site-packages/distributed/scheduler.py", line 1662, in remove_worker
    plugin.remove_worker(scheduler=self, worker=address)
  File "/global/u1/d/d3m088/python/dask-jobqueue/dask_jobqueue/core.py", line 73, in remove_worker
    del self.running_jobs[job_id][name]
KeyError: '10671256'

guillaumeeb commented 5 years ago

I am still getting KeyError messages for the worker's job ID number.

If you're still getting those errors, then I suspect your worker processes are still getting killed and restarted unexpectedly. We have to find the cause of this restart: is there something in the worker logs indicating that a worker has been killed or has errored? Could you once again paste here the logs from a restarting worker?

Are we still talking about the really simple example above, with each task doing almost nothing? Looking at it, I really don't see what could cause a worker to restart...

guillaumeeb commented 5 years ago

Hi @wgustafson, any update on this one?

KeyError logs should be fixed by #138. I'm still not sure what the problem here is.

jni commented 5 years ago

@mrocklin

You might consider avoiding having a reference to the futures in feedback

:boom: I've been trying to figure this out all day. So obvious now. Thanks!

Q: is there another way of achieving the same result while still keeping the reference to the list? Purely for coding style purposes. Also, in the past I've had issues with garbage collection not quite letting go of some objects, so I suspect being explicit here might be helpful. I.e. will the following be equivalent?

results = []
feedback = client.map(my_slow_task, range(ntasks))
for future in as_completed(feedback):
    results.append(future.result())
    future.cancel()

Another solution to this problem would be to anticipate the worker's timeout and actively retire the worker just before it gets killed

This feels like something in-scope for dask-jobqueue to handle automatically, right? I can of course imagine many things going wrong with this so we might want it to be opt-in for a start, but something along the lines of auto_retire_time='5m'?

jni commented 5 years ago

And btw thanks everyone for all the parts of this discussion, it has been extremely helpful as I try to figure out dask-jobqueue's behaviour with my tasks/cluster.

jni commented 5 years ago

Ok so there's still a bit of a race condition with the current implementation:

client = Client(cluster)

X = np.empty((len(filenames), len(feature_names)), dtype=np.float32)
for future in tqdm(as_completed(client.map(features, enumerate(filenames))),
                   desc='features', total=len(X)):
    i, v = future.result()
    X[i, :] = v

results in:

features:  11%|███████▍                                                             | 5879/54528 [7:27:32<12:37:14,  1.07it/s]
---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
~/projects/microscopium-scripts/feature-comp-vi.py in <module>
     79 for future in tqdm(as_completed(client.map(features, enumerate(filenames))),
     80                    desc='features', total=len(X)):
---> 81     i, v = future.result()
     82     X[i, :] = v
     83

~/.conda/envs/mic/lib/python3.6/site-packages/distributed/client.py in result(self, timeout)
    193                                   raiseit=False)
    194         if self.status == 'error':
--> 195             six.reraise(*result)
    196         elif self.status == 'cancelled':
    197             raise result

~/.conda/envs/mic/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ('features-e0a14d80d2b5e66c9e270a60d2afd55d', 'tcp://172.16.193.80:45747')

I assume that this is because the worker gets killed in between as_completed yielding the finished future and .result() being called on it. The fix I'll try now is:

client = Client(cluster)

X = np.empty((len(filenames), len(feature_names)), dtype=np.float32)
for future in tqdm(as_completed(client.map(features, enumerate(filenames))),
                   desc='features', total=len(X)):
    try:
        i, v = future.result()
    except KilledWorker:
        client.retry([future])
        i, v = future.result()
    X[i, :] = v

Is this the best way to do this? Would as_completed(..., with_results=True) get around the issue (i.e. do the retrying under the hood) or would it potentially hit the same error?

guillaumeeb commented 5 years ago

Thanks @jni for your nice comment and your interest.

Your use case is a bit complicated for me to fully grasp right now. I would recommend opening a dedicated issue, maybe directly upstream in distributed (though this may be linked to dask-jobqueue jobs' walltime, in which case it should stay here). It would also be good if you could work out a minimal reproducible example so we can help more effectively.

You could also add in the new issue any new experiments you may have done in the last couple of days.

Does it sound good to you?

jni commented 5 years ago

@guillaumeeb this seems to be the right issue? It's not distributed's job to know when workers will die.

My code is embarrassingly parallel: I am computing a feature vector from each of ~50K images, then putting those results into a NumPy array and finally into a pandas DataFrame to save. I believe this is a minimal (modulo the duration/resources, which can probably be reduced!) example matching my use case:

import time
import numpy as np
import pandas as pd  # needed for the DataFrame at the end
from dask_jobqueue import SLURMCluster as Cluster
from dask import delayed
from dask.distributed import Client, as_completed

# config in /home/jnun0003/.config/dask/jobqueue.yaml
cluster = Cluster(walltime='01:00:00', cores=1, memory='4gb', job_cpu=1,
                  job_mem='4gb')
cluster.adapt(minimum=0, maximum=48, target_duration='4h')
client = Client(cluster)

filenames = [f'img{num}.jpg' for num in range(50000)]

def features(num_fn):
    num, image_fn = num_fn
    # features = compute_features(image_fn)  # my own custom code, equivalent code below
    time.sleep(5)  # takes about 5s to compute features on an image
    features = np.random.random(246)
    return num, features

num_files = len(filenames)
num_features = len(features((0, filenames[0]))[1])  # features() expects a (num, filename) pair

X = np.empty((num_files, num_features), dtype=np.float32)

for future in as_completed(client.map(features, enumerate(filenames))):
    i, v = future.result()
    X[i, :] = v

features = pd.DataFrame(X, columns=[f'feat{i}' for i in range(X.shape[1])])
features['filenames'] = filenames

features.to_csv('data.csv', index=False)

With jobqueue 0.4.1, this ends up with a KilledWorker error, which I think is what this issue is dealing with? Either way, some workers are bound to be killed, since the total feature computation time exceeds the roughly 48 worker-hours available to the initial pool of 48 one-hour workers. And for me this is a primary reason to use dask-jobqueue: one shouldn't need to know ahead of time how long a computation will take, if dask-jobqueue can take care of spawning replacement workers as needed.

So I have two questions: (1) what is the best way to deal with this gracefully right now, and (2) where can I make a contribution so that the code above works transparently, or at least this equivalent for-loop:

for future, (i, v) in as_completed(client.map(features, enumerate(filenames)), with_results=True):
    X[i, :] = v

I haven't yet completed the necessary experiments, but I see two options. (a) As above, retry it. But I think this actually has a race condition problem of its own:

client.retry([future])  # future gets submitted to near-death worker
# worker gets killed
i, v = future.result()  # KilledWorker error again

So now, for option (b), I'm thinking of:

from dask.distributed import as_completed
from distributed.scheduler import KilledWorker

work_queue = as_completed(client.map(features, enumerate(filenames)))
for future in work_queue:
    try:
        i, v = future.result()
    except KilledWorker:
        client.retry([future])   # resubmit the task...
        work_queue.add(future)   # ...and put the future back in the queue
    else:
        X[i, :] = v

jni commented 5 years ago

Ok so I tried the last option, and it worked like a charm! I wrapped it in a tqdm progress bar but here's the output:

features:  50%|██████████████████████████████████▍                                  | 27183/54528 [1:02:34<1:02:27,  7.30it/s]distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://172.16.193.188:41686'], features-55f478c8aca9dcbd97f98cdeb1d49b36
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'features-55f478c8aca9dcbd97f98cdeb1d49b36': ('tcp://172.16.193.188:41686',)}
features:  99%|██████████████████████████████████████████████████████████████████████▎| 54046/54528 [2:04:04<01:36,  4.99it/s]distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://172.16.193.190:39647'], features-15f2d90c870cd1ef0dc775bbf90374d1
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'features-15f2d90c870cd1ef0dc775bbf90374d1': ('tcp://172.16.193.190:39647',)}
features: 100%|███████████████████████████████████████████████████████████████████████| 54528/54528 [2:05:46<00:00,  7.23it/s]
features computed in 2.0h6.0m22.0s

Which is brilliant! I don't know what the warnings are, but they don't seem to have affected the result. Oh and there's a future.cancel() in the else: clause which probably helps.

lesteve commented 5 years ago

Great to hear that you got it working!

I very naively assumed this is the kind of thing that "just works" (without having to implement the retry logic by hand) because of the built-in fault tolerance, i.e. the scheduler detects that the worker has died so it can send the job to be recomputed to another worker.

It would be great to have a stand-alone example reproducing the problem either way! In particular, what's not clear to me is whether this is a problem in dask-jobqueue or in distributed (on a single machine, you could imagine a worker being killed with kill -9 or something, so a similar situation could happen) and what could be done to improve the situation, in particular to avoid the user having to implement the retry logic themselves.

guillaumeeb commented 5 years ago

I very naively assumed this is the kind of thing that "just works" (without having to implement the retry logic by hand) because of the built-in fault tolerance, i.e. the scheduler detects that the worker has died so it can send the job to be recomputed to another worker.

What happens here may be that the tasks are restarted up to three times, but since 48 workers must be dying at almost the same time, the tasks get rescheduled onto workers that are themselves about to die.
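
If that three-restarts limit is indeed what is being hit, one knob that could be worth experimenting with (just a guess on my part, not something verified in this thread) is the scheduler's allowed-failures setting, which controls how many times a task may be on a dying worker before it is marked as KilledWorker:

import dask

# Tolerate more worker deaths per task before the scheduler gives up and
# raises KilledWorker (the default is 3). This must be set before the
# cluster/scheduler is created for it to take effect.
dask.config.set({'distributed.scheduler.allowed-failures': 10})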

Or as @jni says, it may be

that this is because the worker gets killed in between as_completed finding the finished job and .result() being called on it.

Not sure. This may also come from as_completed...

I think we can simplify @jni's example so that it is more easily reproducible (decrease the walltime and the number of tasks, for example). I totally agree with @lesteve: users should not have to do the retry by hand.
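
Something along these lines could serve as a starting point for the reproducer (a sketch only; the numbers are chosen so that workers are guaranteed to hit their walltime mid-computation, and slow_task is just a stand-in for the real feature computation):

import time
import numpy as np
from dask_jobqueue import SLURMCluster as Cluster
from dask.distributed import Client, as_completed

# Workers live at most ~5 minutes, but the total work needs several
# worker-lifetimes, so some workers will be killed mid-computation.
cluster = Cluster(walltime='00:05:00', cores=1, memory='2GB')
cluster.adapt(minimum=0, maximum=4)
client = Client(cluster)

def slow_task(i):
    time.sleep(5)  # pretend this is a 5 s feature computation
    return i, np.random.random(10)

X = np.empty((500, 10), dtype=np.float32)
for future in as_completed(client.map(slow_task, range(500))):
    i, v = future.result()  # expected to eventually raise KilledWorker
    X[i, :] = v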

Some questions out of curiosity:

Why are you using only one core per job?

Could you log some information to see how many times you get the KilledWorker exception?

jni commented 5 years ago

@guillaumeeb

Why are you using only one core per job?

My cluster is quite oversubscribed so I wanted to produce jobs as small as possible to maximise my chances of getting scheduled. I was also unclear about the effect of number of threads vs number of cores... Is this using multiprocessing vs threading as a backend?
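
My current, possibly wrong, understanding is that cores is the total number of threads per job and processes controls how many worker processes those threads are split across, e.g.:

from dask_jobqueue import SLURMCluster

# One job = 4 worker processes with 2 threads each (8 threads total);
# the job's memory is divided among the 4 processes.
cluster = SLURMCluster(cores=8, processes=4, memory='16GB')

so pure-Python tasks would want more processes, while numpy-heavy tasks that release the GIL can share threads within a process. Is that right?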

Could you log some information to see how many times you get the KilledWorker exception?

Yes, I'll work on these experiments in the next few days. Thanks for your support and your patience!