dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

ConnectionRefusedError #614

Open mens-artis opened 1 year ago

mens-artis commented 1 year ago
2023-09-14 14:24:39,905 - distributed.core - INFO - Starting established connection to tcp://...166.214:46011
slurmstepd-dlcgpu16: error: *** JOB 9277005 ON dlcgpu16 CANCELLED AT 2023-09-14T14:24:41 ***
2023-09-14 14:24:41,045 - distributed.worker - INFO - Stopping worker at tcp://...166.176:40901. Reason: scheduler-close
2023-09-14 14:24:41,046 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...5.166.214:46011>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 316, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 327, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...166.214:46011>: Stream is closed
2023-09-14 14:24:41,051 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://...166.176:33175'. Reason: scheduler-close
2023-09-14 14:24:41,053 - distributed.core - INFO - Received 'close-stream' from tcp://...166.214:46011; closing.
2023-09-14 14:24:41,053 - distributed.nanny - INFO - Worker closed

I had inserted the following code at the top of submit_trial() to avoid a timeout from the scheduler. This may be quite central, because SMAC3 apparently expects the scheduler to launch the compute nodes instantly:

import asyncio

and

try:
    # Block until at least one Dask worker has joined the cluster,
    # or give up once the configured timeout expires.
    self._client.wait_for_workers(n_workers=1, timeout=self._worker_timeout)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(
        f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
        "Try increasing `worker_timeout`."
    )
    raise error
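
For context, here is a minimal standalone sketch of the same waiting pattern, using only dask.distributed and dask_jobqueue APIs. The SLURM queue name and resource values are placeholder assumptions, not taken from the original setup:

import asyncio

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Hypothetical resources: adjust queue, cores, memory and walltime for your site.
cluster = SLURMCluster(queue="normal", cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=1)  # ask SLURM for one worker job

client = Client(cluster)
try:
    # Jobs can sit in the SLURM queue for a long time, so use a generous timeout.
    client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.TimeoutError:
    # No worker arrived in time; tear everything down before propagating.
    client.close()
    cluster.close()
    raise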
mens-artis commented 1 year ago

It seems that the program just ends with an error: one end of the TCP connection keeps going after the job has finished. This does cause a problem, notably that my GPUinfo cannot be run, but that seems to be a side effect of the disconnection, or possibly a different problem altogether. So I guess I will close this here; I don't have time to study it further and am taking a different approach.

guillaumeeb commented 1 year ago

Hi @mens-artis,

I'm not sure about the context here: you are talking about SMAC3 and GPUinfo, which I don't know about.

Anyway, yes, there are sometimes error messages when shutting down a Dask cluster, but as you've noticed, this does not cause a problem for your computation.
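
For anyone landing here later: explicitly closing the client before the cluster (for example via context managers, as in the sketch below with placeholder resources) usually makes shutdown more orderly and reduces these stream-closed tracebacks at exit, though it is not guaranteed to silence all of them:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; adjust for your site.
with SLURMCluster(cores=4, memory="16GB") as cluster:
    cluster.scale(jobs=1)
    with Client(cluster) as client:
        # Any real work goes here; this is just a trivial example task.
        total = client.submit(sum, [1, 2, 3]).result()
        print(total)
# Exiting the blocks closes the client first, then shuts the cluster
# down, which reduces the chance of CommClosedError noise at exit.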