automl / SMAC3

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
https://automl.github.io/SMAC3/v2.1.0/

tcp communication error CancelledError #1073

Open · mens-artis opened this issue 9 months ago

mens-artis commented 9 months ago

I ran the basics example 7. It seems that there is a TCP error at the end (which is expected), when the receiving end of the connection hangs up. This may have been introduced by the fix for [WARNING][dask_runner.py:135] No workers are available. This could mean workers crashed., i.e.:

try:
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
                  "Try increasing `worker_timeout`.")
    raise error

The problem is that catching the timeout this way may obscure real failures (see the sketch below).
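
For comparison, here is a minimal sketch using dask.distributed directly (not SMAC's actual code; the function name wait_for_cluster and the default worker_timeout are made up for illustration) that reports the startup timeout without swallowing unrelated failures such as an unreachable scheduler:

import asyncio
import logging

from dask.distributed import Client

logger = logging.getLogger(__name__)

def wait_for_cluster(client: Client, worker_timeout: float = 1200.0) -> None:
    """Block until at least one worker has registered with the scheduler."""
    try:
        client.wait_for_workers(n_workers=1, timeout=worker_timeout)
    except asyncio.exceptions.TimeoutError:
        # Only the "no worker came up in time" case is turned into a hint;
        # any other exception (e.g. OSError from an unreachable scheduler)
        # propagates unchanged instead of being masked.
        logger.error(
            f"No worker could be scheduled within {worker_timeout}s. "
            "Try increasing `worker_timeout`."
        )
        raise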

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 491, in connect
    stream = await self.client.connect(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py", line 282, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 343, in connect
    comm = await wait_for(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 1920, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/home/username/.python3.10.6/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/SMAC3_wrapped/example7.py", line 119, in <module>
    incumbent = smac.optimize()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
    incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
    self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 135, in submit_trial
    if self.count_available_workers() <= 0:
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 182, in count_available_workers
    return sum(self._client.nthreads().values()) - len(self._pending_trials)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/client.py", line 3919, in nthreads
    return self.sync(self.scheduler.ncores, workers=workers, **kwargs)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
    return sync(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 426, in sync
    raise exc.with_traceback(tb)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/utils.py", line 399, in f
    result = yield future
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1626, in connect
    return connect_attempt.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/core.py", line 1516, in _connect
    comm = await connect(
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py", line 369, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://....166.214:34417 after 30 s
msg: {'status': 'OK', 'time': 1694708544.0040095, 'heartbeat-interval': 0.5}
msg: {'status': 'OK', 'time': 1694708550.0110245, 'heartbeat-interval': 0.5}

With or without the sleep(6), there is a disconnect.
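
The call that actually raises is count_available_workers (dask_runner.py:182 in the traceback above), which asks the scheduler for its thread counts after the scheduler has become unreachable. A hedged sketch of guarding that query; the standalone signature and the OSError handling below are assumptions for illustration, not SMAC's current behaviour:

from dask.distributed import Client

def count_available_workers(client: Client, n_pending_trials: int) -> int:
    # Free workers = total worker threads on the cluster minus trials
    # that have been submitted but not finished yet.
    try:
        return sum(client.nthreads().values()) - n_pending_trials
    except OSError:
        # Scheduler unreachable ("Timed out trying to connect to tcp://... after 30 s"):
        # report zero free workers instead of letting the connection error
        # abort smac.optimize().
        return 0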

err:

cat tmp/smac_dask_slurm/dask-worker-9277642.err
cpu-bind=MASK - dlcgpu26, task  0  0 [25985]: mask 0x8000000080 set
2023-09-14 18:53:42,082 - distributed.nanny - INFO -         Start Nanny at: 'tcp://....166.186:34089'
2023-09-14 18:53:42,662 - distributed.worker - INFO -       Start worker at:   tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO -          Listening to:   tcp://....166.186:38039
2023-09-14 18:53:42,662 - distributed.worker - INFO -           Worker name:             SLURMCluster-3
2023-09-14 18:53:42,662 - distributed.worker - INFO -          dashboard at:         ....166.186:46501
2023-09-14 18:53:42,662 - distributed.worker - INFO - Waiting to connect to:   tcp://....166.214:33319
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,662 - distributed.worker - INFO -               Threads:                          1
2023-09-14 18:53:42,662 - distributed.worker - INFO -                Memory:                   0.93 GiB
2023-09-14 18:53:42,662 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-ls7s1wou
2023-09-14 18:53:42,662 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,669 - distributed.worker - INFO - Starting Worker plugin shuffle
2023-09-14 18:53:42,670 - distributed.worker - INFO -         Registered to:   tcp://....166.214:33319
2023-09-14 18:53:42,670 - distributed.worker - INFO - -------------------------------------------------
2023-09-14 18:53:42,670 - distributed.core - INFO - Starting established connection to tcp://....166.214:33319
slurmstepd-dlcgpu26: error: *** JOB 9277642 ON dlcgpu26 CANCELLED AT 2023-09-14T18:54:25 ***
2023-09-14 18:54:25,422 - distributed.worker - INFO - Stopping worker at tcp://....166.186:38039. Reason: scheduler-close
2023-09-14 18:54:25,423 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 317, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 328, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://....166.186:54362 remote=tcp://....166.214:33319>: Stream is closed
2023-09-14 18:54:25,431 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://....166.186:34089'. Reason: scheduler-close
2023-09-14 18:54:25,442 - distributed.core - INFO - Received 'close-stream' from tcp://....166.214:33319; closing.
2023-09-14 18:54:25,443 - distributed.nanny - INFO - Worker closed

out:

(venv3106) username@node1:$ cat tmp/smac_dask_slurm/dask-worker-9277642.out
/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/core.py : trying for another 29.999988079071045 seconds
/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/tcpclient.py-- timeout = None
msg: {'compression': None, 'python': (3, 10, 6), 'pickle-protocol': 5}
msg: {'op': 'register_nanny', 'address': 'tcp://....166.186:34089'}
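
If the scheduler is only slow to answer rather than gone for good (here the SLURM job was cancelled, so it really was gone), the 30 s limit in "Timed out trying to connect ... after 30 s" is the default dask connection timeout and can be raised via plain dask configuration; this is a generic dask setting, not a SMAC option:

import dask

# Raise the connection timeout from the 30 s default seen in the logs above.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})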
mens-artis commented 9 months ago

I wonder if it was always based on Dask.

alexandertornede commented 8 months ago

Once again, thanks for reporting this. We will look into it.