dstackai / dstack

dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.
https://dstack.ai
Mozilla Public License 2.0

The `Can't connect to the remote host` error doesn't include the actual error exception #1242

Open peterschmidt85 opened 1 month ago

peterschmidt85 commented 1 month ago

Steps to reproduce:

  1. Set up an incorrect SSH config on your machine
  2. Run `dstack run` (a dev environment or a task with ports)

Actual:

  1. It prints `Can't connect to the remote host` without any details of the underlying exception:
File "/Users/chansung/miniconda3/envs/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/attach.py", line 136, in attach
    raise SSHError("Can't connect to the remote host")
dstack._internal.core.errors.SSHError: Can't connect to the remote host

Expected:

  1. The error includes the underlying SSH connection exception

Example:

Traceback (most recent call last):
  File "/Users/chansung/miniconda3/envs/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/attach.py", line 129, in attach
    self.tunnel.open()
  File "/Users/chansung/miniconda3/envs/dstack/lib/python3.11/site-packages/dstack/_internal/core/services/ssh/tunnel.py", line 64, in open
    raise get_ssh_error(error)
dstack._internal.core.errors.SSHError: kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

Notes:

def attach(self):
        include_ssh_config(self.ssh_config_path)
        if self.container_config is None:
            update_ssh_config(self.ssh_config_path, self.run_name, self.host_config)
        elif self.ssh_proxy is not None:
            update_ssh_config(self.ssh_config_path, f"{self.run_name}-jump-host", self.host_config)
            update_ssh_config(self.ssh_config_path, self.run_name, self.container_config)
        else:
            update_ssh_config(self.ssh_config_path, f"{self.run_name}-host", self.host_config)
            update_ssh_config(self.ssh_config_path, self.run_name, self.container_config)

        max_retries = 10
        self._ports_lock.release()
        for i in range(max_retries):
            try:
                self.tunnel.open()
                atexit.register(self.detach)
                break
            except SSHError:
                if i < max_retries - 1:
                    time.sleep(1)
        else:
            self.detach()
            raise SSHError("Can't connect to the remote host")

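The final `raise SSHError(...)` discards the cause of the failure. It should include the last SSHError caught by `except SSHError`, either in the message or as the chained cause.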
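One possible shape for the fix, as a minimal standalone sketch rather than the actual patch. It assumes the retry loop can remember the most recent SSHError and re-raise with it chained. `open_with_retries`, its parameters, and the local `SSHError` class are hypothetical stand-ins for illustration; the real code lives in `attach()` and uses `dstack._internal.core.errors.SSHError`.

```python
import time


class SSHError(Exception):
    """Hypothetical stand-in for dstack._internal.core.errors.SSHError."""


def open_with_retries(open_tunnel, max_retries=10, delay=1.0):
    """Call open_tunnel() up to max_retries times.

    If every attempt fails, raise an SSHError that carries the last
    underlying error both in its message and as its chained cause.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            open_tunnel()
            return
        except SSHError as e:
            # Remember the most recent failure so it can be reported later.
            last_error = e
            if attempt < max_retries - 1:
                time.sleep(delay)
    raise SSHError(f"Can't connect to the remote host: {last_error}") from last_error
```

With this approach the user would see the generic message followed by the concrete reason (for example, the `kex_exchange_identification` text from the traceback above), and the chained exception keeps the original traceback available for debugging.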

peterschmidt85 commented 1 week ago

This issue is stale because it has been open for 30 days with no activity.