dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

Unclear error message `Gateway is not working` if DNS is misconfigured #1058

Closed jvstme closed 4 months ago

jvstme commented 7 months ago

Steps to reproduce

  1. Start dstack server with ZeroSSL configured as the CA for dstack-gateway. See this comment.
  2. Create a gateway
    dstack gateway create --domain $DOMAIN --region eu-central-1 --backend aws
  3. Set a DNS A record for *.$DOMAIN, but point it to the IP address of some other machine that is down instead of the gateway's IP address, as if you had redeployed the gateway but forgotten to update the DNS record.
  4. Try running any service with dstack

    > cat drope.yml
    type: service
    
    commands:
      - pip install drope
      - drope
    port: 8000
    
    > dstack run . -f drope.yml 
    ... (redacted for brevity) ...
     Shown 3 of 761 offers, $49.159 max
    
    Continue? [y/n]: y

Expected behaviour

The CLI shows an error saying that dstack-gateway failed to issue a certificate for the service's domain and suggests that the user make sure the DNS A record for the domain points to the gateway's IP address.
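
A check along these lines could back such a suggestion. This is only a sketch, not dstack's code: `dns_points_to` and its `resolve` parameter are hypothetical names, and it assumes that resolving IPv4 addresses with the standard library's `socket.gethostbyname_ex` is sufficient.

```python
import socket


def dns_points_to(hostname, expected_ip, resolve=socket.gethostbyname_ex):
    """Return True if hostname resolves to expected_ip (IPv4 only).

    `resolve` is injectable (hypothetical parameter) so the check can be
    exercised without network access.
    """
    try:
        _name, _aliases, addresses = resolve(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError: the name does not resolve.
        return False
    return expected_ip in addresses
```

Because the record is a wildcard, the hostname passed in would be the service's own subdomain under `$DOMAIN`, and `expected_ip` the gateway's IP address.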

Actual behaviour

After 30 seconds the CLI shows an unclear error message.

Gateway is not working: 

The server logs don't have anything relevant.

dstack version

0.17.0

Server logs

No response

Additional information

What happens is:

This behavior depends on the CA. For example, with Let's Encrypt, certbot exits quickly and the error is passed to the dstack server and then to the CLI.

GatewayError: Certbot failed:
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Some challenges have failed.

I suggest we fix this by adding a timeout to certbot runs and passing a clear error message to the CLI if the timeout is reached.
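
The suggested fix could look roughly like the sketch below. It is an illustration under assumptions, not the gateway's actual code: `run_with_timeout` is a hypothetical helper, `GatewayError` is redefined locally as a stand-in for the error type seen in the log above, and `cmd` stands for whatever certbot invocation the gateway already uses.

```python
import subprocess
import sys


class GatewayError(Exception):
    """Local stand-in for the gateway's error type (assumption)."""


def run_with_timeout(cmd, domain, timeout):
    """Run cmd; raise GatewayError with a readable message on timeout or failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        raise GatewayError(
            f"Timed out after {timeout}s issuing a certificate for {domain}. "
            "Make sure the DNS A record points to the gateway's IP address."
        ) from None
    if proc.returncode != 0:
        # Pass the tool's own stderr along, as already happens with Let's Encrypt.
        raise GatewayError(f"Certbot failed:\n{proc.stderr.strip()}")
    return proc.stdout


if __name__ == "__main__":
    # A sleeping Python process stands in for a certbot run that never returns.
    try:
        run_with_timeout(
            [sys.executable, "-c", "import time; time.sleep(5)"],
            domain="service.example.com",
            timeout=1,
        )
    except GatewayError as e:
        print(e)
```

With a message like this, the CLI would have something concrete to show instead of an empty `Gateway is not working:`.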

peterschmidt85 commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

jvstme commented 5 months ago

Still relevant

peterschmidt85 commented 4 months ago

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.