cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

Curl hang on querying VM IP #218

Closed: eecsmap closed this issue 2 weeks ago

eecsmap commented 2 weeks ago

I have recently been working on a service that relies heavily on the /vms/{vmname}/ip API. A couple of times, however, I have run into the query simply hanging.

user@host-01 ~ % oc -v  https://orchard.mycompany.net/v1/vms/testvm/ip
* Host orchard.mycompany.net:443 was resolved.
* IPv6: (none)
* IPv4: 10.0.0.11
*   Trying 10.0.0.11:443...
* Connected to orchard.mycompany.net (10.0.0.11) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256 / [blank] / UNDEF
* ALPN: server did not agree on a protocol. Uses default.
* Server certificate:
*  subject: C=US; ST=California; L=Palo Alto; O=MyCompany Inc.; CN=*.mycompany.net
*  start date: Apr 10 00:00:00 2024 GMT
*  expire date: Apr 10 23:59:59 2025 GMT
*  subjectAltName: host "orchard.mycompany.net" matched cert's "*.mycompany.net"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS RSA SHA256 2020 CA1
*  SSL certificate verify ok.
* using HTTP/1.x
* Server auth using Basic with user 'orchard-root'
> GET /v1/vms/testvm/ip HTTP/1.1
> Host: orchard.mycompany.net
> Authorization: Basic xxxxxxxxxxxxxxxxxxx
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off

Then it hangs.

I have not observed this issue in my prod environment, but it happens frequently in my test environment. Any suggestions?
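While the cause is being investigated, a caller can at least avoid blocking forever by putting a deadline on the request. The following is a minimal Python sketch, not part of Orchard: the function name `fetch_vm_ip` and its parameters are hypothetical, and it returns the raw response body because the response format is not shown in this thread.

```python
import base64
import socket
import urllib.error
import urllib.request


def fetch_vm_ip(base_url, vm_name, user, password, timeout=10.0):
    """Query /v1/vms/{vm_name}/ip and return the raw response body.

    Returns None if the request does not complete within `timeout` seconds.
    The function name and parameters are illustrative assumptions.
    """
    url = f"{base_url}/v1/vms/{vm_name}/ip"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode()
    except socket.timeout:
        # The server accepted the request but never answered in time.
        return None
    except urllib.error.URLError as error:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(error.reason, socket.timeout):
            return None
        raise
```

A caller could then retry or report the VM as unreachable when the function returns None, instead of hanging like the curl invocation above. With curl itself, the `--max-time` option serves the same purpose.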

eecsmap commented 2 weeks ago

It seems to hang while waiting at https://github.com/cirruslabs/orchard/blob/main/internal/controller/api_vms_ip.go#L58

edigaryev commented 2 weeks ago

Hello!

  1. Are these hangs reproducible for a short period of time after a single hang occurs, or are they completely sporadic, with only a single one occurring per minute/hour/day?

  2. Are you using --net-bridged? Using --net-bridged causes Orchard to pass --resolver=arp to the tart ip command, which in turn relies on the VM being active on the local network to populate the host's ARP table.

    Most Linux VMs don't do that out of the box, and the same sometimes happens with long-running macOS VMs.

  3. The next time you observe this, could you please verify that tart ip orchard-<VM-name>-<UUID> (or tart ip --resolver=arp orchard-<VM-name>-<UUID> when using --net-bridged) works on the node with your VM at the time this happens?
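The check in item 3 can be scripted so that it runs on the node with a deadline. This is a rough Python sketch under stated assumptions: the helper names `tart_ip_command` and `probe_vm_ip` are hypothetical, the VM name passed in is the full local name (orchard-<VM-name>-<UUID>), and a non-zero exit code or empty output is treated as "no IP found".

```python
import subprocess


def tart_ip_command(local_vm_name, bridged=False):
    """Build the tart ip command line; add --resolver=arp for --net-bridged VMs."""
    command = ["tart", "ip"]
    if bridged:
        command.append("--resolver=arp")
    command.append(local_vm_name)
    return command


def probe_vm_ip(local_vm_name, bridged=False, timeout=15.0, run=subprocess.run):
    """Return the IP printed by tart ip, or None if it fails or times out.

    `run` is injectable so the function can be exercised without tart installed.
    """
    try:
        result = run(
            tart_ip_command(local_vm_name, bridged),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        # Assumption: tart signals "no IP address found" with a non-zero exit.
        return None
    return result.stdout.strip() or None
```

Running `probe_vm_ip(name)` and `probe_vm_ip(name, bridged=True)` at the moment the API call hangs shows whether either resolver can see the VM from the node.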

eecsmap commented 2 weeks ago

When I run tart ip <vmname>, I get "no IP address found". When I run tart ip <vmname> --resolver=arp, I also get "no IP address found".

I double-checked with the DHCP server, and it did allocate an IP for the VM's MAC address. I can SSH into the VM using this IP.

eecsmap commented 2 weeks ago

I think I got the problem fixed. Thanks!