cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

Healthy worker resulting in "no worker registered with this name" error on the controller #126

Closed ruimarinho closed 4 months ago

ruimarinho commented 12 months ago

Hi @fkorotkov,

Under certain circumstances, which I've been unable to replicate so far, the controller is receiving the PUT to update a worker's status, but VNC fails to work.

…
[GIN] 2023/09/12 - 14:08:30 | 200 |    1.201917ms |   10.128.156.54 | PUT      "/v1/workers/mac-1"
…
❯ orchard list workers
Name        Last seen       Scheduling paused
mac-1       4 seconds ago   false
❯ orchard list vms
Name        Created         Image                                           Status  Restart policy
ventura     26 minutes ago  ghcr.io/cirruslabs/macos-ventura-xcode:14.3.1   running Never (0 restarts)
❯ orchard vnc vm ventura
forwarding 127.0.0.1:62698 -> 6160344636:5900...
no credentials specified or found, trying default admin:admin credentials...
opening vnc://admin@127.0.0.1:62698...
failed to forward port: failed to WebSocket dial: expected handshake response status code 101 but got 503

On the controller, here's the error log:

{"level":"warn","ts":1694527777.3468726,"caller":"controller/api_vms_portforward.go:66","msg":"failed to request port-forwarding from the worker mac-1: no worker registered with this name: mac-1"}
edigaryev commented 12 months ago

I think this is similar to https://github.com/cirruslabs/orchard/issues/118: a worker should keep a persistent gRPC connection with the Orchard Controller (and re-try if it fails), which is not the case for some reason.

Are there any warnings or errors in the worker's log (e.g. failed to watch RPC: [...])?

ruimarinho commented 12 months ago

Hmm, yep. Just checked, and occasionally there are some hiccups, probably due to the VPN connection of the physical machines:

failed to watch RPC: rpc error: code = Unavailable desc = error reading from server: EOF

The worker still reports as available and is able to pick up work, so it somehow is able to re-register, just not cleanly.

ruimarinho commented 11 months ago

Since this is a long-lived gRPC connection, I don't think I will be able to work around the potential connection resets on this uplink. Can you reopen/retry the gRPC connection if it drops for some reason?

edigaryev commented 11 months ago

Please consider upgrading to the new 0.13.0 release and running the controller with a --debug flag.

With this flag enabled, the controller will emit a new log line each time a worker connecting over gRPC is registered or de-registered.

A worker that is not registered cannot be used for port-forwarding (including VNC and SSH commands).

It also makes sense to upgrade the worker, but it doesn't need the --debug flag enabled: it will provide more gRPC-related logging by default.

Hopefully this will help in diagnosing the root cause.

edigaryev commented 11 months ago

Also, is there a possibility that you have multiple workers with the same name connecting to the Orchard Controller?

Especially if this happens from the same machine.

If so, that might be the root cause.

ruimarinho commented 11 months ago

Please consider upgrading to the new 0.13.0 release and running the controller with a --debug flag.

Just saw it — will do. On both the server and the client. Is there a preferred upgrade path? First the controller and then the workers?

Also, is there a possibility that you have multiple workers with the same name connecting to the Orchard Controller?

Not possible. All hostnames are unique.

edigaryev commented 11 months ago

Is there a preferred upgrade path? First the controller and then the workers?

Currently it doesn't matter, as long as you don't start using the new features (for example, support for sharing files with the host system) before you've upgraded the whole cluster.

Not possible. All hostnames are unique.

OK, then the logs will tell.

dennispost commented 9 months ago

Hey @edigaryev

I'm getting these logs:

Controller:

[GIN] 2023/11/18 - 02:47:33 | 200 | 101.763µs | _my_worker_ip_ | GET "/v1/vms/test"
{"level":"warn","ts":1700275653.9226034,"caller":"controller/api_vms_portforward.go:76","msg":"failed to request port-forwarding from the worker _my_worker_name_.local: no worker registered with this name: _my_worker_name_.local"}
[GIN] 2023/11/18 - 02:47:33 | 503 | 149.564µs | worker_ip | GET "/v1/vms/test/port-forward?port=5900&wait=60"

Worker:

{"level":"info","ts":1700275651.204858,"caller":"worker/worker.go:199","msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1700275656.158527,"caller":"worker/rpc.go:24","msg":"connecting to my_controller_domain:443 over gRPC"}
{"level":"info","ts":1700275656.159204,"caller":"worker/rpc.go:36","msg":"gRPC connection established, starting gRPC stream with the controller"}
{"level":"info","ts":1700275656.1832829,"caller":"worker/rpc.go:47","msg":"running gRPC stream with the controller"}
{"level":"warn","ts":1700275656.193599,"caller":"worker/worker.go:114","msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"info","ts":1700275656.20684,"caller":"worker/worker.go:199","msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1700275661.201838,"caller":"worker/worker.go:199","msg":"syncing 1 local VMs against 1 remote VMs..."}

Client:

orchard ssh vm test
ssh command failed: failed to setup port-forwarding to the VM "test": failed to WebSocket dial: expected handshake response status code 101 but got 503
edigaryev commented 9 months ago
{"level":"warn","ts":1700275656.193599,"caller":"worker/worker.go:114","msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}

Assuming that you're using an ingress (https://github.com/cirruslabs/orchard/issues/148#issuecomment-1817298713), it looks like it was improperly configured to proxy the gRPC connections.

See https://github.com/cirruslabs/orchard/issues/102 (and the issues mentioned in the first post) for more details.
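For context, a reverse proxy in front of the controller has to speak HTTP/2 end-to-end for the gRPC stream to survive, otherwise the worker's watch call fails exactly as in the log above (a plain HTTP 400 with a `text/plain` body). A hypothetical nginx fragment illustrating this; the server name, upstream address, and port are placeholders, not taken from this thread or from Orchard's documentation:

```nginx
# Hypothetical ingress fragment: gRPC requires HTTP/2 on the listener
# and grpc_pass (not proxy_pass) towards the controller.
server {
    listen 443 ssl http2;              # http2 is mandatory for gRPC
    server_name my_controller_domain;  # placeholder

    location / {
        grpc_pass grpc://127.0.0.1:6120;  # controller address (placeholder)
        grpc_read_timeout 1h;             # keep the long-lived stream open
    }
}
```

An ingress that proxies only HTTP/1.1, or that times out idle upstream reads after a minute, will repeatedly break the worker's registration even though regular REST requests keep working.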

edigaryev commented 4 months ago

Closing because of no reply.