cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices

Certificate-less bootstrap tokens #93

Closed edigaryev closed 1 year ago

edigaryev commented 1 year ago

In #86, Orchard started to create certificate-less contexts for Controllers that use PKI-compatible certificates.

However, I overlooked the fact that we also need to add certificate-less support to the bootstrap tokens.


ruimarinho commented 1 year ago

Great work @edigaryev - the worker has now been able to re-register. I did a quick test and everything seems to be working so far, but there is a recurring message around a 400 error:

orchard@mac % sudo launchctl load -w /Library/LaunchDaemons/org.cirruslabs.orchard.worker.plist
orchardi@mac % tail -f /tmp/orchard-worker.log
{"level":"info","ts":1687469693.550679,"msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1687469698.553354,"msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1687469703.5502238,"msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"warn","ts":1687469707.148829,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"info","ts":1687469708.546686,"msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1687469713.624796,"msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1687469718.548105,"msg":"syncing 1 local VMs against 1 remote VMs..."}
{"level":"info","ts":1687469745.880018,"msg":"registered worker mac-M2GVQ20L75"}
{"level":"info","ts":1687469745.9966872,"msg":"syncing on-disk VMs..."}
{"level":"warn","ts":1687469746.322099,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"warn","ts":1687469746.995243,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"warn","ts":1687469747.868917,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"warn","ts":1687469749.248595,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"warn","ts":1687469751.3179488,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"warn","ts":1687469754.977362,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"info","ts":1687469755.83937,"msg":"syncing 1 local VMs against 0 remote VMs..."}
{"level":"info","ts":1687469756.161123,"msg":"syncing 1 local VMs against 0 remote VMs..."}
{"level":"info","ts":1687469760.6640599,"msg":"syncing 1 local VMs against 0 remote VMs..."}
{"level":"warn","ts":1687469761.893548,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"info","ts":1687469765.731574,"msg":"syncing 1 local VMs against 0 remote VMs..."}
{"level":"info","ts":1687469770.6702971,"msg":"syncing 1 local VMs against 0 remote VMs..."}
{"level":"warn","ts":1687469775.540767,"msg":"failed to watch RPC: rpc error: code = Internal desc = unexpected HTTP status code received from server: 400 (Bad Request); transport: received unexpected content-type \"text/plain; charset=utf-8\""}
{"level":"info","ts":1687469775.749318,"msg":"syncing 1 local VMs against 0 remote VMs..."}
{"level":"info","ts":1687469780.666985,"msg":"syncing 1 local VMs against 0 remote VMs..."}

I've already deleted all VMs and restarted orchard. Any idea what could be causing this behaviour?
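
To gauge how often the watch failure recurs, a log like the one above can be tallied by level and message. This is a minimal sketch, assuming the worker log is JSON lines with `level`, `ts` and `msg` fields as shown; the function name `summarize_worker_log` and the shortened sample messages are illustrative and not part of Orchard:

```python
import json
from collections import Counter


def summarize_worker_log(lines):
    """Count entries per (level, short message) and collect timestamps of failed watch RPCs."""
    counts = Counter()
    watch_failures = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip shell prompts and blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        msg = entry.get("msg", "")
        # Keep only the text before the first colon so identical failures group together.
        key = (entry.get("level", "?"), msg.split(":", 1)[0])
        counts[key] += 1
        if msg.startswith("failed to watch RPC"):
            watch_failures.append(entry.get("ts"))
    return counts, watch_failures


# Sample lines modelled on the log above (error text shortened for readability).
sample = [
    '{"level":"info","ts":1687469745.880018,"msg":"registered worker mac-M2GVQ20L75"}',
    '{"level":"warn","ts":1687469746.322099,"msg":"failed to watch RPC: rpc error: code = Internal"}',
    '{"level":"warn","ts":1687469746.995243,"msg":"failed to watch RPC: rpc error: code = Internal"}',
]
counts, failures = summarize_worker_log(sample)
print(counts[("warn", "failed to watch RPC")], len(failures))
```

Feeding it the lines of `/tmp/orchard-worker.log` gives a quick count of failures and their timestamps, which makes it easier to see whether the 400 responses cluster around worker registration.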

ruimarinho commented 1 year ago

Also having issues with vnc and ssh:

forwarding -> ventura-xcode-new:5900...
no credentials specified or found, trying default admin:admin credentials...opening vnc://admin@
failed to forward port: websocket.Dial wss://orchard.example.internal:443/v1/vms/ventura-xcode/port-forward?port=5900&wait=60: bad status
^C2023/06/22 22:31:15 context canceled

edigaryev commented 1 year ago

@ruimarinho can you check if the following ingress configuration works for you:

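apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orchard-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orchard
                port:
                  number: 6120
  ingressClassName: nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orchard-ingress-grpc
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
spec:
  rules:
    - http:
        paths:
          - path: /Controller
            pathType: Prefix
            backend:
              service:
                name: orchard
                port:
                  number: 6120
  ingressClassName: nginx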

It will almost certainly need to be adapted for your environment, but the main idea is that without the "GRPCS" treatment for the /Controller path, gRPC (which we use for port-forwarding) won't work.

I've tried this on a local Kubernetes cluster and port-forwarding/SSH seem to work just fine.
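
The routing idea behind the two Ingress objects can be sketched as a prefix lookup: requests under /Controller get the "GRPCS" backend protocol, everything else gets "HTTPS". This is only an illustration of Kubernetes `pathType: Prefix` matching (element by element, longest match preferred); the names `matches_prefix`, `backend_protocol` and `RULES` are hypothetical, and the locations ingress-nginx actually generates are more involved:

```python
def matches_prefix(rule_path: str, request_path: str) -> bool:
    """Prefix matching on whole path elements, so /Controller does not match /ControllerX."""
    rule = [part for part in rule_path.split("/") if part]
    request = [part for part in request_path.split("/") if part]
    return request[: len(rule)] == rule


# (path prefix, backend protocol) pairs mirroring the two Ingress objects above.
RULES = [("/", "HTTPS"), ("/Controller", "GRPCS")]


def backend_protocol(request_path: str) -> str:
    """Return the protocol of the longest matching prefix rule."""
    candidates = [(path, proto) for path, proto in RULES if matches_prefix(path, request_path)]
    # "/" matches every request, so candidates is never empty.
    return max(candidates, key=lambda rule: len(rule[0]))[1]


print(backend_protocol("/Controller/Watch"))
print(backend_protocol("/v1/vms/ventura-xcode/port-forward"))
```

Under these assumptions the worker's `POST /Controller/Watch` request is proxied as gRPC over TLS, while the REST and WebSocket endpoints under /v1 stay on plain HTTPS.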

ruimarinho commented 1 year ago

@edigaryev I've tested your suggestion but I'm getting a 504 timeout:

2023/06/27 12:30:38 [error] 2282#2282: *83097928 upstream timed out (110: Operation timed out) while reading response header from upstream, client:, server: orchard.example.internal, request: "POST /Controller/Watch HTTP/2.0", upstream: "grpcs://", host: "orchard.example.internal:443"

I'm using 443 for the PORT environment variable, but I've also tested forwarding /Controller to 6120, just in case the gRPC server was listening on a different port (not that the code suggests this...), and then I got a connection refused.

Theoretically, it's being forwarded correctly because nginx is complaining about a grpcs:// upstream - now I just need to figure out why it is timing out. The ingress is behind an AWS NLB.

If you have any suspicion, let me know, otherwise I'll keep digging. Thanks!
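
One thing that can be checked independently of nginx is whether a TLS endpoint agrees to speak HTTP/2 at all, since gRPC requires it. The following is a hedged sketch using only the Python standard library; the helper names `build_alpn_context` and `negotiated_protocol` are made up for illustration, and disabling verification is an assumption that only makes sense for test setups with self-signed certificates:

```python
import socket
import ssl


def build_alpn_context(protocols=("h2", "http/1.1"), verify=True):
    """Create a client TLS context that offers the given ALPN protocols."""
    ctx = ssl.create_default_context()
    if not verify:
        # Only for test setups with self-signed certificates.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    ctx.set_alpn_protocols(list(protocols))
    return ctx


def negotiated_protocol(host, port=443, verify=True, timeout=5.0):
    """Connect to host:port and return the ALPN protocol the server selected (or None)."""
    ctx = build_alpn_context(verify=verify)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling `negotiated_protocol("orchard.example.internal")` against the load balancer, and again against the controller's own port, would show whether each hop selects `h2`. A result of `http/1.1` or `None` on the upstream side would point to an HTTP/2 negotiation problem rather than a timeout setting.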

ruimarinho commented 1 year ago

@edigaryev after testing with a few more settings (grpc_connect_timeout, grpc_read_timeout, grpc_send_timeout), the best outcome I've come across is getting a 499 status code instead of a 504 (gateway timeout). It seems like occasionally I was able to get a 502 too:

ingress-nginx-controller-6c48cbfb6f-2czfc controller 2023/06/28 11:51:03 [error] 1253#1253: *1194409 no connection data found for keepalive http2 connection while sending request to upstream, client:, server: orchard.example.internal, request: "POST /Controller/Watch HTTP/2.0", upstream: "grpcs://", host: "orchard.example.internal:443"

After some investigation, it seems like nginx has an issue multiplexing HTTP/1.1 and gRPC, although I'm not entirely sure it's related to that here.

My suggestion would be to add a flag -- even in a test build -- to run the gRPC server on a different port to see if that helps. There is nothing in the controller logs related to POST /Controller/Watch.

Any other ideas you may have?

Below is the nginx configuration block generated for /Controller:

```
location = /Controller {
    set $namespace "orchard";
    set $ingress_name "controller-ingress-grpc";
    set $service_name "controller-lb";
    set $service_port "https";
    set $location_path "/Controller";
    set $global_rate_limit_exceeding n;

    rewrite_by_lua_block {
        lua_ingress.rewrite({
            force_ssl_redirect = true,
            ssl_redirect = true,
            force_no_ssl_redirect = false,
            preserve_trailing_slash = false,
            use_port_in_redirects = false,
            global_throttle = { namespace = "", limit = 0, window_size = 0, key = { }, ignored_cidrs = { } },
        })
        balancer.rewrite()
    }

    # be careful with `access_by_lua_block` and `satisfy any` directives as satisfy any
    # will always succeed when there's `access_by_lua_block` that does not have any lua code doing `ngx.exit(ngx.DECLINED)`
    # other authentication method such as basic auth or external auth useless - all requests will be allowed.
    #access_by_lua_block {
    #}

    header_filter_by_lua_block {
        lua_ingress.header()
    }

    body_filter_by_lua_block {
    }

    log_by_lua_block {
        balancer.log()
    }

    port_in_redirect off;

    set $balancer_ewma_score -1;
    set $proxy_upstream_name "orchard-controller-lb-https";
    set $proxy_host $proxy_upstream_name;
    set $pass_access_scheme $scheme;
    set $pass_server_port $server_port;
    set $best_http_host $http_host;
    set $pass_port $pass_server_port;
    set $proxy_alternative_upstream_name "";

    client_max_body_size 1m;

    # Pass the extracted client certificate to the backend

    # Allow websocket connections
    grpc_set_header Upgrade $http_upgrade;
    grpc_set_header Connection $connection_upgrade;
    grpc_set_header X-Request-ID $req_id;
    grpc_set_header X-Real-IP $remote_addr;
    grpc_set_header X-Forwarded-For $remote_addr;
    grpc_set_header X-Forwarded-Host $best_http_host;
    grpc_set_header X-Forwarded-Port $pass_port;
    grpc_set_header X-Forwarded-Proto $pass_access_scheme;
    grpc_set_header X-Forwarded-Scheme $pass_access_scheme;
    grpc_set_header X-Scheme $pass_access_scheme;

    # Pass the original X-Forwarded-For
    grpc_set_header X-Original-Forwarded-For $http_x_forwarded_for;

    # mitigate HTTPoxy Vulnerability
    #
    grpc_set_header Proxy "";

    # Custom headers to proxied server

    proxy_connect_timeout 180s;
    proxy_send_timeout 60s;
    proxy_read_timeout 180s;

    proxy_buffering off;
    proxy_buffer_size 16k;
    proxy_buffers 4 16k;

    proxy_max_temp_file_size 1024m;

    proxy_request_buffering on;
    proxy_http_version 1.1;

    proxy_cookie_domain off;
    proxy_cookie_path off;

    # In case of errors try the next upstream server before returning an error
    proxy_next_upstream error timeout;
    proxy_next_upstream_timeout 0;
    proxy_next_upstream_tries 3;

    grpc_pass grpcs://upstream_balancer;

    proxy_redirect off;
}
```