cirruslabs / orchard

Orchestrator for running Tart Virtual Machines on a cluster of Apple Silicon devices
Other
197 stars 17 forks source link

TLS architecture and Kubernetes thoughts #86

Closed ruimarinho closed 1 year ago

ruimarinho commented 1 year ago

Hi,

First of all, thanks for open sourcing this project! I'd like to provide some feedback on the current implementation of TLS to understand if there is a better path going forward in terms of how it is architected today.

As you may already know, on a Kubernetes deployment, the ingress typically performs the TLS termination and then proxies traffic to the service, which could either by to a HTTP or HTTPS backend.

The first problem is that TLSv1.3 is being required exclusively, requiring changes to the typical nginx ingress controller configuration (the annotation won't work due to other requirements):

    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/server-snippet: |
      proxy_ssl_protocols TLSv1.2 TLSv1.3;

With this change, the nginx controller is able to reach the HTTPS backend with the self-signed certificate created by orchard.

Recommendation: TLSv1.2 should also be supported to avoid custom server snippets.

The approach taken by the team to have orchard server generate a self-signed certificate is fine, but unfortunately the orchard client is forcing validation of a single certificate. This makes it very difficult - if not impossible - to have the kubernetes nginx controller become responsible for the orchard client TLS termination, since a proper setup will likely include a chain of the Root CA and/or Intermediate CA(s). In my case, the edge TLS certificate (e.g. orchard-controller.example.internal) and an Intermediate CA are being bundled by the nginx controller, breaking the orchard client.

Recommendation: validate the presented certificate or certificated chain against locally-trusted Root CAs.

The client has an hard-coded SNI check for orchard-controller.. This makes it incompatible with custom internal domains where, for example, the aforementioned nginx controller is responsible for handling TLS. It also makes it impossible to implement a workaround using nginx's HTTPS passthrough (which isn't ideal anyway).

Recommendation: do not forcefully use a hard-coded SNI and instead use whatever hostname is passed via orchard context create [...] https://<hostname>:<port>.

Although the previous limitations already make it almost impossible to use custom certificates, there is also another bug which prevents a custom certificate private key to be loaded. Both parameters point to the public key, even if the private is passed in.

Please note that TLS config does not seem to be consistently used across all calls, which makes the behaviour differ depending on the type of call orchard command being executed.

Lastly, for others attempting to run this on Kubernetes, the best option so far is to publish a LoadBalancer service type directly pointing to the orchard controller. The only downside is the constant spam from cloud providers' use of TLSv1.2 for health checks on the origin, which orchard rejects.

edigaryev commented 1 year ago

Hello Rui! Thanks for such a detailed and an actionable report 🙌

Most of the issues you've listed should be addressed by the upcoming PR https://github.com/cirruslabs/orchard/pull/90, expect for the SNI, which is something we need to think about on how to implement properly to support cases where the FQDN (or an IP-address) of an Orchard Controller is not evident/static at the time of creation of the certificate.

edigaryev commented 1 year ago

@ruimarinho please check out the new 0.8.0 release, it includes the fixes for the issues you've mentioned.

ruimarinho commented 1 year ago

Yep, just saw it! Thank you! I’ll provide feedback next week as I’m away this week.

ruimarinho commented 1 year ago

@edigaryev just tried 0.8.0 and I'm still getting the same issue, with and without --no-pki:

❯ orchard context create --name example-0-8-0 --service-account-name bootstrap-admin --service-account-token <token> https://orchard.example.internal:443

❯ orchard list vms
2023/06/21 20:21:32 API client failed to make a request: Get "https://orchard.example.internal:443/v1/vms": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not orchard-controller

With --no-pki:

❯ orchard context create --name example-0-8-0-no-pki --service-account-name bootstrap-admin --service-account-token <token> https://orchard.example.internal:443 --no-pki
The authencity of controller orchard.example.internal:443 cannot be established.
Certificate SHA-256 fingerprint is 2B FD [...].
Are you sure you want to establish trust to this certificate? (yes/no) yes
2023/06/21 20:21:47 API client failed to make a request: Get "https://orchard.example.internal:443/v1//": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not orchard-controller

Any idea what I could be doing wrong?

ruimarinho commented 1 year ago

Just figured out that orchard context create does not automatically switch contexts, so the first command (which had returned successfully) actually works if switching to that context. 👌

For future reference, that's simply orchard context default <new-context>.

ruimarinho commented 1 year ago

@edigaryev it seems like worker.go doesn't perform SNI parsing from the host.

/opt/homebrew/bin/orchard worker run https://orchard.example.internal:443

{"level":"warn","ts":1687429163.497738,"caller":"worker/worker.go:109","msg":"failed to register worker: API client failed to make a request: Post \"https://orchard.example.internal:443/v1/workers\": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not orchard-controller"}
2023/06/22 11:19:23 failed to register worker on the controller

Could you please have a look?

ruimarinho commented 1 year ago

Just to add more debugging information, it looks like the bootstrap-token command also breaks when no certificate is presented:

❯ orchard get bootstrap-token ci
2023/06/22 11:40:24 failed to create bootstrap token: empty certificate

Since the bootstrap-token contains the certificate as well, I've tried removing it from the token itself (i.e. calling /opt/homebrew/bin/orchard worker run https://orchard.example.internal:443 --bootstrap-token <token> where token does not contain the certificate component).

This yields:

2023/06/22 11:41:30 invalid bootstrap token format: missing certificate
edigaryev commented 1 year ago

Hello @ruimarinho 👋

Thanks for reporting this! I've totally overlooked the fact that we also need to to add the certificate-less support for bootstrap tokens.

Should be fixed as a part of https://github.com/cirruslabs/orchard/pull/93.