[discuss] Improving default TLS validation behavior for upstream connections

fuhry commented 2 years ago

Title: Decide whether to improve the default peer validation behavior in the tls transport_socket extension, at least for UpstreamTlsContext.

Description: It is known and documented that Envoy's tls transport socket extension does not verify the peer at all by default. There are numerous reasons why a technical decision like this might be justified:

Envoy started out as a mesh proxy for internal services, and for an internal listener might use different values for the IP/hostname used to establish the socket, TLS SNI, and expected certificate peer name (CN or SAN). The current CertificateValidationContext is clearly aimed at this use case, and provides a great deal of flexibility for this type of deployment.
The part of the tls transport_socket extension that establishes the connection (ClientContextImpl::newSsl) is distinct from the part that performs the validation (cert_validator extension), and while it does support SNI, assuming that the SNI value should always match the peer certificate CN/SAN limits flexibility for the use case above. It looks like only the certificate and configuration options are exposed to the validator extension. I haven't dug deeply into Network::TransportSocketOptions, but it doesn't look connection-specific, which means the interface would need to be changed to pass connection metadata (SNI value, peer socket address) to the verify callback.
Reading and using the system's default trusted root certificate store is difficult and annoying to do in a cross-platform way, and not useful in the primary mesh/microservice proxy case when just about everyone is going to be using their own internal mTLS with a CA certificate that gets pushed through SDS or deployed alongside the envoy binary. Without a way to reliably detect and use the system's trusted root store, we really can't turn on peer verification by default.
Even if BoringSSL supports the above, it might require us to use BoringSSL's built-in verification functionality, while Envoy currently only supports verification using its own cert_validator extensions.

But a number of things have changed recently:

Envoy has gained the ability to also be an outbound proxy with the introduction of the dynamic_forward_proxy (DFP) family of extensions. If the DFP cluster extension is used in a way that makes Envoy responsible for peer validation, it's important that this validation be robust when you're talking to internet hosts!
- NOTE: It looks like dynamic_forward_proxy actually independently verifies the peer SAN before selecting a connection out of the pool, and this was important enough to be verified with a functional test. This is really good actually, because it means dynamic_forward_proxy can be used in a secure manner without requiring clients to use a CONNECT tunnel, as long as the operator remembers to configure trusted_ca or watched_directory.
In December 2021, BoringSSL gained the ability to verify the peer name automatically upon connection.
- Caveat: SSL_set1_host needs to be called before the handshake, which means the validation extension hasn't been called yet, and there's no provision in the current interface to hook connection setup.
Envoy has started to take the stance of rejecting blatantly insecure TLS configurations, yet the default behavior of not verifying at all is silently allowed without even a warning.
More widespread use, both in terms of number of users and versatility, is in and of itself an argument for more secure defaults.

And of course there's the contradiction that when CertificateValidationContext.trust_chain_verification is set to VERIFY_TRUST_CHAIN but the trusted_ca and watched_directory options are not also configured, no peer verification is actually done - and worse, this is the default. This configuration should be a fatal error, but currently it's silently accepted without even so much as a warning. It's really easy to trip up here and ship an insecure configuration if you miss that sentence in the docs!
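To make the proposed "fatal error" concrete, here is a minimal sketch in Python rather than Envoy's C++. It assumes the validation context has already been parsed into a plain mapping keyed by the field names mentioned above; `check_validation_context` and `InsecureTlsConfigError` are hypothetical names, not Envoy APIs.

```python
from typing import Any, Mapping

VERIFY_TRUST_CHAIN = "VERIFY_TRUST_CHAIN"
ACCEPT_UNTRUSTED = "ACCEPT_UNTRUSTED"


class InsecureTlsConfigError(ValueError):
    """Hypothetical error type used only in this sketch."""


def check_validation_context(ctx: Mapping[str, Any]) -> None:
    """Raise if trust chain verification is requested but no roots are configured."""
    # An unset field falls back to VERIFY_TRUST_CHAIN, the default described above.
    mode = ctx.get("trust_chain_verification", VERIFY_TRUST_CHAIN)
    has_roots = bool(ctx.get("trusted_ca")) or bool(ctx.get("watched_directory"))
    if mode == VERIFY_TRUST_CHAIN and not has_roots:
        raise InsecureTlsConfigError(
            "trust_chain_verification is VERIFY_TRUST_CHAIN but neither "
            "trusted_ca nor watched_directory is set; no peer verification "
            "would be performed"
        )


# Accepted: roots are configured.
check_validation_context({"trusted_ca": {"filename": "/path/to/ca.pem"}})
# Accepted: the operator explicitly opted out of verification.
check_validation_context({"trust_chain_verification": ACCEPT_UNTRUSTED})
```

Under this sketch an empty context, which is today's silently accepted default, would be the case that raises.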

As a user, I would assume that if the default value of trust_chain_verification is VERIFY_TRUST_CHAIN, then an unconfigured trusted_ca would use the system trust store, and an unconfigured match_typed_subject_alt_names would check the peer SAN against the SNI value or hostname/IP of the peer according to the underlying socket. If this is not going to be the case, the docs should have a big warning box advising users to configure some sort of peer verification, and Envoy should reject configurations with trust chain verification enabled but no trusted roots configured.

So the central question is:

Should Envoy adopt reasonable defaults for upstream TLS transport sockets?
- "Reasonable defaults" would be defined as: use system trusted root CA store; use peer socket address and/or SNI value to validate peer name; verify expiration dates; support checking for revocation using OCSP or CRLs (??? maybe this isn't feasible to do by default, due to airgapped deployments, need to configure proxies to fetch CRLs or talk to OCSP, etc.)
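For comparison only, this is roughly what those defaults look like in another TLS client stack. The sketch uses Python's standard `ssl` module as an analogy; it is not Envoy code, and the hostname is a placeholder.

```python
import ssl

# Analogy only: the default client context in Python's standard library.
ctx = ssl.create_default_context()  # loads the platform's default CA certificates

# Chain verification and peer name checking are on without extra configuration.
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname is True

# Revocation checking is not on by default, in line with the doubt raised above.
assert not (ctx.verify_flags & ssl.VERIFY_CRL_CHECK_LEAF)

# The expected peer name is supplied when the TLS object is created, which is
# before the handshake runs (compare the SSL_set1_host caveat above).
incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
tls = ctx.wrap_bio(incoming, outgoing, server_hostname="upstream.example.com")
print(tls.server_hostname)
```

Expiration dates are covered by chain verification, so they need no separate switch in this model.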

Side questions:

Decide migration behavior, warning/deprecation periods, etc.
Is there a way to just get the system-wide trusted root cert list from BoringSSL and still use our own validator?
What should we do if loading the system trusted root list in a cross-platform way isn't feasible, or if that process fails? What then does the default behavior become when trusted_ca and trust_chain_verification are both unset? If this fails with a fatal error, how concerned are we that this would constitute a regression for existing configurations? Should we require trust_chain_verification=ACCEPT_UNTRUSTED to be explicitly set in order to skip peer verification?

Separate issues will be created to track technical work once we arrive at a decision here.

Relevant Links:

https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/transport_sockets/tls/v3/common.proto

ggreenway commented 2 years ago

There are several things to track here.

Things we should do immediately:

[ ] Prominently document that if trusted_ca isn't set, no verification will be performed. Will need to clarify that this matters for downstream mTLS and upstream TLS, but not for downstream non-mutual TLS.

Things that make obvious sense to add:

[ ] Option to validate upstream SANs based on the connection's SNI value. My guess is that it will be easier and more consistent to add this capability to Envoy's existing cert validation, rather than put part of cert validation into BoringSSL, but this assumption needs to be investigated/validated.
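As a rough illustration of what SNI-based SAN validation involves, here is a simplified sketch in Python. It is not Envoy's matcher and covers only DNS names: exact matches plus a single leftmost wildcard label. The function names are hypothetical, and a production matcher would need further rules (IP SANs, internationalized names, restrictions on wildcards over public suffixes).

```python
from typing import Iterable


def dns_name_matches(pattern: str, hostname: str) -> bool:
    """Simplified DNS name match: exact, or one leftmost '*' label."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern == hostname:
        return True
    if not pattern.startswith("*."):
        return False
    suffix = pattern[1:]  # e.g. ".example.com"
    if not hostname.endswith(suffix):
        return False
    # The wildcard may stand in for exactly one non-empty label.
    label = hostname[: -len(suffix)]
    return bool(label) and "." not in label


def sni_matches_any_san(sni: str, dns_sans: Iterable[str]) -> bool:
    """True if the SNI value sent on the connection matches any DNS SAN."""
    return any(dns_name_matches(san, sni) for san in dns_sans)


print(sni_matches_any_san("api.example.com", ["*.example.com"]))
print(sni_matches_any_san("a.b.example.com", ["*.example.com"]))
```

The first call matches and the second does not, because the wildcard spans only one label.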

Things that need more discussion:

[ ] Adding code in Envoy to load the system trust store. From a brief search, it appears that across Linux distros, there's not a standard location for the system trust store. Anyone aware of a way to do this that will work on all of the common Linux distros? It may make sense to leave this as a control/configuration-plane task.
[ ] Changing the validation defaults. Reject configuration that doesn't opt-in to not validating peer certs.
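On the system trust store item, one common workaround is to probe a list of well-known bundle paths and use the first that exists. The Python sketch below shows the idea; the path list is an assumption drawn from commonly cited distro locations, is not exhaustive, and is not something Envoy ships. `find_system_ca_bundle` is a hypothetical helper.

```python
import os
from typing import Callable, Optional, Sequence

# Illustrative, non-exhaustive list of CA bundle locations (assumed, not verified
# against every release of each distro).
CANDIDATE_CA_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",                 # Debian, Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",                   # Fedora, older RHEL
    "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem",  # newer RHEL, CentOS
    "/etc/ssl/ca-bundle.pem",                             # openSUSE
    "/etc/ssl/cert.pem",                                  # Alpine
)


def find_system_ca_bundle(
    candidates: Sequence[str] = CANDIDATE_CA_BUNDLES,
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """Return the first candidate bundle that exists, or None if none do."""
    for path in candidates:
        if exists(path):
            return path
    return None


print(find_system_ca_bundle())
```

The `None` case is exactly the open question here: a caller still has to decide whether a missing bundle is a fatal error or a fallback to unverified connections.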

mattklein123 commented 2 years ago

I agree with @ggreenway's summary. In general I would like to move the defaults to be more secure, but it's difficult with compatibility guarantees.

> Changing the validation defaults. Reject configuration that doesn't opt-in to not validating peer certs.

One thing I think we could do here that is relatively low friction would be to WARN on any config which is loaded that does not do validation. Then at least the user would be aware (most users look at warnings). We could then have some CLI flag or something to squelch this warning if people really truly do want this.
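A minimal sketch of that warn-with-squelch idea, in Python for brevity. The `suppress` parameter stands in for the hypothetical CLI flag, and the check treats only trusted_ca and watched_directory as "validation", which is a simplification: it ignores other mechanisms such as pinning.

```python
import logging
from typing import Any, Mapping

log = logging.getLogger("tls_config_check")  # hypothetical logger name


def warn_if_unvalidated(
    cluster: str, ctx: Mapping[str, Any], suppress: bool = False
) -> bool:
    """Log a warning for a config that does no peer validation.

    Returns True if a warning was emitted.
    """
    validates = any(ctx.get(key) for key in ("trusted_ca", "watched_directory"))
    if validates or suppress:
        return False
    log.warning(
        "cluster %s: no trusted_ca or watched_directory configured; "
        "upstream peer certificates will not be verified",
        cluster,
    )
    return True


logging.basicConfig(level=logging.WARNING)
warn_if_unvalidated("example_cluster", {})                 # warns
warn_if_unvalidated("example_cluster", {}, suppress=True)  # silenced
```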

I know that @alyssawilk has also thought about this before, and when we last talked about this we concluded that even figuring out what "validation" means is tricky, as Envoy is capable of many different types of validation including pinning, etc.

alyssawilk commented 2 years ago

yeah I think this overlaps some with https://github.com/envoyproxy/envoy/issues/17771 which unfortunately has been on the back burner for some time

envoyproxy/envoy #21409