emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Ambassador does not properly handle web browser connection coalescing for HTTP/2 connections #2403

Open iNoahNothing opened 4 years ago

iNoahNothing commented 4 years ago

When multiple domains use the same certificate (e.g. the server has a certificate that is valid for domain.com and the subdomains a.domain.com and b.domain.com) and the server supports HTTP/2, the browser will reuse the same connection for requests to domain.com, a.domain.com, and b.domain.com. See this blog post for more info.

If you have created an individual virtual_host for each of these domains in Ambassador (via Host resources or TLSContexts) Ambassador will reuse the same virtual_host, but with a different host in the request and you will get a 404.

In more detail, if you create a TLSContext like the below:

---
apiVersion: getambassador.io/v2
kind: TLSContext
metadata:
  name: context
spec:
  alpn_protocols: h2,http/1.1
  hosts:
  - domain.com
  - a.domain.com
  - b.domain.com
  secret: ambassador-cert

You will get an Ambassador configured with a separate virtual_host for each of these hosts.

Now, when you send a request to https://a.domain.com/ambassador/v0/diag/ in a web browser, it opens a single HTTP/2 connection to Ambassador with :authority: a.domain.com. Ambassador then looks for a route in virtual_host: a.domain.com, finds the route to /ambassador/v0, and correctly sends the request to the diagnostics page.

Now if you change the URL to https://b.domain.com/ambassador/v0/diag/, the browser will reuse this same HTTP/2 connection to Ambassador but with :authority: b.domain.com. Ambassador, reusing the same connection to virtual_host: a.domain.com, looks for a route in virtual_host: a.domain.com, but since the :authority header does not match any route there, it returns a 404.
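The failure mode described above can be modeled in a few lines of Python. This is only a toy model, not Envoy's or Ambassador's actual code: `VIRTUAL_HOSTS`, `Connection`, and `route` are hypothetical names, and the model assumes, as the description above does, that the virtual_host is chosen once from the SNI at handshake time and then reused for every request on that connection.

```python
from dataclasses import dataclass

# Hypothetical routing table: one virtual_host per domain, each matching
# only its own :authority (mirrors the three hosts in the TLSContext).
VIRTUAL_HOSTS = {
    "domain.com": {"/ambassador/v0/": "diagnostics"},
    "a.domain.com": {"/ambassador/v0/": "diagnostics"},
    "b.domain.com": {"/ambassador/v0/": "diagnostics"},
}


@dataclass
class Connection:
    """A TLS connection whose virtual_host is fixed by the SNI at handshake."""
    sni: str

    def route(self, authority: str, path: str) -> int:
        # Assumption: lookups stay pinned to the virtual_host selected by SNI.
        routes = VIRTUAL_HOSTS.get(self.sni, {})
        if authority != self.sni:
            return 404  # :authority does not match this virtual_host
        for prefix in routes:
            if path.startswith(prefix):
                return 200
        return 404


conn = Connection(sni="a.domain.com")  # the browser opens one connection
first = conn.route("a.domain.com", "/ambassador/v0/diag/")
second = conn.route("b.domain.com", "/ambassador/v0/diag/")  # coalesced request
print(first, second)
```

In this model the second request only succeeds on a fresh connection whose SNI is b.domain.com, which is what the browser avoids opening when it coalesces.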

To Reproduce

Reproduction is pretty simple.

  1. Deploy Ambassador

  2. Get a certificate for *.domain.com

  3. Create a TLSContext that uses that certificate and sets

    • alpn_protocols: h2,http/1.1
    • hosts: [ a.domain.com, b.domain.com]
  4. Send a request to https://a.domain.com/ambassador/v0/diag/ in a browser and get the diag page

  5. Change the url to https://b.domain.com/ambassador/v0/diag/ and get a 404

Workaround

Since this issue revolves around how Ambassador is creating virtual_hosts and using the same certificate, a couple of possible workarounds exist that could be used until this is resolved.

  1. Create a different certificate and TLSContext for each domain

    ---
    apiVersion: getambassador.io/v2
    kind: TLSContext
    metadata:
     name: domain-context
    spec:
     alpn_protocols: h2,http/1.1
     hosts:
     - domain.com
     secret: domain-cert
    ---
    apiVersion: getambassador.io/v2
    kind: TLSContext
    metadata:
     name: a-domain-context
    spec:
     alpn_protocols: h2,http/1.1
     hosts:
     - a.domain.com
     secret: a-domain-cert
    ---
    apiVersion: getambassador.io/v2
    kind: TLSContext
    metadata:
     name: b-domain-context
    spec:
     alpn_protocols: h2,http/1.1
     hosts:
     - b.domain.com
     secret: b-domain-cert

    This will make it so the browser does not reuse the same connection for a.domain.com and b.domain.com, since it cannot use the same certificate for both.

  2. Use a wildcard in the TLSContext so that domain.com, a.domain.com, and b.domain.com use the same virtual_host

    ---
    apiVersion: getambassador.io/v2
    kind: TLSContext
    metadata:
     name: wild-context
    spec:
     alpn_protocols: h2,http/1.1
     hosts:
     - "*"
     secret: wild-cert

    Now, when the browser reuses the connection, Ambassador will use the same virtual_host, which will match all :authority values.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

markusjevringsesame commented 3 years ago

Is this an issue for other people as well? We have what we believe is the same issue, but switching to a wildcard (workaround 2) didn't work for us. We had to disable h2 for things to start working again, but ideally we'd like to re-enable h2 with a fix for this.

markusjevringsesame commented 3 years ago

~Another workaround that did work for us was to get rid of the TLSContext completely. The Host already specifies the certificate secret, and you can put configuration parameters like the alpn_protocols directly in the host. The TLSContext isn't actually needed (for most cases).~ I don't know why I said this. This doesn't actually work.

markusjevringsesame commented 3 years ago

According to @cindymullins-dw, this was fixed in 1.7 in August 2020, so this ticket can possibly be closed.

LukeShu commented 2 years ago

This should have been closed along with https://github.com/datawire/apro/issues/1167 (which is just a mirror of the issue) by the PR https://github.com/datawire/apro/pull/1716 (seen in Emissary as https://github.com/emissary-ingress/emissary/commit/df926097c872e09851525de7ffeaa6e8577670d0) (which did close that mirror issue), which was included in v1.7.0 on 2020-08-27.

Though https://github.com/datawire/apro/pull/1907 (seen in Emissary as https://github.com/emissary-ingress/emissary/commit/7835f087056f5ba589ec19c51ff274107988b907) (for inclusion in v1.7.4) reverted a bunch of changes to v2listener.py, it specifically did not revert the changes from datawire/apro#1716 because (as the commit message says) it "is a fix that EPO cares about."

LanceEa commented 1 year ago

@LukeShu - as we discussed offline, this fix didn't cover all the cases so I'm going to reopen this and outline what we discussed so we have a record of it.

Case 1: Coalesce wildcard sub-domains (i.e. a.example.com, b.example.com) - ✅

Case 2: Coalesce wildcard sub-domains with parent domain (i.e. a.example.com, b.example.com and example.com) - ⛔

The first case was resolved per the fix that you referenced which means we will coalesce all the wild-card subdomains into a single envoy Filter Chain that does SNI matching on *.example.com.

In the second case, when a TLS certificate has SAN names registered for both wild-card domains and parent domain then the browser will try to re-use the connection.

X509v3 Subject Alternative Name:
      DNS:*.example.com, DNS:example.com
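To make the browser's side of this concrete, here is a rough Python sketch of the coalescing decision. It is a simplification under stated assumptions, not any browser's actual code: `san_matches`, `cert_covers`, and `may_coalesce` are hypothetical helpers, and the model only checks the two conditions usually cited for HTTP/2 connection reuse (the new hostname resolves to the same IP, and the certificate presented on the open connection covers it).

```python
def san_matches(san: str, hostname: str) -> bool:
    """Return True if a single DNS SAN entry covers hostname."""
    san, hostname = san.lower(), hostname.lower()
    if san.startswith("*."):
        # A wildcard stands for exactly one left-most label, so
        # *.example.com covers a.example.com but not example.com.
        labels = hostname.split(".", 1)
        return len(labels) == 2 and labels[0] != "" and labels[1] == san[2:]
    return san == hostname


def cert_covers(sans: list[str], hostname: str) -> bool:
    """Return True if any SAN on the certificate covers hostname."""
    return any(san_matches(san, hostname) for san in sans)


def may_coalesce(sans: list[str], same_ip: bool, hostname: str) -> bool:
    """Simplified reuse check: same IP and the open connection's cert is valid."""
    return same_ip and cert_covers(sans, hostname)


shared_cert = ["*.example.com", "example.com"]   # the SAN list shown above
wildcard_only = ["*.example.com"]                # a cert without the parent domain

print(may_coalesce(shared_cert, True, "example.com"))
print(may_coalesce(wildcard_only, True, "example.com"))
print(may_coalesce(wildcard_only, True, "b.example.com"))
```

With the shared certificate the parent domain qualifies for reuse; with a wildcard-only certificate it does not, while sibling subdomains still do.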

We currently generate Envoy configuration so that we have two Filter Chains that do the L4 SNI matching: one for *.example.com and one for example.com. Navigating to a wild-card domain first will open a connection, and the TLS Handshake will use the *.example.com domain for SNI. The browser will re-use the open connection when navigating directly to the parent domain. Since SNI is negotiated at TLS Handshake time, Envoy will re-use the connection and look in the Filter Chain for `*.example.com`, and then when it tries to do the L7 matching on `:authority == example.com`, there is no route available, causing the 404 NR.

Chrome: net-internals shows the same connection being used for the wildcard and parent domain.

Workaround: A non-code solution is for the user to use two different TLS Certs: one that has a SAN for the wild-card domain and another one for the parent domain. By doing this, the browser will re-use the existing connection for all wild-card domains (i.e. a.example.com, b.example.com) but will open a new connection for requests to the parent domain (example.com), since they no longer share a cert and cannot re-use the connection.

Chrome: net-internals using different connections when using the workaround.


Potential Fix: Emissary will need to take into account the TLS Certs and the SANs registered within the cert along with the host matching to ensure that when the browser re-uses the connection that both the wild-card domains and parent domain can be matched in a single Filter Chain.
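One way to picture the potential fix is the Python sketch below. It is purely illustrative and is not Emissary's implementation: `build_filter_chains` and its input shapes are hypothetical, and the dictionary keys only loosely mirror the idea of SNI matching (`server_names`) and L7 host matching (`domains`). The assumption is that hosts sharing a certificate are grouped into one chain, so a coalesced request can still be matched on its :authority.

```python
from collections import defaultdict


def build_filter_chains(hosts: dict[str, str], cert_sans: dict[str, list[str]]) -> list[dict]:
    """Group hosts by certificate secret into one chain per certificate.

    hosts maps a configured hostname to the name of its TLS secret;
    cert_sans maps a secret name to the DNS SANs found in that certificate.
    """
    by_secret = defaultdict(list)
    for hostname, secret in hosts.items():
        by_secret[secret].append(hostname)

    chains = []
    for secret, hostnames in by_secret.items():
        chains.append({
            "secret": secret,
            # L4: match the handshake on every SAN the certificate carries.
            "server_names": sorted(cert_sans[secret]),
            # L7: every host using this certificate is routable in this chain.
            "domains": sorted(hostnames),
        })
    return chains


hosts = {"*.example.com": "shared-cert", "example.com": "shared-cert"}
cert_sans = {"shared-cert": ["*.example.com", "example.com"]}
chains = build_filter_chains(hosts, cert_sans)
print(chains)
```

Under this grouping, a connection opened for a.example.com and later reused for example.com stays within one chain that knows about both hosts. If the two hosts used different secrets, the function would produce two chains, matching the two-certificate workaround.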

FYI... @ddymko @haq204 @AliceProxy I think this is a good one to be aware of.

blakehawkins commented 1 year ago

I was also seeing these symptoms for probably the same reason, but my underlying issue and solution were a little more involved, and there were probably ultimately multiple issues.

I set up a second Certificate, Host, and TLSContext as described here, in order to serve subdomains on a different cert than an apex domain.

However, my second Certificate was not becoming Ready -- in particular, cert-manager wasn't producing a challenge because it failed to match the second certificate to any solver. It was unclear if that was due to misconfiguration of a Host/TLSContext/etc.

In my case that configuration was correct, but the underlying issue is that Let's Encrypt specifically doesn't support http01 challenges for wildcard domains.

Switching from http01 challenge solver to dns01 challenge solver allowed the challenge to be produced, which in turn made the second certificate become ready, and the issue went away.

A key thing to notice is that the cert-manager flow replaces the cert secrets only after a flow completes successfully, which means that if it doesn't complete successfully, existing config continues to be used, which led to some confusion here when I continued to see the apex cert being served on subdomains.

One possible improvement in emissary might be to refuse to serve the "apex" cert across domains in this case, since it's a known issue, unless the user opts in to reuse via a feature flag like allow_unsafe_ssl_cert_subdomain_reuse: true.

(Just an idea -- my case is completely resolved now. Thanks for your work on this!)