Closed sebas2day closed 2 years ago
Maybe this problem has been caused by the wrong TLS session management on the same IP address with my quick investigation. (not reached to the root cause of this problem. So we need further investigation.) https://github.com/envoyproxy/envoy/blob/351c0ca82e28e19750102cfc1beb5eca8c4f2542/source/extensions/transport_sockets/tls/context_impl.cc#L666-L669
In this case you should configure a different cluster for each of those services; a cluster is a collection of endpoints of the same logical service, and that doesn't seem to be the case here. What's the reason for putting those endpoints in the same cluster?
`service` is just used as an example here, just like this example looks all static. In reality we have a dynamic number of clusters for each application, so the cluster with `lb_endpoints` represents one application. We actually tried making a separate cluster for each possible value `service` can have, but this causes a different problem: because `service` can have around 3000 different values, you get 3000 clusters for a single application. The startup time of Envoy drastically slows down doing this, and the amount of allocated memory runs into many GBs. It works, but it's undesirable.

Using `lb_endpoints` seemed like a good solution, but I don't understand why it reuses the SNI of the previous call even though the host header is actually different.
I'm not sure this problem should be resolved at the code level. But in any case, this problem is caused by TLS session resumption on the same IP address, so it can be worked around by configuring `max_session_keys=0`, which disables TLS session resumption.
https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/transport_sockets/tls/v3/tls.proto#envoy-v3-api-msg-extensions-transport-sockets-tls-v3-upstreamtlscontext
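As a sketch, the workaround above would look something like this on the cluster's upstream TLS context (cluster and transport socket names here are placeholders, following the examples in this thread):

```yaml
# Workaround sketch: disable TLS session resumption on the upstream TLS
# context by setting max_session_keys to 0 (the default is 1).
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    max_session_keys: 0
    common_tls_context: {}
```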
`max_session_keys=0` indeed resolves the issue. I'm not sure if disabling TLS session resumption could cause performance issues for clients?
For now, I tested this locally on my machine and don't see any real difference. We probably need to verify this in deployed environments.
@Shikugawa this is not related to TLS session resumption at all. It is HTTP connection pool management.
> because `service` can have around 3000 different values, you get 3000 clusters for a single application.

If they are the same application, why would they return 421 even when the endpoint has the capacity to respond to the request?
> Using `lb_endpoints` seemed like a good solution, but I don't understand why it reuses the SNI of the previous call even though the host header is actually different.
HTTPS allows us to reuse connections as much as possible for efficiency. Even browsers send requests over the same HTTPS connection when the host header is different, as long as the hosts resolve to the same IP and the certificate matches. 421 indicates that the request should be retried (that's why 421 exists). See https://github.com/envoyproxy/envoy/issues/6767#issuecomment-488811660, which is the issue for the retry behavior.
@lizan As described in #6767, the unacceptable behavior is as follows: two origin servers have different certificates (in RFC 7540's example, one has *.example.com in the SAN field, the other has a.example.com). But in this case, the two origins share the same wildcard certificate. As far as I know, with HTTP/2 over TLS, connection reuse may occur if the request has the same IP address and hostname, and it is acceptable as long as the certificate presented by the origin is valid. In this case, all the conditions for connection reuse are satisfied, so I think the behavior here follows the HTTP/2 spec. This is why I considered this problem not to originate from HTTP connection management. I couldn't determine where the 421 came from, but according to the spec, it should be sent by the origin server.
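The reuse conditions described above can be sketched as a small check (hypothetical helper, with certificate matching simplified to wildcard SAN patterns):

```python
import fnmatch

def may_coalesce(conn_ip, new_ip, hostname, cert_sans):
    """Sketch of RFC 7540 §9.1.1 connection reuse: an existing HTTP/2
    connection may be reused for a new hostname if that hostname resolves
    to the same IP and the connection's certificate covers the hostname."""
    if conn_ip != new_ip:
        return False
    # Simplified wildcard SAN matching: "*.example.com" covers "a.example.com".
    return any(fnmatch.fnmatch(hostname, san) for san in cert_sans)

# Two origins behind one IP sharing a wildcard cert: reuse is permitted,
# so an origin that cannot serve the host must answer 421 (Misdirected Request).
print(may_coalesce("10.0.0.1", "10.0.0.1", "a.example.com", ["*.example.com"]))  # True
print(may_coalesce("10.0.0.1", "10.0.0.2", "a.example.com", ["*.example.com"]))  # False
```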
@sebas2day Back to your original issue, I think using the dynamic forward proxy might be the fastest way to resolve it without configuring all the clusters. That might be more suitable for your use case since it resolves DNS and treats every endpoint differently by hostname.
@lizan

> Back to your original issue, I think using the dynamic forward proxy might be the fastest way to resolve it without configuring all the clusters. That might be more suitable for your use case since it resolves DNS and treats every endpoint differently by hostname.
That's actually what we tried initially, but interestingly enough you'll get the exact same issue. That's when we started investigating different configurations to fix it, to no avail unfortunately.
Please have a look at the following config along with the setup I described in the initial post:
```yaml
node:
  id: envoy_example
  cluster: envoy_example
static_resources:
  listeners:
  - name: envoy_proxy
    address:
      socket_address:
        address: '0.0.0.0'
        port_value: 8080
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: test
          route_config:
            name: route_configuration
            virtual_hosts:
            - name: envoy_host
              domains: [ "*" ]
              routes:
              - name: some_route
                match:
                  prefix: "/"
                route:
                  cluster: "example_application"
                typed_per_filter_config:
                  envoy.filters.http.dynamic_forward_proxy:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.PerRouteConfig
                    host_rewrite_header: ':destination'
          http_filters:
          - name: envoy.filters.http.lua
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
              inline_code: |
                function envoy_on_request(request_handle)
                  local service = request_handle:headers():get("service")
                  request_handle:headers():replace(":destination", service .. ".example.com:8002")
                end
          - name: envoy.filters.http.dynamic_forward_proxy
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
              dns_cache_config:
                name: dynamic_forward_proxy_cache_config
                dns_lookup_family: V4_ONLY
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
              suppress_envoy_headers: true
  clusters:
  - name: "example_application"
    connect_timeout: 1s
    lb_policy: CLUSTER_PROVIDED
    cluster_type:
      name: envoy.clusters.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
        dns_cache_config:
          name: dynamic_forward_proxy_cache_config
          dns_lookup_family: V4_ONLY
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          validation_context:
            trusted_ca:
              filename: /etc/ssl/certs/ca-certificates.crt
            trust_chain_verification: ACCEPT_UNTRUSTED
```
@Shikugawa

> But in this case, two origins share same wildcard certificate. As far as I know, connection reuse may occur if the request has the same IP address and hostname in the case of HTTP/2 with TLS, and it is acceptable if a presented certificate from the origin is valid. In this case, all the conditions to the reuse connection are satisfied.

I'm curious why the resolved IP address (which ends up being the same) matters for both calls and results in different behavior. As a user, I ideally don't want to think about which endpoint lives on which host. I would expect that when I explicitly state endpoints with an explicit SNI, Envoy would not attempt to reuse the same connection but make a separate connection for each endpoint instead. Calls to the endpoint can then reuse their dedicated connection.
@sebas2day Let's go back to the first discussion. In the current configuration, the upstream connection is supposed to use HTTP/1.1 and not reuse the connection, so there is no problem there because each endpoint uses a different connection. I think this problem is caused by a problem in Envoy's certificate validation when reusing TLS sessions. Therefore, the problem can be solved by not reusing the session.
Sorry, I was mixing up connections with TLS sessions. I checked, but the issue occurs regardless of whether it's HTTP/2, and also when the endpoints have different certificates without a wildcard.
> I think this problem is caused by Envoy's problem with certificate validation when reusing TLS sessions. Therefore, the problem can be solved by not reusing the session.

To me, having TLS session resumption sounds like a good thing, but I think I want it per endpoint and not per host? Disabling it sounds like a workaround rather than a fix. Please correct me if I'm wrong, since my knowledge in this area is quite limited.
@PiotrSikora I don't know the details of this implementation, but I think the current implementation reuses the session ticket on the connection as long as it exists. Would it make sense to do SNI-based ticket selection here? https://github.com/envoyproxy/envoy/blob/v1.20.0/source/extensions/transport_sockets/tls/context_impl.cc#L648-L672
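The SNI-based selection proposed above amounts to keying the session cache by (IP, SNI) instead of IP alone, so a session saved for one SNI is never presented for another. A hypothetical sketch of the idea (not Envoy's actual code):

```python
class SessionCache:
    """Sketch: a TLS session cache keyed by (ip, sni) rather than ip only,
    so cross-SNI session resumption cannot occur."""

    def __init__(self):
        self._sessions = {}

    def store(self, ip, sni, session):
        # Index the saved session under both the peer IP and the SNI used.
        self._sessions[(ip, sni)] = session

    def lookup(self, ip, sni):
        # A session stored for a.example.com is not returned for
        # b.example.com, even though both resolve to the same IP.
        return self._sessions.get((ip, sni))

cache = SessionCache()
cache.store("10.0.0.1", "a.example.com", "ticket-A")
print(cache.lookup("10.0.0.1", "a.example.com"))  # ticket-A
print(cache.lookup("10.0.0.1", "b.example.com"))  # None
```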
@Shikugawa again, this is not related to TLS session tickets in any way. If the DNS resolves to the same IP address, Envoy will reuse the connection even if it is HTTP/1.1. That matches browsers' behavior as well.
Ok, I will investigate further. But in the current implementation we don't have any way to avoid this problem other than disabling session resumption, and that is just a workaround.
@lizan This is the trace result (both DFP and non-DFP show almost the same result): https://gist.github.com/Shikugawa/537f3df4fe9c58f20ebdcf94c1ca1952 From my investigation, the current config doesn't reuse the previous HTTP connection... You can find the actual logs there. As for what you said, SNI-based connection reuse should be implemented, but I also think that is not the root cause of this problem...
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Title: Incorrect SNI set for different endpoints that live on the same host

Description: We have Envoy proxying requests to endpoints using a request header. All proxied requests need to use TLS. Endpoints share the same certificate `*.example.com`. Our setup consists of a single cluster with `STRICT_DNS` having multiple `lb_endpoint`s, where they are selected using `subset_selectors` based on the request header. In order to get the correct SNI set, we can't use a hostname on the cluster endpoint because `auto_sni` is based on the downstream host header. This means we need to construct the host header using a Lua filter. Making a request sets the correct SNI for the endpoint, but when we make another request to a different endpoint (that lives on the same host) it somehow reuses the SNI of the initial request. Observing the logs, it does seem it's establishing new connections (so not reusing them?). This results in 421 HTTP responses in our setup.

Reproduction scenario:
Nginx
Hosts
Run
Last call will show the SNI of the previous call.
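For reference, `auto_sni` mentioned above is enabled via the cluster's upstream HTTP protocol options; a minimal sketch (cluster name taken from this thread's example config) of roughly what it looks like, and why it derives the SNI from the downstream host header:

```yaml
# Hedged sketch: auto_sni takes the SNI from the (possibly rewritten)
# downstream host header, which is why the setup in this issue had to
# construct the host header with a Lua filter first.
clusters:
- name: example_application
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      upstream_http_protocol_options:
        auto_sni: true
      explicit_http_config:
        http_protocol_options: {}
```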