ktalg opened this issue 1 month ago
cc @adisuissa
I don't have much experience with SDS, but I'm curious why we end up not reconnecting to the SDS cluster? What do the stats for the SDS cluster say and what is its configuration?
I conducted another test, configuring mTLS on the downstream transport_socket:
...
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      combined_validation_context:
        default_validation_context:
          allow_expired_certificate: true
          match_subject_alt_names:
          - exact: spiffe://server.example.com/istio-proxy
        validation_context_sds_secret_config:
          name: spiffe://server.example.com
          sds_config:
            api_config_source:
              api_type: GRPC
              grpc_services:
              - google_grpc:
                  stat_prefix: spire_agent
                  target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
              transport_api_version: V3
            initial_fetch_timeout: 120s
            resource_api_version: V3
      tls_certificate_sds_secret_configs:
      - name: spiffe://server.example.com/istio-proxy
        sds_config:
          api_config_source:
            api_type: GRPC
            grpc_services:
            - google_grpc:
                stat_prefix: spire_agent
                target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
            transport_api_version: V3
          initial_fetch_timeout: 120s
          resource_api_version: V3
      tls_params:
        ecdh_curves:
        - X25519:P-256:P-521:P-384
    require_client_certificate: true
...
From the monitoring data, the connection was actively terminated by the local side at the 18-minute mark, which is quite strange!
By the way, I replaced the SPIRE SDS proxy (previously Nginx) with Envoy to get more detailed access logs. Here are the Spire-Envoy logs:
{"authority":"spire-agent-proxy.spire.svc.cluster.local:9001","bytes_received":"32145","bytes_sent":"4430","connection_termination_details":"-","downstream_local_address":"10.11.129.239:9001","downstream_remote_address":"172.17.8.199:46906","duration":"1126536","method":"POST","path":"/envoy.service.secret.v3.SecretDiscoveryService/StreamSecrets","protocol":"HTTP/2","request_id":"00e14d3e-9962-47a0-952d-ecdc8c461f29","requested_server_name":"-","response_code":"200","response_code_details":"downstream_remote_disconnect","response_flags":"DC","route_name":"-","start_time":"2024-07-18T05:27:56.143Z","upstream_cluster":"local","upstream_host":"/run/spire/sockets/agent.sock","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"grpc-c++/1.39.0-dev grpc-c/17.0.0 (linux; chttp2)","x_forwarded_for":"-"}
{"authority":"spire-agent-proxy.spire.svc.cluster.local:9001","bytes_received":"48143","bytes_sent":"39330","connection_termination_details":"-","downstream_local_address":"10.11.129.239:9001","downstream_remote_address":"172.17.8.199:46906","duration":"1126538","method":"POST","path":"/envoy.service.secret.v3.SecretDiscoveryService/StreamSecrets","protocol":"HTTP/2","request_id":"8b89c61d-8883-4525-bdab-10cd5a15b3f1","requested_server_name":"-","response_code":"200","response_code_details":"downstream_remote_disconnect","response_flags":"DC","route_name":"-","start_time":"2024-07-18T05:27:56.140Z","upstream_cluster":"local","upstream_host":"/run/spire/sockets/agent.sock","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"grpc-c++/1.39.0-dev grpc-c/17.0.0 (linux; chttp2)","x_forwarded_for":"-"}
and Spire-Envoy's config:
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 9001 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          access_log:
          - name: envoy.access_loggers.file
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              log_format:
                text_format: |
                  {"authority":"%REQ(:AUTHORITY)%","bytes_received":"%BYTES_RECEIVED%","bytes_sent":"%BYTES_SENT%","connection_termination_details":"%CONNECTION_TERMINATION_DETAILS%","downstream_local_address":"%DOWNSTREAM_LOCAL_ADDRESS%","downstream_remote_address":"%DOWNSTREAM_REMOTE_ADDRESS%","duration":"%DURATION%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","protocol":"%PROTOCOL%","request_id":"%REQ(X-REQUEST-ID)%","requested_server_name":"%REQUESTED_SERVER_NAME%","response_code":"%RESPONSE_CODE%","response_code_details":"%RESPONSE_CODE_DETAILS%","response_flags":"%RESPONSE_FLAGS%","route_name":"%ROUTE_NAME%","start_time":"%START_TIME%","upstream_cluster":"%UPSTREAM_CLUSTER%","upstream_host":"%UPSTREAM_HOST%","upstream_local_address":"%UPSTREAM_LOCAL_ADDRESS%","upstream_service_time":"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%","upstream_transport_failure_reason":"%UPSTREAM_TRANSPORT_FAILURE_REASON%","user_agent":"%REQ(USER-AGENT)%","x_forwarded_for":"%REQ(X-FORWARDED-FOR)%"}
              path: /dev/stdout
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [ "*" ]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          stream_idle_timeout: 1800s
  clusters:
  - name: local
    connect_timeout: 0.25s
    http2_protocol_options:
      max_concurrent_streams: 1
    common_http_protocol_options:
      idle_timeout: 60s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: local
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              pipe:
                path: /run/spire/sockets/agent.sock
What's even stranger is that if I set the stream idle timeout to 15 minutes in the Spire-Envoy configuration,
...
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [ "*" ]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          stream_idle_timeout: 900s  # <<< changed
  clusters:
  - name: local
    connect_timeout: 0.25s
    http2_protocol_options:
      max_concurrent_streams: 1
    common_http_protocol_options:
      idle_timeout: 60s
...
then the SDS connection is forcibly terminated after the timeout, causing Envoy to reconnect, and the connection is maintained continuously. This is very confusing to me.
My Envoy connects to SDS without TLS (I know this is not secure, but please ignore that for now). Could this be causing the issue? The path is: envoy --(tcp)--> envoy(proxy) --(uds)--> spire-sds
General information:
From the segments of the config it is a bit unclear what's going on, so I'll try to rephrase the configuration, and if we agree we can go from there.
The configuration uses GoogleGrpc in order to connect to the SDS server. No ADS is used, but IIUC all sds_config segments use the same configuration server.
I'm assuming that these are all static resources, so Envoy should create the xDS connection and fetch the SDS secrets when it starts up. Note that this should happen even before the first data-plane request.
Moreover, the xDS path should be decoupled from the data-plane path: this is not an on-demand service, but rather a pub/sub where Envoy fetches all the certificates at the beginning, and whenever something is updated it is pushed by the server to Envoy.
When an xDS connection is reset, Envoy will attempt to reconnect (with exponential backoff). Using EnvoyGrpc allows one to control the knobs of that backoff; GoogleGrpc might have a similar option in the underlying lib, but I'm not sure.
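For illustration, a minimal sketch of what an EnvoyGrpc-based sds_config could look like. The cluster name spire_agent_sds is hypothetical, and the retry_policy knob on envoy_grpc only exists in newer Envoy releases, so treat this as an assumption rather than something verified against 1.19:
sds_config:
  api_config_source:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: spire_agent_sds  # hypothetical static cluster, sketched below
        retry_policy:                  # reconnect backoff knobs (newer Envoy only)
          retry_back_off:
            base_interval: 1s
            max_interval: 30s
  resource_api_version: V3

# ... and a matching static cluster; enabling TCP keepalives here would also
# help detect silently dropped connections:
clusters:
- name: spire_agent_sds
  connect_timeout: 1s
  type: STRICT_DNS
  http2_protocol_options: {}      # xDS requires HTTP/2
  upstream_connection_options:
    tcp_keepalive: {}             # use OS defaults for keepalive probes
  load_assignment:
    cluster_name: spire_agent_sds
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: spire-agent-proxy.spire.svc.cluster.local
              port_value: 9001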
> I observed that within a certain period, Envoy will reconnect to SDS and obtain the latest certificates, but after a few hours, it will not actively reconnect
Seems strange. I suggest increasing the log level and seeing whether Envoy is actually not trying to connect, or trying and failing. I guess the former, but then I wonder whether Envoy is aware that the xDS connection was terminated (or whether that was masked by the gRPC library).
From the last comments, it doesn't seem that the issue occurs when Envoy disconnects from the SDS server, but rather when an Envoy data-plane connection to an upstream is dropped: upon reconnecting to the upstream, Envoy does not find the certificate, although it should already have it.
> What's even stranger is that if I set the stream timeout to 15 minutes in the Spire-Envoy configuration the SDS connection will be forcibly terminated after the timeout, causing Envoy to reconnect.
I'm not familiar with this, but can you provide details on which stream idle timeout was modified (is it in the data-plane config or the xDS-connection-related config)?
After extensive testing, I can confirm the issue details. Adding the following keepalive channel_args to the google_grpc configuration works around it:
channel_args:
  args:
    grpc.http2.max_pings_without_data:
      int_value: 0
    grpc.keepalive_permit_without_calls:
      int_value: 1
    grpc.keepalive_time_ms:
      int_value: 180000
    grpc.keepalive_timeout_ms:
      int_value: 2000
This allows the client to detect the dead connection and reconnect, avoiding the problem.
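For completeness, here is where those channel_args sit, using the google_grpc target from the config above; the values are the same, and the comments are my reading of the gRPC keepalive semantics:
sds_config:
  api_config_source:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - google_grpc:
        stat_prefix: spire_agent
        target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
        channel_args:
          args:
            # allow unlimited HTTP/2 PINGs even when no data is flowing
            grpc.http2.max_pings_without_data:
              int_value: 0
            # send keepalive pings even when there are no active calls
            grpc.keepalive_permit_without_calls:
              int_value: 1
            # ping every 3 minutes ...
            grpc.keepalive_time_ms:
              int_value: 180000
            # ... and declare the connection dead if no ack arrives within 2s
            grpc.keepalive_timeout_ms:
              int_value: 2000
  initial_fetch_timeout: 120s
  resource_api_version: V3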
This behavior confirms that google_grpc without keepalives can leak connections: the stream silently goes dead and is never torn down. The root cause may involve an idle-timeout somewhere in the network environment; I've checked the kernel parameters but found nothing relevant.
Description: Envoy establishes a network connection with Spire via Nginx: envoy -> nginx -> spire-sds. To balance load, Nginx actively disconnects every 30 minutes.
I observed that within a certain period, Envoy reconnects to SDS and obtains the latest certificates, but after a few hours it no longer actively reconnects.
After this, Envoy no longer establishes a connection with spire-sds, and subsequent requests fail the TLS handshake.
Envoy version: 1.19.4
Envoy configuration:
Is this behavior expected? Does Envoy detect certificate expiration and actively pull the certificate again?