envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Question: When does Envoy actively pull certificates from SDS? #35216

Open ktalg opened 1 month ago

ktalg commented 1 month ago

Description: Envoy establishes a network connection with Spire via Nginx: envoy -> nginx -> spire-sds. To balance load, Nginx actively disconnects every 30 minutes:

http {

    # Add connection timeout settings
    keepalive_timeout 30m;

    server {

        location / {
            grpc_pass grpc://unix:/run/spire/sockets/agent.sock;

            grpc_read_timeout 30m;
            grpc_send_timeout 30m;
        }
    }
}

I observed that within a certain period, Envoy will reconnect to SDS and obtain the latest certificates, but after a few hours, it will not actively reconnect:

curl -s localhost:15000/certs

...
{
  "ca_cert": [
    {
      "path": "server.example.com: \u003cinline\u003e",
      "serial_number": "ef3b9efd678541c6ab83e8839631e34a",
      "subject_alt_names": [
        {
          "uri": "spiffe://server.example.com"
        }
      ],
      "days_until_expiration": "32",
      "valid_from": "2024-06-19T08:27:44Z",
      "expiration_time": "2024-08-18T08:27:54Z"
    }
  ],
  "cert_chain": [

    THIS CERTIFICATE HAS EXPIRED !!!

    {
      "path": "\u003cinline\u003e",
      "serial_number": "ea6155384a2469a8059a89090cc00c51",
      "subject_alt_names": [
        {
          "uri": "spiffe://server.example.com/istio-proxy"
        }
      ],
      "days_until_expiration": "0",
      "valid_from": "2024-07-16T07:33:18Z",
      "expiration_time": "2024-07-16T09:33:28Z"
    }
  ]
...

After this, Envoy will no longer establish a connection with spire-sds. This results in subsequent requests failing the TLS handshake:

rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED

Envoy version: 1.19.4

Envoy configuration:

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
        - name: "spiffe://server.example.com/istio-proxy"
          sds_config:
            resource_api_version: V3
            initial_fetch_timeout: "120s"
            api_config_source:
              api_type: GRPC
              transport_api_version: V3
              grpc_services:
                - google_grpc:
                    target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
                    stat_prefix: spire_agent
      combined_validation_context:
        default_validation_context:
          match_subject_alt_names:
            - exact: "spiffe://server.example.com/istio-proxy"
        validation_context_sds_secret_config:
          name: "spiffe://server.example.com"
          sds_config:
            resource_api_version: V3
            initial_fetch_timeout: "120s"
            api_config_source:
              api_type: GRPC
              transport_api_version: V3
              grpc_services:
                - google_grpc:
                    target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
                    stat_prefix: spire_agent
      tls_params:
        ecdh_curves:
          - X25519:P-256:P-521:P-384

Is this behavior expected? Does Envoy detect certificate expiration and actively initiate a certificate pull request?

KBaichoo commented 1 month ago

cc @adisuissa

I don't have much experience with SDS, but I'm curious why we end up not reconnecting to the SDS cluster? What do the stats for the SDS cluster say and what is its configuration?

ktalg commented 1 month ago

I conducted another test, configuring mTLS at the downstream transport_socket.

...
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
            common_tls_context:
              combined_validation_context:
                default_validation_context:
                  allow_expired_certificate: true
                  match_subject_alt_names:
                  - exact: spiffe://server.example.com/istio-proxy
                validation_context_sds_secret_config:
                  name: spiffe://server.example.com
                  sds_config:
                    api_config_source:
                      api_type: GRPC
                      grpc_services:
                      - google_grpc:
                          stat_prefix: spire_agent
                          target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
                      transport_api_version: V3
                    initial_fetch_timeout: 120s
                    resource_api_version: V3
              tls_certificate_sds_secret_configs:
              - name: spiffe://server.example.com/istio-proxy
                sds_config:
                  api_config_source:
                    api_type: GRPC
                    grpc_services:
                    - google_grpc:
                        stat_prefix: spire_agent
                        target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
                    transport_api_version: V3
                  initial_fetch_timeout: 120s
                  resource_api_version: V3
              tls_params:
                ecdh_curves:
                - X25519:P-256:P-521:P-384
            require_client_certificate: true
            ...

From the monitoring data, the connection was actively terminated by the local side ("self") at around the 18-minute mark, which is quite strange!

[monitoring screenshot]

By the way, I replaced Nginx with Envoy as the Spire SDS proxy in order to get more detailed monitoring and logs. Here are the Spire-Envoy access logs:

{"authority":"spire-agent-proxy.spire.svc.cluster.local:9001","bytes_received":"32145","bytes_sent":"4430","connection_termination_details":"-","downstream_local_address":"10.11.129.239:9001","downstream_remote_address":"172.17.8.199:46906","duration":"1126536","method":"POST","path":"/envoy.service.secret.v3.SecretDiscoveryService/StreamSecrets","protocol":"HTTP/2","request_id":"00e14d3e-9962-47a0-952d-ecdc8c461f29","requested_server_name":"-","response_code":"200","response_code_details":"downstream_remote_disconnect","response_flags":"DC","route_name":"-","start_time":"2024-07-18T05:27:56.143Z","upstream_cluster":"local","upstream_host":"/run/spire/sockets/agent.sock","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"grpc-c++/1.39.0-dev grpc-c/17.0.0 (linux; chttp2)","x_forwarded_for":"-"}
{"authority":"spire-agent-proxy.spire.svc.cluster.local:9001","bytes_received":"48143","bytes_sent":"39330","connection_termination_details":"-","downstream_local_address":"10.11.129.239:9001","downstream_remote_address":"172.17.8.199:46906","duration":"1126538","method":"POST","path":"/envoy.service.secret.v3.SecretDiscoveryService/StreamSecrets","protocol":"HTTP/2","request_id":"8b89c61d-8883-4525-bdab-10cd5a15b3f1","requested_server_name":"-","response_code":"200","response_code_details":"downstream_remote_disconnect","response_flags":"DC","route_name":"-","start_time":"2024-07-18T05:27:56.140Z","upstream_cluster":"local","upstream_host":"/run/spire/sockets/agent.sock","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"grpc-c++/1.39.0-dev grpc-c/17.0.0 (linux; chttp2)","x_forwarded_for":"-"}

and Spire-Envoy's config:

    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address: { address: 0.0.0.0, port_value: 9001 }
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    codec_type: AUTO
                    access_log:
                      - name: envoy.access_loggers.file
                        typed_config:
                          '@type': type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                          log_format:
                            text_format: |
                              {"authority":"%REQ(:AUTHORITY)%","bytes_received":"%BYTES_RECEIVED%","bytes_sent":"%BYTES_SENT%","connection_termination_details":"%CONNECTION_TERMINATION_DETAILS%","downstream_local_address":"%DOWNSTREAM_LOCAL_ADDRESS%","downstream_remote_address":"%DOWNSTREAM_REMOTE_ADDRESS%","duration":"%DURATION%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","protocol":"%PROTOCOL%","request_id":"%REQ(X-REQUEST-ID)%","requested_server_name":"%REQUESTED_SERVER_NAME%","response_code":"%RESPONSE_CODE%","response_code_details":"%RESPONSE_CODE_DETAILS%","response_flags":"%RESPONSE_FLAGS%","route_name":"%ROUTE_NAME%","start_time":"%START_TIME%","upstream_cluster":"%UPSTREAM_CLUSTER%","upstream_host":"%UPSTREAM_HOST%","upstream_local_address":"%UPSTREAM_LOCAL_ADDRESS%","upstream_service_time":"%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%","upstream_transport_failure_reason":"%UPSTREAM_TRANSPORT_FAILURE_REASON%","user_agent":"%REQ(USER-AGENT)%","x_forwarded_for":"%REQ(X-FORWARDED-FOR)%"}
                          path: /dev/stdout
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: local_service
                          domains: [ "*" ]
                          routes:
                            - match: { prefix: "/" }
                              route: { cluster: local }
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                    stream_idle_timeout: 1800s
      clusters:
        - name: local
          connect_timeout: 0.25s
          http2_protocol_options:
            max_concurrent_streams: 1
          common_http_protocol_options:
            idle_timeout: 60s
          type: STATIC
          lb_policy: ROUND_ROBIN
          load_assignment:
            cluster_name: local
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        pipe:
                          path: /run/spire/sockets/agent.sock

ktalg commented 1 month ago

What's even stranger is that if I set the stream timeout to 15 minutes in the Spire-Envoy configuration

...
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: local_service
                          domains: [ "*" ]
                          routes:
                            - match: { prefix: "/" }
                              route: { cluster: local }
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                    stream_idle_timeout: 900s  # <<< changed
      clusters:
        - name: local
          connect_timeout: 0.25s
          http2_protocol_options:
            max_concurrent_streams: 1
          common_http_protocol_options:
            idle_timeout: 60s
 ...

the SDS connection will be forcibly terminated after the timeout, causing Envoy to reconnect. This way, the connection can be maintained continuously. This is very confusing to me.

[monitoring screenshot]

My Envoy establishes its connection to SDS without TLS (I know this is not secure, but please ignore that for now). Could this be causing the issue?

envoy --(tcp)--> envoy(proxy) --(uds)--> spire-sds

adisuissa commented 1 month ago

General information: from the segments of the config it is a bit unclear what's going on, so I'll try to rephrase the configuration, and if we agree we can go from there.

The configuration uses GoogleGrpc in order to connect to the SDS server. No ADS is used, but IIUC all sds_config segments use the same configuration server. I'm assuming that these are all static resources, so Envoy should create the xDS connection and fetch the SDS secrets when it starts up. Note that this should happen even before the first data-plane request.

Moreover, the xDS path should be decoupled from the data-plane path: this is not an on-demand service, but rather a pub/sub where Envoy fetches all the certificates at the beginning, and whenever something is updated it is pushed by the server to Envoy.

When an xDS connection is reset, Envoy will attempt to reconnect (exponential backoff). Using EnvoyGrpc allows one to control the knobs of that backoff. GoogleGrpc might have a similar option in the underlying lib, but I'm not sure.
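
For illustration, a minimal sketch of what the EnvoyGrpc variant could look like (this is not taken from the report above: the cluster name spire_sds_cluster, the keepalive values, and the timeouts are placeholder assumptions, and field availability should be checked against the Envoy version in use):

sds_config:
  resource_api_version: V3
  api_config_source:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          # EnvoyGrpc routes the xDS stream through a regular Envoy cluster,
          # so the cluster's connection settings (e.g. HTTP/2 keepalive) apply.
          cluster_name: spire_sds_cluster

# ...and a statically defined cluster for the SDS server, with HTTP/2
# connection keepalive so a silently dropped connection is noticed:
clusters:
  - name: spire_sds_cluster
    connect_timeout: 1s
    type: STRICT_DNS
    http2_protocol_options:
      connection_keepalive:
        interval: 180s   # periodic HTTP/2 PING
        timeout: 5s      # drop the connection if the PING is not acknowledged
    load_assignment:
      cluster_name: spire_sds_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: spire-agent-proxy.spire.svc.cluster.local
                    port_value: 9001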

> I observed that within a certain period, Envoy will reconnect to SDS and obtain the latest certificates, but after a few hours, it will not actively reconnect

Seems strange. I suggest increasing the log-level and seeing whether Envoy is actually not trying to connect, or trying and failing. I guess the former, but then I wonder if Envoy is aware that the xDS-connection was terminated (or was it masked by the gRPC library).

From the last comments, it doesn't seem that the issue is Envoy disconnecting from the SDS server, but rather that when an Envoy data-plane connection to an upstream is reset, then upon reconnecting to the upstream Envoy does not find the certificate, although it should already have it.

> What's even stranger is that if I set the stream timeout to 15 minutes in the Spire-Envoy configuration the SDS connection will be forcibly terminated after the timeout, causing Envoy to reconnect.

I'm not familiar with this, but can you provide details of which stream idle timeout is modified (is it on the data-plane config or the xDS-connection related config)?

ktalg commented 1 month ago

After extensive testing, I can confirm the issue details:

  1. Envoy runs in K8s and Istio, version: 1.19.4
  2. Client Envoy accesses Spire via K8s Service address. Spire has an Envoy sidecar forwarding SDS messages to UDS.
  3. Every 20 minutes, Spire-Envoy actively closes the stream (due to stream_idle_timeout). Oddly, it logs the access-log entry with a DC response flag, as if the disconnect had been initiated by the client (downstream_remote_disconnect). Meanwhile, the client container shows that its connection to Spire-Envoy (via the Service address) still exists and persists, even though it is actually dead! This should be the reason why the client never actively reconnects.
  4. Adding these keepalive parameters to the client resolved the issue:
channel_args:
  args:
    grpc.http2.max_pings_without_data: 
      int_value: 0
    grpc.keepalive_permit_without_calls:
      int_value: 1
    grpc.keepalive_time_ms: 
      int_value: 180000
    grpc.keepalive_timeout_ms: 
      int_value: 2000

This allows the client to detect the dead connection and reconnect, avoiding the problem.

  5. Setting a smaller stream_idle_timeout earlier likely had a similar effect, enabling quicker detection of the connection status.

This behavior confirms that google_grpc without keepalive can cause connection leaks (dead connections that are never detected). The underlying issue may involve an "idle timeout" somewhere in the network environment; I've checked the kernel parameters but found nothing relevant.
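
For reference, this is roughly where those channel_args plug into the sds_config from my original configuration above (a sketch; only the channel_args block is new, everything else is unchanged):

sds_config:
  resource_api_version: V3
  initial_fetch_timeout: "120s"
  api_config_source:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - google_grpc:
          target_uri: spire-agent-proxy.spire.svc.cluster.local:9001
          stat_prefix: spire_agent
          # gRPC-level keepalive: ping the SDS server periodically so a dead
          # connection is detected and the stream is re-established.
          channel_args:
            args:
              grpc.http2.max_pings_without_data:
                int_value: 0
              grpc.keepalive_permit_without_calls:
                int_value: 1
              grpc.keepalive_time_ms:
                int_value: 180000
              grpc.keepalive_timeout_ms:
                int_value: 2000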

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.