envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Avoid Envoy listener_drain and filter_chains_draining causing TCP reset #35109

Closed shonecyx closed 2 months ago

shonecyx commented 4 months ago

Title: Avoid Envoy listener_drain and filter_chains_draining causing TCP reset

Description: We have some use cases that apply changes to a NETWORK_FILTER, such as the access log sampling mentioned in https://github.com/istio/istio/issues/51655, and other cases that update the filter_chain. After the change we observed massive listener draining, as shown below:

[Image: Screenshot 2024-07-09 at 15 42 07]

[Image: Screenshot 2024-07-09 at 15 43 27]

This happened on both sidecar east-west TCP connections and egressgateway TCP connections. Here is one case of an egressgateway filter chain change: after the draining, the tcpdump shows that it caused a reset to the application. .225 is the egressgateway Envoy and .139 is the app. The egressgateway sends FIN to the application while the app keeps sending data, and the app then gets RST.

[Image: Pasted Graphic]

For HTTP (HCM) this is not a big concern, since in most cases client retries can handle it. But for TCP (network.tcp_proxy.v3.TcpProxy), as in the egressgateway case or the sidecar TCP passthrough, it causes massive connection resets across the entire mesh and some data plane impact. BTW, the draining behavior of the tcp_proxy network filter is not clearly described in https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining

Expected behavior: An Envoy NETWORK_FILTER change should not cause a reset (Envoy should not send FIN to the client).

nezdolik commented 4 months ago

cc @alyssawilk @zuercher @ggreenway (tcp_proxy)

alyssawilk commented 4 months ago

Sorry, is the problem that doing a listener update is causing problems? Have you looked into ECDS to update your network filter config without reloading listeners? cc @adisuissa

shonecyx commented 4 months ago

Thanks @alyssawilk for responding. Unfortunately there is no ECDS support in Istio, as per https://github.com/istio/istio/issues/37172. Are you suggesting that using ECDS can avoid listener draining and listener filter draining? cc @howardjohn

howardjohn commented 4 months ago

I don't think Envoy can configure access logs over ECDS. IIUC this is about the listener access log?

adisuissa commented 4 months ago

AFAIK there are certain Listener updates that do not replace the listener, just update in-place. It may be possible to add this functionality for access log updates.

shonecyx commented 4 months ago

Not only AccessLog but also AuthorizationPolicy: we got a listener drain for the AccessLog change and a listener_filter_drain for the TCP AuthorizationPolicy, and both had data plane impact. @howardjohn

adisuissa commented 4 months ago

I'm not sure how AuthorizationPolicy is being mapped to the xDS API, but if this is part of the network filter, then using ECDS is probably the right way to go.
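
To make the ECDS suggestion concrete, here is a minimal sketch (in Python, printing JSON) of a filter-chain entry that omits the inline typed_config and instead points at ECDS through config_discovery. The resource name rbac-ecds and the choice of ADS as the config source are assumptions for illustration only; this is not taken from any Istio-generated config.

```python
import json


def ecds_network_filter(name, type_url):
    """Build a filter-chain entry whose config is fetched via ECDS.

    Instead of an inline "typed_config", the entry carries a
    "config_discovery" block naming the config source and the
    accepted type URL(s).
    """
    return {
        "name": name,
        "config_discovery": {
            # Assumption: the control plane serves ECDS over ADS.
            "config_source": {"ads": {}, "resource_api_version": "V3"},
            "type_urls": [type_url],
        },
    }


rbac_filter = ecds_network_filter(
    "rbac-ecds",  # hypothetical resource name
    "type.googleapis.com/envoy.extensions.filters.network.rbac.v3.RBAC",
)
print(json.dumps(rbac_filter, indent=1))
```

With an entry like this, a policy change would be delivered as a new ECDS resource rather than as a changed listener, which is the property the suggestion relies on.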

howardjohn commented 4 months ago

It's an RBAC network filter. FWIW we have discussed that in Istio, and it's somewhat considered intentional to drain on RBAC change to ensure we don't have old connections that are no longer accepted by the policies (IDK if this is 100% valid, TBH, but it's not universally better to use ECDS).

shonecyx commented 4 months ago

@howardjohn What about having the listener drain not close the connection? For that scenario, is it better to use ECDS to avoid the data plane impact?

shonecyx commented 4 months ago

@howardjohn If there is no plan for ECDS in Istio, is it possible to add an annotation (something like applyAfterTimeStamp) in EnvoyFilter so that the EnvoyFilter applies only to newly created pods? Then we can avoid immediate data plane impact on existing sidecars.

shonecyx commented 3 months ago

"AFAIK there are certain Listener updates that do not replace the listener, just update in-place. It may be possible to add this functionality for access log updates."

@adisuissa For the listener drain caused by the access log change, is there also a functional gap in Envoy? Does it need to support in-place listener updates for this?

adisuissa commented 3 months ago

Generally speaking, listener in-place replacement is discouraged (long history in Envoy, e.g., #21059, #20100, #16177, #12748). So if there are other ways to achieve the requested feature, they should be preferred over this.

If there are specific fields that are updated and known not to cause issues, then it may be possible to add this kind of support (probably following up on changes made in #10662). Can you please add more context on this issue, such as which fields need to be updated without draining? Will it be possible to provide the relevant Envoy config? Looking more closely at the Istio bug you've linked it seems that this is not a listener's access-log update, but an HCM filter update, is this correct? If so, it does seem that using ECDS is the right way to go here.
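
One way to answer the question of which fields changed is to diff the listener config taken before and after the update. The helper below is a hypothetical sketch, not part of Envoy or Istio tooling, and the two sample dictionaries are trimmed-down stand-ins for real config dumps.

```python
import copy


def changed_paths(old, new, prefix=""):
    """Return the paths whose values differ between two JSON-like objects."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(set(old) | set(new)):
            child = f"{prefix}.{key}" if prefix else key
            if key not in old or key not in new:
                paths.append(child)  # key added or removed
            else:
                paths.extend(changed_paths(old[key], new[key], child))
        return paths
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        paths = []
        for i, (a, b) in enumerate(zip(old, new)):
            paths.extend(changed_paths(a, b, f"{prefix}[{i}]"))
        return paths
    # Scalars, type mismatches, or lists of different length.
    return [] if old == new else [prefix]


# Trimmed stand-in for a listener before the access log change.
old = {
    "name": "virtualOutbound",
    "filter_chains": [{"filters": [{
        "name": "envoy.filters.network.tcp_proxy",
        "typed_config": {"access_log": [{"typed_config": {
            "log_format": {"json_format": {"duration": "%DURATION%"}}}}]},
    }]}],
}
# Same listener with one extra access log field.
new = copy.deepcopy(old)
fmt = new["filter_chains"][0]["filters"][0]["typed_config"]["access_log"][0]
fmt["typed_config"]["log_format"]["json_format"]["rlog_id"] = "%RESP(RLOGID)%"

for path in changed_paths(old, new):
    print(path)
```

Running the same comparison on the real before/after listener dumps would show whether the change is confined to a single filter's access_log or touches other listener fields as well.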

shonecyx commented 3 months ago

Thanks @adisuissa for the details. We have different use cases that need to update the access log, and the massive listener drain or listener filter drain causes data plane impact.

Case 1: Add new access log fields. I.e., we need to add some custom fields like "rlog_id": "%RESP(RLOGID)%". In this case it's not in HCM, but it caused a listener drain, and all the outbound TCP connections got reset and then re-established:

 {
     "name": "virtualOutbound",
     "active_state": {
      "version_info": "2024-07-16T20:32:31Z/54545",
      "listener": {
       "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
       "name": "virtualOutbound",
       "address": {
        "socket_address": {
         "address": "0.0.0.0",
         "port_value": 15001
        }
       },
       "filter_chains": [
        {
         "filter_chain_match": {
          "destination_port": 15001
         },
         "filters": [
          {
           "name": "istio.stats",
           "typed_config": {
            "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
            "type_url": "type.googleapis.com/stats.PluginConfig",
            "value": {
             "metrics": [
              {
               "tags_to_remove": [
                "response_flags",
                "source_version",
                "source_canonical_service",
                "source_canonical_revision",
                "source_cluster",
                "source_principal",
                "destination_version",
                "destination_canonical_service",
                "destination_canonical_revision",
                "destination_cluster",
                "destination_principal"
               ]
              }
             ],
             "response_code_by_category": true
            }
           }
          },
          {
           "name": "envoy.filters.network.tcp_proxy",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
            "stat_prefix": "BlackHoleCluster",
            "cluster": "BlackHoleCluster"
           }
          }
         ],
         "name": "virtualOutbound-blackhole"
        },
        {
         "filters": [
          {
           "name": "istio.stats",
           "typed_config": {
            "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
            "type_url": "type.googleapis.com/stats.PluginConfig",
            "value": {
             "metrics": [
              {
               "tags_to_remove": [
                "response_flags",
                "source_version",
                "source_canonical_service",
                "source_canonical_revision",
                "source_cluster",
                "source_principal",
                "destination_version",
                "destination_canonical_service",
                "destination_canonical_revision",
                "destination_cluster",
                "destination_principal"
               ]
              }
             ],
             "response_code_by_category": true
            }
           }
          },
          {
           "name": "envoy.filters.network.tcp_proxy",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
            "stat_prefix": "PassthroughCluster",
            "cluster": "PassthroughCluster",
            "access_log": [
             {
              "name": "envoy.access_loggers.file",
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
               "path": "/var/log/proxy/access.log",
               "log_format": {
                "json_format": {
                 "authority": "%REQ(:AUTHORITY)%",
                 "bytes_received": "%BYTES_RECEIVED%",
                 "bytes_sent": "%BYTES_SENT%",
                 "connection_termination_details": "%CONNECTION_TERMINATION_DETAILS%",
                 "downstream_local_address": "%DOWNSTREAM_LOCAL_ADDRESS%",
                 "downstream_local_uri_san": "%DOWNSTREAM_LOCAL_URI_SAN%",
                 "downstream_peer_issuer": "%DOWNSTREAM_PEER_ISSUER%",
                 "downstream_peer_subject": "%DOWNSTREAM_PEER_SUBJECT%",
                 "downstream_peer_uri_san": "%DOWNSTREAM_PEER_URI_SAN%",
                 "downstream_remote_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
                 "downstream_tls_cipher": "%DOWNSTREAM_TLS_CIPHER%",
                 "downstream_tls_version": "%DOWNSTREAM_TLS_VERSION%",
                 "duration": "%DURATION%",
                 "method": "%REQ(:METHOD)%",
                 "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
                 "protocol": "%PROTOCOL%",
                 "request_id": "%REQ(X-REQUEST-ID)%",
                 "requested_server_name": "%REQUESTED_SERVER_NAME%",
                 "response_code": "%RESPONSE_CODE%",
                 "response_code_details": "%RESPONSE_CODE_DETAILS%",
                 "response_flags": "%RESPONSE_FLAGS%",
                 "rlog_id": "%RESP(RLOGID)%",
                 "route_name": "%ROUTE_NAME%",
                 "start_time": "%START_TIME%",
                 "upstream_cluster": "%UPSTREAM_CLUSTER%",
                 "upstream_host": "%UPSTREAM_HOST%",
                 "upstream_local_address": "%UPSTREAM_LOCAL_ADDRESS%",
                 "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
                 "upstream_transport_failure_reason": "%UPSTREAM_TRANSPORT_FAILURE_REASON%",
                 "upstream_wire_bytes_received": "%UPSTREAM_WIRE_BYTES_RECEIVED%",
                 "upstream_wire_bytes_sent": "%UPSTREAM_WIRE_BYTES_SENT%",
                 "user_agent": "%REQ(USER-AGENT)%",
                 "x_forwarded_for": "%REQ(X-FORWARDED-FOR)%"
                }
               }
              }
             }
            ]
           }
          }
         ],
         "name": "virtualOutbound-catchall-tcp"
        }

Case 2: Access log sampling. In this case it's in HCM, and we need to frequently change percent_sampled, but the access log fields might be changing at the same time:

 {
     "name": "virtualInbound",
     "active_state": {
      "version_info": "2024-07-16T20:32:31Z/54545",
      "listener": {
       "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
       "name": "virtualInbound",
       "address": {
        "socket_address": {
         "address": "0.0.0.0",
         "port_value": 15006
        }
       },
       "filter_chains": [
        {
         "filter_chain_match": {
          "destination_port": 8083,
          "transport_protocol": "raw_buffer"
         },
         "filters": [
          {
           "name": "istio_authn",
           "typed_config": {
            "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
            "type_url": "type.googleapis.com/io.istio.network.authn.Config"
           }
          },
          {
           "name": "istio.metadata_exchange",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
            "protocol": "istio-peer-exchange"
           }
          },
          {
           "name": "envoy.filters.network.http_connection_manager",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
            "stat_prefix": "inbound_10.0.0.1_8083",
            "route_config": {
             "name": "inbound|8083||",
             "virtual_hosts": [
              {
               "name": "inbound|http|8083",
               "domains": [
                "*"
               ],
               "routes": [
                {
                 "match": {
                  "prefix": "/abc"
                 },
                 "route": {
                  "cluster": "inbound|8083||",
                  "timeout": "0s",
                  "max_stream_duration": {
                   "max_stream_duration": "0s"
                  }
                 },
                 "request_headers_to_add": [
                  {
                   "header": {
                    "key": "X-AAA-Client-IP",
                    "value": "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
                   }
                  },
                  {
                   "header": {
                    "key": "X-Client-IP",
                    "value": "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
                   }
                  }
                 ],
                 "name": "service83"
                }
               ],
               "response_headers_to_add": [
                {
                 "header": {
                  "key": "x-example-mesh-server-pod-ip",
                  "value": "%DOWNSTREAM_LOCAL_ADDRESS_WITHOUT_PORT%"
                 },
                 "append_action": "ADD_IF_ABSENT"
                },
                {
                 "header": {
                  "key": "x-example-mesh-server-duration",
                  "value": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
                 },
                 "append_action": "ADD_IF_ABSENT"
                }
               ]
              }
             ],
             "validate_clusters": false
            },
            "http_filters": [
             {
              "name": "istio.metadata_exchange",
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm",
               "config": {
                "vm_config": {
                 "runtime": "envoy.wasm.runtime.null",
                 "code": {
                  "local": {
                   "inline_string": "envoy.wasm.metadata_exchange"
                  }
                 }
                },
                "configuration": {
                 "@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange"
                }
               }
              }
             },
             {
              "name": "envoy.filters.http.fault",
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault"
              }
             },
             {
              "name": "envoy.filters.http.cors",
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors"
              }
             },
             {
              "name": "istio.stats",
              "typed_config": {
               "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
               "type_url": "type.googleapis.com/stats.PluginConfig",
               "value": {
                "disable_host_header_fallback": true,
                "metrics": [
                 {
                  "tags_to_remove": [
                   "response_flags",
                   "source_version",
                   "source_canonical_service",
                   "source_canonical_revision",
                   "source_cluster",
                   "source_principal",
                   "destination_version",
                   "destination_canonical_service",
                   "destination_canonical_revision",
                   "destination_cluster",
                   "destination_principal"
                  ]
                 },
                 {
                  "name": "request_bytes",
                  "tags_to_remove": [
                   "response_code"
                  ]
                 },
                 {
                  "name": "response_bytes",
                  "tags_to_remove": [
                   "response_code"
                  ]
                 },
                 {
                  "name": "request_duration_milliseconds",
                  "tags_to_remove": [
                   "response_code"
                  ]
                 }
                ],
                "response_code_by_category": true
               }
              }
             },
             {
              "name": "envoy.filters.http.router",
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
              }
             }
            ],
            "tracing": {
             "client_sampling": {
              "value": 100
             },
             "random_sampling": {
              "value": 1
             },
             "overall_sampling": {
              "value": 100
             },
             "custom_tags": [
              {
               "tag": "istio.authorization.dry_run.allow_policy.name",
               "metadata": {
                "kind": {
                 "request": {}
                },
                "metadata_key": {
                 "key": "envoy.filters.http.rbac",
                 "path": [
                  {
                   "key": "istio_dry_run_allow_shadow_effective_policy_id"
                  }
                 ]
                }
               }
              },
              {
               "tag": "istio.authorization.dry_run.allow_policy.result",
               "metadata": {
                "kind": {
                 "request": {}
                },
                "metadata_key": {
                 "key": "envoy.filters.http.rbac",
                 "path": [
                  {
                   "key": "istio_dry_run_allow_shadow_engine_result"
                  }
                 ]
                }
               }
              },
              {
               "tag": "istio.authorization.dry_run.deny_policy.name",
               "metadata": {
                "kind": {
                 "request": {}
                },
                "metadata_key": {
                 "key": "envoy.filters.http.rbac",
                 "path": [
                  {
                   "key": "istio_dry_run_deny_shadow_effective_policy_id"
                  }
                 ]
                }
               }
              },
              {
               "tag": "istio.authorization.dry_run.deny_policy.result",
               "metadata": {
                "kind": {
                 "request": {}
                },
                "metadata_key": {
                 "key": "envoy.filters.http.rbac",
                 "path": [
                  {
                   "key": "istio_dry_run_deny_shadow_engine_result"
                  }
                 ]
                }
               }
              },
              {
               "tag": "istio.canonical_revision",
               "literal": {
                "value": "latest"
               }
              },
              {
               "tag": "istio.canonical_service",
               "literal": {
                "value": "wiresettlementsvccont"
               }
              },
              {
               "tag": "istio.mesh_id",
               "literal": {
                "value": "rnpci.tess.io"
               }
              },
              {
               "tag": "istio.namespace",
               "literal": {
                "value": "wiresettlementsvc-rnpci-1"
               }
              }
             ]
            },
            "server_name": "example server",
            "access_log": [
             {
              "name": "envoy.access_loggers.file",
              "filter": {
               "and_filter": {
                "filters": [
                 {
                  "not_health_check_filter": {}
                 },
                 {
                  "or_filter": {
                   "filters": [
                    {
                     "and_filter": {
                      "filters": [
                       {
                        "runtime_filter": {
                         "runtime_key": "http_ok_response_sampling_fraction",
                         "percent_sampled": {
                          "numerator": 1
                         },
                         "use_independent_randomness": true
                        }
                       },
                       {
                        "status_code_filter": {
                         "comparison": {
                          "value": {
                           "default_value": 200,
                           "runtime_key": "http_ok_response_sampling_status_eq"
                          }
                         }
                        }
                       }
                      ]
                     }
                    },
                    {
                     "and_filter": {
                      "filters": [
                       {
                        "runtime_filter": {
                         "runtime_key": "http_err_response_sampling_fraction",
                         "percent_sampled": {
                          "numerator": 100
                         },
                         "use_independent_randomness": true
                        }
                       },
                       {
                        "or_filter": {
                         "filters": [
                          {
                           "status_code_filter": {
                            "comparison": {
                             "op": "LE",
                             "value": {
                              "default_value": 199,
                              "runtime_key": "http_err_response_sampling_status_le"
                             }
                            }
                           }
                          },
                          {
                           "status_code_filter": {
                            "comparison": {
                             "op": "GE",
                             "value": {
                              "default_value": 201,
                              "runtime_key": "http_err_response_sampling_status_ge"
                             }
                            }
                           }
                          }
                         ]
                        }
                       }
                      ]
                     }
                    }
                   ]
                  }
                 }
                ]
               }
              },
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
               "path": "/var/log/proxy/access.log",
               "log_format": {
                "json_format": {
                 "response_code": "%RESPONSE_CODE%",
                 "method": "%REQ(:METHOD)%",
                 "request_id": "%REQ(X-REQUEST-ID)%",
                 "bytes_sent": "%BYTES_SENT%",
                 "connection_termination_details": "%CONNECTION_TERMINATION_DETAILS%",
                 "requested_server_name": "%REQUESTED_SERVER_NAME%",
                 "downstream_tls_cipher": "%DOWNSTREAM_TLS_CIPHER%",
                 "downstream_peer_issuer": "%DOWNSTREAM_PEER_ISSUER%",
                 "downstream_peer_subject": "%DOWNSTREAM_PEER_SUBJECT%",
                 "upstream_host": "%UPSTREAM_HOST%",
                 "x_forwarded_for": "%REQ(X-FORWARDED-FOR)%",
                 "rlog_id": "%RESP(RLOGID)%",
                 "route_name": "%ROUTE_NAME%",
                 "user_agent": "%REQ(USER-AGENT)%",
                 "downstream_tls_version": "%DOWNSTREAM_TLS_VERSION%",
                 "response_code_details": "%RESPONSE_CODE_DETAILS%",
                 "duration": "%DURATION%",
                 "start_time": "%START_TIME%",
                 "authority": "%REQ(:AUTHORITY)%",
                 "downstream_peer_uri_san": "%DOWNSTREAM_PEER_URI_SAN%",
                 "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
                 "response_flags": "%RESPONSE_FLAGS%",
                 "bytes_received": "%BYTES_RECEIVED%",
                 "downstream_remote_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
                 "downstream_local_address": "%DOWNSTREAM_LOCAL_ADDRESS%",
                 "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
                 "upstream_local_address": "%UPSTREAM_LOCAL_ADDRESS%",
                 "upstream_cluster": "%UPSTREAM_CLUSTER%",
                 "protocol": "%PROTOCOL%",
                 "upstream_transport_failure_reason": "%UPSTREAM_TRANSPORT_FAILURE_REASON%",
                 "downstream_local_uri_san": "%DOWNSTREAM_LOCAL_URI_SAN%"
                }
               }
              }
             }
            ],
            "use_remote_address": false,
            "forward_client_cert_details": "APPEND_FORWARD",
            "set_current_client_cert_details": {
             "subject": true,
             "dns": true,
             "uri": true
            },
            "upgrade_configs": [
             {
              "upgrade_type": "websocket"
             }
            ],
            "stream_idle_timeout": "0s",
            "normalize_path": true,
            "request_id_extension": {
             "typed_config": {
              "@type": "type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig",
              "use_request_id_for_trace_sampling": true
             }
            },
            "path_with_escaped_slashes_action": "KEEP_UNCHANGED"
           }
          }
         ],
         "name": "10.0.0.1_8083"
        }
        ....
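
Regarding the sampling in Case 2: each runtime_filter in the config above carries both a runtime_key and a default percent_sampled. As far as the Envoy access log filter documentation describes it, a value found in runtime under that key replaces the default numerator. The following is a simplified model of that decision, assuming a denominator of 100 and a plain dict standing in for the runtime layer. It is a sketch of the idea, not Envoy's implementation.

```python
import random


def should_log(runtime, runtime_key, default_numerator, denominator=100):
    """Simplified model of a runtime-controlled sampling decision."""
    # A runtime value, when present, overrides the configured default numerator.
    numerator = runtime.get(runtime_key, default_numerator)
    # Log when a uniform draw in [0, denominator) falls below the numerator.
    return random.randrange(denominator) < numerator


KEY = "http_ok_response_sampling_fraction"

# No override: the configured default of 1 applies (roughly 1% sampled).
sampled = sum(should_log({}, KEY, default_numerator=1) for _ in range(10_000))
print(f"sampled {sampled} of 10000 with the default numerator")

# Hypothetical runtime override to 100: every request is sampled.
print(should_log({KEY: 100}, KEY, default_numerator=1))  # True
```

If that reading holds for a given deployment, the sampling rate could in principle be adjusted through runtime without touching the listener. Whether Istio exposes such an override is not established in this thread.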
github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.