Closed shonecyx closed 2 months ago
cc @alyssawilk @zuercher @ggreenway (tcp_proxy)
Sorry, is the problem that doing a listener update is causing problems? Have you looked into ECDS to update your network filter config without reloading listeners? cc @adisuissa
Thanks @alyssawilk for responding. Unfortunately there is no ECDS support in Istio as per https://github.com/istio/istio/issues/37172. Are you suggesting that using ECDS can avoid listener draining and listener filter draining? cc @howardjohn
I don't think envoy can configure access log over ECDS. IIUC this is about listener access log?
AFAIK there are certain Listener updates that do not replace the listener, just update in-place. It may be possible to add this functionality for access log updates.
Not only AccessLog but also AuthorizationPolicy: we got a listener drain for the AccessLog change and a listener_filter_drain for the TCP AuthorizationPolicy change, and both had data plane impact. @howardjohn
I'm not sure how AuthorizationPolicy is being mapped to the xDS API, but if this is part of the network filter, then using ECDS is probably the right way to go.
It's an RBAC network filter. FWIW we have discussed that in Istio, and it's somewhat considered intentional to drain on RBAC change to ensure we don't have old connections that are no longer accepted by the policies (IDK if this is 100% valid, TBH, but it's not universally better to use ECDS).
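For readers unfamiliar with what "using ECDS" would look like for this filter, here is a minimal sketch of a filter-chain entry that carries `config_discovery` instead of an inline `typed_config`. The helper name `ecds_network_filter` is made up for illustration, the choice of ADS as the config source is an assumption, and whether a given Envoy version and control plane support ECDS for network filters needs to be verified separately (the thread notes Istio does not).

```python
import json


def ecds_network_filter(name: str, type_url: str) -> dict:
    """Build a filter-chain entry whose config is delivered via ECDS.

    Hypothetical helper: it only assembles the JSON shape of Envoy's
    `config_discovery` (ExtensionConfigSource) field. Using ADS as the
    config source is an assumption, not something stated in the thread.
    """
    return {
        "name": name,
        # No inline "typed_config": the filter config arrives as a
        # separate ECDS resource, so it can change without a listener update.
        "config_discovery": {
            "config_source": {"ads": {}, "resource_api_version": "V3"},
            "type_urls": [type_url],
        },
    }


if __name__ == "__main__":
    rbac = ecds_network_filter(
        "envoy.filters.network.rbac",
        "type.googleapis.com/envoy.extensions.filters.network.rbac.v3.RBAC",
    )
    print(json.dumps(rbac, indent=2))
```

As noted above, skipping the drain is not always desirable for RBAC, since existing connections would keep running under whatever policy was in force when they were accepted.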
@howardjohn What about for the listener drain to not close the connection? For that scenario is it better to use ECDS to avoid the data plane impact?
@howardjohn If there is no plan for ECDS in Istio, is it possible to add an annotation (something like applyAfterTimeStamp) to EnvoyFilter so that the EnvoyFilter is only applied to newly created pods? Then we could avoid immediate data plane impact on existing sidecars.
@adisuissa For the listener drain caused by the access log change, is there also a functional gap in Envoy? Would it need to support in-place listener updates for this?
Generally speaking, listener in-place replacement is discouraged (long history in Envoy, e.g., #21059, #20100, #16177, #12748). So if there are other ways to achieve the requested feature, they should be preferred over this.
If there are specific fields that are updated and known not to cause issues, then it may be possible to add this kind of support (probably following up on changes made in #10662). Can you please add more context on this issue, such as which fields need to be updated without draining? Will it be possible to provide the relevant Envoy config? Looking more closely at the Istio bug you've linked it seems that this is not a listener's access-log update, but an HCM filter update, is this correct? If so, it does seem that using ECDS is the right way to go here.
Thanks @adisuissa for the details. We have different use cases that need to update the access log, and the massive listener drain or listener filter drain causes data plane impact.
Case 1: Add new access log fields
I.e. we need to add some custom fields like "rlog_id": "%RESP(RLOGID)%",
In this case the access log is not in HCM, but the change caused a listener drain and all the outbound TCP connections were reset and then re-established:
{
"name": "virtualOutbound",
"active_state": {
"version_info": "2024-07-16T20:32:31Z/54545",
"listener": {
"@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
"name": "virtualOutbound",
"address": {
"socket_address": {
"address": "0.0.0.0",
"port_value": 15001
}
},
"filter_chains": [
{
"filter_chain_match": {
"destination_port": 15001
},
"filters": [
{
"name": "istio.stats",
"typed_config": {
"@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
"type_url": "type.googleapis.com/stats.PluginConfig",
"value": {
"metrics": [
{
"tags_to_remove": [
"response_flags",
"source_version",
"source_canonical_service",
"source_canonical_revision",
"source_cluster",
"source_principal",
"destination_version",
"destination_canonical_service",
"destination_canonical_revision",
"destination_cluster",
"destination_principal"
]
}
],
"response_code_by_category": true
}
}
},
{
"name": "envoy.filters.network.tcp_proxy",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
"stat_prefix": "BlackHoleCluster",
"cluster": "BlackHoleCluster"
}
}
],
"name": "virtualOutbound-blackhole"
},
{
"filters": [
{
"name": "istio.stats",
"typed_config": {
"@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
"type_url": "type.googleapis.com/stats.PluginConfig",
"value": {
"metrics": [
{
"tags_to_remove": [
"response_flags",
"source_version",
"source_canonical_service",
"source_canonical_revision",
"source_cluster",
"source_principal",
"destination_version",
"destination_canonical_service",
"destination_canonical_revision",
"destination_cluster",
"destination_principal"
]
}
],
"response_code_by_category": true
}
}
},
{
"name": "envoy.filters.network.tcp_proxy",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
"stat_prefix": "PassthroughCluster",
"cluster": "PassthroughCluster",
"access_log": [
{
"name": "envoy.access_loggers.file",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
"path": "/var/log/proxy/access.log",
"log_format": {
"json_format": {
"authority": "%REQ(:AUTHORITY)%",
"bytes_received": "%BYTES_RECEIVED%",
"bytes_sent": "%BYTES_SENT%",
"connection_termination_details": "%CONNECTION_TERMINATION_DETAILS%",
"downstream_local_address": "%DOWNSTREAM_LOCAL_ADDRESS%",
"downstream_local_uri_san": "%DOWNSTREAM_LOCAL_URI_SAN%",
"downstream_peer_issuer": "%DOWNSTREAM_PEER_ISSUER%",
"downstream_peer_subject": "%DOWNSTREAM_PEER_SUBJECT%",
"downstream_peer_uri_san": "%DOWNSTREAM_PEER_URI_SAN%",
"downstream_remote_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
"downstream_tls_cipher": "%DOWNSTREAM_TLS_CIPHER%",
"downstream_tls_version": "%DOWNSTREAM_TLS_VERSION%",
"duration": "%DURATION%",
"method": "%REQ(:METHOD)%",
"path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
"protocol": "%PROTOCOL%",
"request_id": "%REQ(X-REQUEST-ID)%",
"requested_server_name": "%REQUESTED_SERVER_NAME%",
"response_code": "%RESPONSE_CODE%",
"response_code_details": "%RESPONSE_CODE_DETAILS%",
"response_flags": "%RESPONSE_FLAGS%",
"rlog_id": "%RESP(RLOGID)%",
"route_name": "%ROUTE_NAME%",
"start_time": "%START_TIME%",
"upstream_cluster": "%UPSTREAM_CLUSTER%",
"upstream_host": "%UPSTREAM_HOST%",
"upstream_local_address": "%UPSTREAM_LOCAL_ADDRESS%",
"upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
"upstream_transport_failure_reason": "%UPSTREAM_TRANSPORT_FAILURE_REASON%",
"upstream_wire_bytes_received": "%UPSTREAM_WIRE_BYTES_RECEIVED%",
"upstream_wire_bytes_sent": "%UPSTREAM_WIRE_BYTES_SENT%",
"user_agent": "%REQ(USER-AGENT)%",
"x_forwarded_for": "%REQ(X-FORWARDED-FOR)%"
}
}
}
}
]
}
}
],
"name": "virtualOutbound-catchall-tcp"
}
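To answer the earlier question about exactly which fields change between two pushes, one option is to diff the listener sections of two config dumps. The following is a rough sketch, not an Envoy or Istio tool; `diff_paths` is a hypothetical helper, and the two small dicts in the usage section are trimmed stand-ins for the listener before and after adding `rlog_id`.

```python
from typing import Any, Iterator


def diff_paths(old: Any, new: Any, path: str = "") -> Iterator[str]:
    """Yield dotted paths at which two JSON-like values differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else key
            if key not in old or key not in new:
                yield child  # key added or removed
            else:
                yield from diff_paths(old[key], new[key], child)
    elif isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            yield path  # report the list itself when its length changed
        else:
            for i, (a, b) in enumerate(zip(old, new)):
                yield from diff_paths(a, b, f"{path}[{i}]")
    elif old != new:
        yield path


if __name__ == "__main__":
    # Trimmed stand-ins for the tcp_proxy access log before and after the change.
    before = {"access_log": [{"typed_config": {"log_format": {
        "json_format": {"duration": "%DURATION%"}}}}]}
    after = {"access_log": [{"typed_config": {"log_format": {
        "json_format": {"duration": "%DURATION%", "rlog_id": "%RESP(RLOGID)%"}}}}]}
    for changed in diff_paths(before, after):
        print(changed)
```

Running this against the real `listener` objects from two `config_dump` snapshots would show whether only the `access_log` subtree differs.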
Case 2: Access Log Sampling
In this case the access log is in HCM and we need to change percent_sampled frequently,
but the access log fields might be changing at the same time:
{
"name": "virtualInbound",
"active_state": {
"version_info": "2024-07-16T20:32:31Z/54545",
"listener": {
"@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
"name": "virtualInbound",
"address": {
"socket_address": {
"address": "0.0.0.0",
"port_value": 15006
}
},
"filter_chains": [
{
"filter_chain_match": {
"destination_port": 8083,
"transport_protocol": "raw_buffer"
},
"filters": [
{
"name": "istio_authn",
"typed_config": {
"@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
"type_url": "type.googleapis.com/io.istio.network.authn.Config"
}
},
{
"name": "istio.metadata_exchange",
"typed_config": {
"@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
"protocol": "istio-peer-exchange"
}
},
{
"name": "envoy.filters.network.http_connection_manager",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
"stat_prefix": "inbound_10.0.0.1_8083",
"route_config": {
"name": "inbound|8083||",
"virtual_hosts": [
{
"name": "inbound|http|8083",
"domains": [
"*"
],
"routes": [
{
"match": {
"prefix": "/abc"
},
"route": {
"cluster": "inbound|8083||",
"timeout": "0s",
"max_stream_duration": {
"max_stream_duration": "0s"
}
},
"request_headers_to_add": [
{
"header": {
"key": "X-AAA-Client-IP",
"value": "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
}
},
{
"header": {
"key": "X-Client-IP",
"value": "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
}
}
],
"name": "service83"
}
],
"response_headers_to_add": [
{
"header": {
"key": "x-example-mesh-server-pod-ip",
"value": "%DOWNSTREAM_LOCAL_ADDRESS_WITHOUT_PORT%"
},
"append_action": "ADD_IF_ABSENT"
},
{
"header": {
"key": "x-example-mesh-server-duration",
"value": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
},
"append_action": "ADD_IF_ABSENT"
}
]
}
],
"validate_clusters": false
},
"http_filters": [
{
"name": "istio.metadata_exchange",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm",
"config": {
"vm_config": {
"runtime": "envoy.wasm.runtime.null",
"code": {
"local": {
"inline_string": "envoy.wasm.metadata_exchange"
}
}
},
"configuration": {
"@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange"
}
}
}
},
{
"name": "envoy.filters.http.fault",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault"
}
},
{
"name": "envoy.filters.http.cors",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors"
}
},
{
"name": "istio.stats",
"typed_config": {
"@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
"type_url": "type.googleapis.com/stats.PluginConfig",
"value": {
"disable_host_header_fallback": true,
"metrics": [
{
"tags_to_remove": [
"response_flags",
"source_version",
"source_canonical_service",
"source_canonical_revision",
"source_cluster",
"source_principal",
"destination_version",
"destination_canonical_service",
"destination_canonical_revision",
"destination_cluster",
"destination_principal"
]
},
{
"name": "request_bytes",
"tags_to_remove": [
"response_code"
]
},
{
"name": "response_bytes",
"tags_to_remove": [
"response_code"
]
},
{
"name": "request_duration_milliseconds",
"tags_to_remove": [
"response_code"
]
}
],
"response_code_by_category": true
}
}
},
{
"name": "envoy.filters.http.router",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
}
}
],
"tracing": {
"client_sampling": {
"value": 100
},
"random_sampling": {
"value": 1
},
"overall_sampling": {
"value": 100
},
"custom_tags": [
{
"tag": "istio.authorization.dry_run.allow_policy.name",
"metadata": {
"kind": {
"request": {}
},
"metadata_key": {
"key": "envoy.filters.http.rbac",
"path": [
{
"key": "istio_dry_run_allow_shadow_effective_policy_id"
}
]
}
}
},
{
"tag": "istio.authorization.dry_run.allow_policy.result",
"metadata": {
"kind": {
"request": {}
},
"metadata_key": {
"key": "envoy.filters.http.rbac",
"path": [
{
"key": "istio_dry_run_allow_shadow_engine_result"
}
]
}
}
},
{
"tag": "istio.authorization.dry_run.deny_policy.name",
"metadata": {
"kind": {
"request": {}
},
"metadata_key": {
"key": "envoy.filters.http.rbac",
"path": [
{
"key": "istio_dry_run_deny_shadow_effective_policy_id"
}
]
}
}
},
{
"tag": "istio.authorization.dry_run.deny_policy.result",
"metadata": {
"kind": {
"request": {}
},
"metadata_key": {
"key": "envoy.filters.http.rbac",
"path": [
{
"key": "istio_dry_run_deny_shadow_engine_result"
}
]
}
}
},
{
"tag": "istio.canonical_revision",
"literal": {
"value": "latest"
}
},
{
"tag": "istio.canonical_service",
"literal": {
"value": "wiresettlementsvccont"
}
},
{
"tag": "istio.mesh_id",
"literal": {
"value": "rnpci.tess.io"
}
},
{
"tag": "istio.namespace",
"literal": {
"value": "wiresettlementsvc-rnpci-1"
}
}
]
},
"server_name": "example server",
"access_log": [
{
"name": "envoy.access_loggers.file",
"filter": {
"and_filter": {
"filters": [
{
"not_health_check_filter": {}
},
{
"or_filter": {
"filters": [
{
"and_filter": {
"filters": [
{
"runtime_filter": {
"runtime_key": "http_ok_response_sampling_fraction",
"percent_sampled": {
"numerator": 1
},
"use_independent_randomness": true
}
},
{
"status_code_filter": {
"comparison": {
"value": {
"default_value": 200,
"runtime_key": "http_ok_response_sampling_status_eq"
}
}
}
}
]
}
},
{
"and_filter": {
"filters": [
{
"runtime_filter": {
"runtime_key": "http_err_response_sampling_fraction",
"percent_sampled": {
"numerator": 100
},
"use_independent_randomness": true
}
},
{
"or_filter": {
"filters": [
{
"status_code_filter": {
"comparison": {
"op": "LE",
"value": {
"default_value": 199,
"runtime_key": "http_err_response_sampling_status_le"
}
}
}
},
{
"status_code_filter": {
"comparison": {
"op": "GE",
"value": {
"default_value": 201,
"runtime_key": "http_err_response_sampling_status_ge"
}
}
}
}
]
}
}
]
}
}
]
}
}
]
}
},
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
"path": "/var/log/proxy/access.log",
"log_format": {
"json_format": {
"response_code": "%RESPONSE_CODE%",
"method": "%REQ(:METHOD)%",
"request_id": "%REQ(X-REQUEST-ID)%",
"bytes_sent": "%BYTES_SENT%",
"connection_termination_details": "%CONNECTION_TERMINATION_DETAILS%",
"requested_server_name": "%REQUESTED_SERVER_NAME%",
"downstream_tls_cipher": "%DOWNSTREAM_TLS_CIPHER%",
"downstream_peer_issuer": "%DOWNSTREAM_PEER_ISSUER%",
"downstream_peer_subject": "%DOWNSTREAM_PEER_SUBJECT%",
"upstream_host": "%UPSTREAM_HOST%",
"x_forwarded_for": "%REQ(X-FORWARDED-FOR)%",
"rlog_id": "%RESP(RLOGID)%",
"route_name": "%ROUTE_NAME%",
"user_agent": "%REQ(USER-AGENT)%",
"downstream_tls_version": "%DOWNSTREAM_TLS_VERSION%",
"response_code_details": "%RESPONSE_CODE_DETAILS%",
"duration": "%DURATION%",
"start_time": "%START_TIME%",
"authority": "%REQ(:AUTHORITY)%",
"downstream_peer_uri_san": "%DOWNSTREAM_PEER_URI_SAN%",
"path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
"response_flags": "%RESPONSE_FLAGS%",
"bytes_received": "%BYTES_RECEIVED%",
"downstream_remote_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
"downstream_local_address": "%DOWNSTREAM_LOCAL_ADDRESS%",
"upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
"upstream_local_address": "%UPSTREAM_LOCAL_ADDRESS%",
"upstream_cluster": "%UPSTREAM_CLUSTER%",
"protocol": "%PROTOCOL%",
"upstream_transport_failure_reason": "%UPSTREAM_TRANSPORT_FAILURE_REASON%",
"downstream_local_uri_san": "%DOWNSTREAM_LOCAL_URI_SAN%"
}
}
}
}
],
"use_remote_address": false,
"forward_client_cert_details": "APPEND_FORWARD",
"set_current_client_cert_details": {
"subject": true,
"dns": true,
"uri": true
},
"upgrade_configs": [
{
"upgrade_type": "websocket"
}
],
"stream_idle_timeout": "0s",
"normalize_path": true,
"request_id_extension": {
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig",
"use_request_id_for_trace_sampling": true
}
},
"path_with_escaped_slashes_action": "KEEP_UNCHANGED"
}
}
],
"name": "10.0.0.1_8083"
}
....
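Since the sampling filters in this config are `runtime_filter` entries with a `runtime_key`, one option that may be worth checking is overriding the numerator through Envoy's runtime rather than through a listener update. Envoy's admin interface has a `runtime_modify` endpoint for this. The sketch below only builds the request URL. The helper name and the admin address `http://localhost:15000` are assumptions, the request would need to be sent as a POST, and it only takes effect if the bootstrap's layered runtime includes an admin layer. It does not help with changes to the log fields themselves.

```python
from urllib.parse import urlencode


def runtime_modify_url(admin: str, overrides: dict) -> str:
    """Build the admin URL that sets runtime keys (to be sent with POST).

    Hypothetical helper: `admin` is the base address of the Envoy admin
    interface, `overrides` maps runtime keys to their new values.
    """
    return f"{admin}/runtime_modify?{urlencode(overrides)}"


if __name__ == "__main__":
    # Assumed admin address; the key name is taken from the config above.
    print(runtime_modify_url(
        "http://localhost:15000",
        {"http_ok_response_sampling_fraction": 5},
    ))
```

A runtime override set this way lives only in that one proxy process, so a fleet-wide change would still need some distribution mechanism.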
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Title: Avoid Envoy listener_drain and filter_chains_draining causing TCP reset
Description: We have some use cases that apply changes to NETWORK_FILTER, such as the access log sampling mentioned in https://github.com/istio/istio/issues/51655, or other cases that update the filter_chain. After the change we observed massive listener draining. This happened for both sidecar east-west TCP connections and egressgateway TCP connections. Here is one case of an egressgateway filter chain change: after the draining, the tcpdump shows that it caused a reset to the application. .225 is the egressgateway Envoy and .139 is the app. The egressgateway sends FIN to the application while the app keeps sending data, and the app then gets RST. For HTTP (HCM) this is not a big concern, since in most cases client retries can handle it. But for TCP (network.tcp_proxy.v3.TcpProxy), as in the egressgateway case or the sidecar TCP passthrough, it causes massive connection resets across the entire mesh and some data plane impact. BTW, the tcp_proxy network filter draining behavior is not clearly described at https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining
Expected behavior: An Envoy NETWORK_FILTER change should not cause a reset (Envoy should not send FIN to the client).
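The FIN-then-RST sequence described above is ordinary TCP behavior when one side fully closes a socket and the other keeps writing. It can be reproduced locally with plain sockets. This is only an illustration of the TCP semantics over loopback, not of Envoy's drain code; the function name is made up, and the exact exception type depends on the operating system.

```python
import socket
import time
from typing import Optional


def send_after_peer_close(attempts: int = 20) -> Optional[str]:
    """Close the accepted side, keep writing from the client side, and
    return the name of the connection error the client eventually sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    peer.close()  # full close: the kernel sends FIN to the client
    try:
        for _ in range(attempts):
            try:
                # The first write usually succeeds; the closed peer answers
                # with RST, and a later write fails.
                client.send(b"payload")
            except ConnectionError as exc:
                return type(exc).__name__
            time.sleep(0.05)  # give the RST time to arrive
        return None
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(send_after_peer_close())
```

An application that treats the FIN as end-of-stream and stops writing would not hit the reset, which is why long-lived TCP clients that keep sending are the ones affected.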