VishalDamgude opened 2 days ago
ConnectionId 3 looks like a failed health check request:
[2024-09-16 17:52:07.233][1][debug][hc] [source/extensions/health_checkers/grpc/health_checker_impl.cc:394] [Tags: "ConnectionId":"3"] hc grpc_status=0 service_status=serving health_flags=/failed_active_hc/pending_active_hc
Until that passes, the xDS requests will fail.
Is there any issue with the xDS cluster config?
name: xds
per_connection_buffer_limit_bytes: 32768 # 32 KiB
type: STRICT_DNS
connect_timeout: 5s
# TODO: More evaluation for policy
lb_policy: LEAST_REQUEST
load_assignment:
  # TODO: add "policy" configuration
  cluster_name: xds
  # TODO: This must be plain text via NLB PrivateLink
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: xds.edge.svc.cluster.local.
            port_value: 19000
health_checks:
- interval_jitter: 1s
  unhealthy_threshold: 6
  healthy_threshold: 1
  event_logger:
  - name: envoy.health_check.event_sinks.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.health_check.event_sinks.file.v3.HealthCheckEventFileSink
      event_log_path: "/dev/stdout"
  always_log_health_check_failures: true
  timeout: 4s
  interval: 10s
  grpc_health_check:
    service_name: xds:ready
# max_requests_per_connection: xxx
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 20000
    max_pending_requests: 20000
    max_requests: 20000
    retry_budget:
      budget_percent:
        value: 25.0
      min_retry_concurrency: 10
  - priority: HIGH
    max_connections: 20000
    max_pending_requests: 20000
    max_requests: 20000
    retry_budget:
      budget_percent:
        value: 25.0
      min_retry_concurrency: 10
typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    upstream_http_protocol_options: {}
    common_http_protocol_options:
      idle_timeout: 55s
      max_headers_count: 170
      headers_with_underscores_action: ALLOW
    explicit_http_config:
      http2_protocol_options:
        max_concurrent_streams: 1024
        initial_stream_window_size: 65536 # 64 KiB
        initial_connection_window_size: 262144 # 256 KiB
        # allow_connect: ???
dns_refresh_rate: 5s
dns_failure_refresh_rate:
  base_interval: 1s
  max_interval: 10s
respect_dns_ttl: true
dns_lookup_family: V4_ONLY
# use_tcp_for_dns_lookups: true
track_cluster_stats:
  timeout_budgets: true
  request_response_sizes: true
common_lb_config:
  healthy_panic_threshold:
    value: 0.0
  ignore_new_hosts_until_first_hc: true
upstream_connection_options:
  tcp_keepalive:
    keepalive_probes: 5
    keepalive_interval: 5
    keepalive_time: 300
# Remove hosts as soon as they are removed from discovery.
# If this flag is set to false, Envoy keeps them around until
# they become unhealthy, to handle misbehaving xDS services.
ignore_health_on_host_removal: true
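As a structural sanity check, the cluster definition can be run through Envoy's validate mode, which parses the bootstrap and rejects schema errors without starting the server (the config path below is an assumption):

envoy --mode validate -c /etc/envoy/envoy.yaml

This only catches config-shape problems, not runtime health-check failures, but it rules out misnested or silently dropped fields.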
We are unable to figure out why the gRPC health check is failing. To build our image, we have disabled a few extensions. PR links:
and
Neither change is related to gRPC extensions.
Also, we have used gcc-10 to compile the code, as we were getting errors with the gcc-11 present in https://hub.docker.com/layers/envoyproxy/envoy-build-ubuntu/f94a38f62220a2b017878b790b6ea98a0f6c5f9c
It seems gcc-11 treats warnings as errors.
Relevant issue raised for this: https://github.com/envoyproxy/envoy/issues/35943
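(One possible alternative to downgrading to gcc-10, not verified here: Envoy's CI builds with clang, and the repository's .bazelrc defines a clang config, so building with that toolchain may sidestep the gcc-11 warnings-as-errors failures entirely:

bazel build --config=clang //source/exe:envoy-static

Whether --config=clang is usable depends on the checkout and the clang toolchain available in the build image.)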
I also tried TCP health checks for the xDS cluster:
[Tags: "ConnectionId":"3"] hc tcp healthcheck passed, health_check_address=10.89.6.2:19000
Even though the TCP health check passed, the DiscoveryRequest is not being sent to xDS. The only log for the xDS connection after 'Sending DiscoveryRequest' is: [Tags: "ConnectionId":"3"] close during connected callback.
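'close during connected callback' suggests the TCP connection completes but is torn down as soon as Envoy tries to use it. As a rough probe (not a real gRPC call), plaintext HTTP/2 negotiation can be tested from the same container with curl's prior-knowledge mode; the server will likely reject the request itself, but the verbose output shows whether the h2c connection settles:

curl -v --http2-prior-knowledge http://10.89.6.2:19000/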
If I run grpcurl commands from the Envoy container, I am able to connect to xDS:
root@e8dfaba5d0ac:/etc/envoy# grpcurl -plaintext 10.89.1.2:19000 grpc.health.v1.Health/Check
{
"status": "SERVING"
}
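Note that a Check request without a service field queries the default (empty) service name, while the cluster's grpc_health_check sends service_name: xds:ready. To reproduce what Envoy's health checker actually asks for:

grpcurl -plaintext -d '{"service": "xds:ready"}' 10.89.1.2:19000 grpc.health.v1.Health/Check

If the server only registers the empty service name, this call returns NOT_FOUND, which would be consistent with a failing active health check.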
root@e8dfaba5d0ac:/etc/envoy# grpcurl -plaintext -d '{}' 10.89.1.2:19000 envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources
ERROR:
Code: InvalidArgument
Message: type URL is required for ADS
root@e8dfaba5d0ac:/etc/envoy# grpcurl -plaintext -d '{
"node": {
"id": "e8dfaba5d0ac",
"cluster": "staging_edge_envoy_emailservice-smtp",
"user_agent_name": "envoy"
},
"type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster"
}' 10.89.1.2:19000 envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources
And the xDS management server is able to respond to these grpcurl requests:
OnStreamOpen {"name": "xds-edge", "streamID": 1, "typeURL": ""}
2024-09-19T18:09:37.912Z INFO edge-xds.xds-exporter xds/xds.go:503 OnStreamClosed {"name": "xds-edge", "streamID": 1}
2024-09-19T18:15:25.326Z INFO edge-xds.xds-exporter xds/xds.go:497 OnStreamOpen {"name": "xds-edge", "streamID": 2, "typeURL": ""}
2024-09-19T18:15:25.327Z INFO edge-xds.xds-exporter xds/xds.go:508 OnStreamRequest {"name": "xds-edge", "streamID": 2, "discovery.request": {"node":{"id":"e8dfaba5d0ac","cluster":"staging_edge_envoy_emailservice-smtp","locality":null},"version":"","typeurl":"type.googleapis.com/envoy.config.cluster.v3.Cluster","respNonce":"","errorDetail":"<nil>"}}
2024-09-19T18:15:25.328Z DEBUG edge-xds.xds-exporter v3/server.go:256 nodeID "staging_edge_envoy_emailservice-smtp" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[] and known map[]. Diff [] {"name": "xds-edge"}
2024-09-19T18:15:25.328Z DEBUG edge-xds.xds-exporter v3/server.go:210 respond type.googleapis.com/envoy.config.cluster.v3.Cluster[] version "" with version "3094120924027038760" {"name": "xds-edge"}
2024-09-19T18:15:25.329Z INFO edge-xds.xds-exporter xds/xds.go:503 OnStreamClosed {"name": "xds-edge", "streamID": 2}
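Since the server side shows the stream opening and a response being prepared right before OnStreamClosed, it may help to capture Envoy's view of the same exchange with per-component trace logging (config path is an assumption):

envoy -c /etc/envoy/envoy.yaml --component-log-level http2:trace,grpc:trace,connection:trace

The connection and http2 components should show why the ADS stream is reset immediately after the request is sent.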
@zuercher
Title: CDS and listeners not initialized, Envoy server not starting, stuck at DiscoveryRequest to xDS
Description:
CDS and the listeners are not initialized, so the Envoy server never finishes starting; it is stuck at the DiscoveryRequest to xDS, and the admin endpoint is also not initialized. We see a log 'Sending DiscoveryRequest for type.googleapis.com/envoy.config.cluster.v3.Cluster' listing all the extensions in the request, but no gRPC stream is established between Envoy and xDS: we don't see the corresponding logs in our xDS app, and there are no response logs on the Envoy side.
Note: all IPs are masked in the attached logs. One of the xDS host IPs: 10.1.1.1
[Tags: "ConnectionId":"3"] connecting to 10.1.1.1:19000
We see the logs below for this connection ID before the DiscoveryRequest is sent:
[Tags: "ConnectionId":"3"] read error: Resource temporarily unavailable, code: 0
[Tags: "ConnectionId":"3"] hc grpc_status=0 service_status=serving health_flags=/failed_active_hc/pending_active_hc
Config:
Logs: