envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

CDS, listeners not initialized, envoy server not starting #36189

Open VishalDamgude opened 2 days ago

VishalDamgude commented 2 days ago

Title: CDS, listeners not initialized, envoy server not starting, stuck at DiscoveryRequest to xDS

Description:

Custom Envoy image built on top of version 1.30.4. xDS control plane version: github.com/envoyproxy/go-control-plane v0.11.1. Trace logs are enabled.

CDS and listeners are not initialized and the Envoy server does not start; it is stuck at the DiscoveryRequest to xDS. The admin endpoint is also not initialized. We see a log 'Sending DiscoveryRequest for type.googleapis.com/envoy.config.cluster.v3.Cluster' with all the extensions in this request, but no gRPC stream is established between Envoy and xDS. We don't see the logs below in our xDS app, and there are no response logs on the Envoy side either.

OnStreamOpen    {"name": "xds", "streamID": 1, "typeURL": ""}
OnStreamRequest {"name": "xds", "streamID": 1, "discovery.request": {"node":{"id":"043f95f81807","cluster":"staging_envoy","locality":{"region":"unknown","zone":"unknown","sub_zone":"unknown"}},"version":"","typeurl":"type.googleapis.com/envoy.config.cluster.v3.Cluster","respNonce":"","errorDetail":"<nil>"}}

Note: All IPs are masked in the attached logs. One of the xDS host IPs: 10.1.1.1

[Tags: "ConnectionId":"3"] connecting to 10.1.1.1:19000

We see the logs below for this connection ID before the DiscoveryRequest is sent:

[Tags: "ConnectionId":"3"] read error: Resource temporarily unavailable, code: 0
[Tags: "ConnectionId":"3"] hc grpc_status=0 service_status=serving health_flags=/failed_active_hc/pending_active_hc

Config:

Attaching sample config file. smtp-test-config.yaml.txt

Logs:

Attached log file - smtp-pod-logs smtp-pod-logs.txt

zuercher commented 2 days ago

ConnectionId 3 looks like a failed health check request:

[2024-09-16 17:52:07.233][1][debug][hc] [source/extensions/health_checkers/grpc/health_checker_impl.cc:394] [Tags: "ConnectionId":"3"] hc grpc_status=0 service_status=serving health_flags=/failed_active_hc/pending_active_hc

Until that passes, the xDS requests will fail.
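When scanning a long trace log for these health check results, it can help to pull the fields out of each `hc` line programmatically. The following is a minimal sketch, assuming the line format shown in the log excerpt above; the helper name `parse_hc_line` is hypothetical and not part of Envoy.

```python
import re

# Matches the key=value fields printed after "hc " in the log line above.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_hc_line(line: str) -> dict:
    """Return the hc fields of one log line, with health_flags split into a list."""
    # Everything after the first " hc " is treated as the health check summary.
    _, _, tail = line.partition(" hc ")
    fields = dict(FIELD_RE.findall(tail))
    flags = fields.get("health_flags", "")
    # Flags are slash-separated with a leading slash, so drop empty pieces.
    fields["health_flags"] = [f for f in flags.split("/") if f]
    return fields


line = ('[Tags: "ConnectionId":"3"] hc grpc_status=0 service_status=serving '
        'health_flags=/failed_active_hc/pending_active_hc')
print(parse_hc_line(line))
```

A host whose `health_flags` list still contains `failed_active_hc` is the case described above: Envoy does not consider the xDS host healthy yet.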

VishalDamgude commented 1 day ago

Is there any issue with the xDS cluster config?

    name: xds
    per_connection_buffer_limit_bytes: 32768 # 32 KiB
    type: STRICT_DNS
    connect_timeout: 5s
    # TODO: More evaluation for policy
    lb_policy: LEAST_REQUEST
    load_assignment:
      # TODO: add "policy" configuration
      cluster_name: xds
      # TODO: This must be plain text via NLB PrivateLink
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: xds.edge.svc.cluster.local.
                port_value: 19000

    health_checks:
      - interval_jitter: 1s
        unhealthy_threshold: 6
        healthy_threshold: 1
        event_logger:
        - name: envoy.health_check.event_sinks.file
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.health_check.event_sinks.file.v3.HealthCheckEventFileSink
            event_log_path: "/dev/stdout"
        always_log_health_check_failures: true
        timeout: 4s
        interval: 10s
        grpc_health_check:
          service_name: xds:ready
    #max_requests_per_connection: xxx
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 20000
          max_pending_requests: 20000
          max_requests: 20000
          retry_budget:
            budget_percent:
              value: 25.0
            min_retry_concurrency: 10
        - priority: HIGH
          max_connections: 20000
          max_pending_requests: 20000
          max_requests: 20000
          retry_budget:
            budget_percent:
              value: 25.0
            min_retry_concurrency: 10
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        upstream_http_protocol_options:
        common_http_protocol_options:
          idle_timeout: 55s
          max_headers_count: 170
          headers_with_underscores_action: ALLOW
        explicit_http_config:
          http2_protocol_options:
            max_concurrent_streams: 1024
            initial_stream_window_size: 65536 # 64 KiB
            initial_connection_window_size: 262144 # 256 KiB
            # allow_connect: ???
    dns_refresh_rate: 5s
    dns_failure_refresh_rate:
      base_interval: 1s
      max_interval: 10s
    respect_dns_ttl: true
    dns_lookup_family: V4_ONLY
    # use_tcp_for_dns_lookups: true
    track_cluster_stats:
      timeout_budgets: true
      request_response_sizes: true
    common_lb_config:
      healthy_panic_threshold:
        value: 0.0
      ignore_new_hosts_until_first_hc: true
    upstream_connection_options:
      tcp_keepalive:
        keepalive_probes: 5
        keepalive_interval: 5
        keepalive_time: 300
    # Remove hosts as soon as they are removed from discovery.
    # If this flag is set to false Envoy keeps them around until
    # they become unhealthy to handle misbehaving xDS services.
    ignore_health_on_host_removal: true

We are unable to figure out why the gRPC health check is failing. To build our image, we disabled a few extensions. PR links:

https://github.com/freshworks/envoy/pull/35/files#diff-13060970a4ea615e3217021eae701cdf705594965e35fdba846026b2f9f92e46

and

https://github.com/freshworks/envoy/pull/35/files#diff-5294c859ad138c7624de53e0f9566c0b630f5a9522ff357d23b2f0b17d6e3991

But there is no change related to gRPC extensions.

VishalDamgude commented 1 day ago

Also, we used gcc-10 to compile the code, because we were getting errors with the gcc-11 present in https://hub.docker.com/layers/envoyproxy/envoy-build-ubuntu/f94a38f62220a2b017878b790b6ea98a0f6c5f9c

It seems gcc-11 treats warnings as errors.

Relevant issue raised for this: https://github.com/envoyproxy/envoy/issues/35943

VishalDamgude commented 1 day ago

I also tried TCP health checks for the xDS cluster.

[Tags: "ConnectionId":"3"] hc tcp healthcheck passed, health_check_address=10.89.6.2:19000

Even though the TCP health check passed, the DiscoveryRequest is not being sent to xDS. The only log for the xDS connection after 'Sending DiscoveryRequest' is [Tags: "ConnectionId":"3"] close during connected callback.

If I run grpcurl commands from the Envoy container, I am able to connect to xDS:

root@e8dfaba5d0ac:/etc/envoy# grpcurl -plaintext 10.89.1.2:19000 grpc.health.v1.Health/Check
{
  "status": "SERVING"
}
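Note that the check above sends an empty HealthCheckRequest, which asks for the server's overall status, while the cluster config tells Envoy to check the named service `xds:ready`. To compare like with like, the same call can be repeated with the `service` field set. A minimal sketch that builds such a command, assuming the target address from the command above; the helper name `health_check_command` is hypothetical.

```python
import json
import shlex


def health_check_command(target: str, service_name: str) -> str:
    """Build a grpcurl command that sends HealthCheckRequest{service: service_name}."""
    # "service" is the only field of grpc.health.v1.HealthCheckRequest.
    payload = json.dumps({"service": service_name})
    args = ["grpcurl", "-plaintext", "-d", payload, target,
            "grpc.health.v1.Health/Check"]
    # shlex.join quotes the JSON payload so it can be pasted into a shell.
    return shlex.join(args)


# service_name taken from grpc_health_check.service_name in the cluster config.
print(health_check_command("10.89.1.2:19000", "xds:ready"))
```

If the printed command returns something other than SERVING for `xds:ready`, that would point at the health service registration on the xDS side rather than at the Envoy build.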

root@e8dfaba5d0ac:/etc/envoy# grpcurl -plaintext -d '{}' 10.89.1.2:19000 envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources
ERROR:
  Code: InvalidArgument
  Message: type URL is required for ADS

root@e8dfaba5d0ac:/etc/envoy# grpcurl -plaintext -d '{
  "node": {
    "id": "e8dfaba5d0ac",
    "cluster": "staging_edge_envoy_emailservice-smtp",
    "user_agent_name": "envoy"
  },
  "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster"
}' 10.89.1.2:19000 envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources

And the xDS management server is able to respond to these grpcurl requests:

OnStreamOpen    {"name": "xds-edge", "streamID": 1, "typeURL": ""}
2024-09-19T18:09:37.912Z        INFO    edge-xds.xds-exporter   xds/xds.go:503  OnStreamClosed  {"name": "xds-edge", "streamID": 1}
2024-09-19T18:15:25.326Z       INFO    edge-xds.xds-exporter   xds/xds.go:497  OnStreamOpen    {"name": "xds-edge", "streamID": 2, "typeURL": ""}
2024-09-19T18:15:25.327Z        INFO    edge-xds.xds-exporter   xds/xds.go:508  OnStreamRequest {"name": "xds-edge", "streamID": 2, "discovery.request": {"node":{"id":"e8dfaba5d0ac","cluster":"staging_edge_envoy_emailservice-smtp","locality":null},"version":"","typeurl":"type.googleapis.com/envoy.config.cluster.v3.Cluster","respNonce":"","errorDetail":"<nil>"}}
2024-09-19T18:15:25.328Z       DEBUG   edge-xds.xds-exporter   v3/server.go:256        nodeID "staging_edge_envoy_emailservice-smtp" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[] and known map[]. Diff []  {"name": "xds-edge"}
2024-09-19T18:15:25.328Z       DEBUG   edge-xds.xds-exporter   v3/server.go:210        respond type.googleapis.com/envoy.config.cluster.v3.Cluster[] version "" with version "3094120924027038760"     {"name": "xds-edge"}
2024-09-19T18:15:25.329Z        INFO    edge-xds.xds-exporter   xds/xds.go:503  OnStreamClosed  {"name": "xds-edge", "streamID": 2}

@zuercher