envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Active health check for LOGICAL_DNS cluster is failing #3908

Closed · psrin7 closed this issue 6 years ago

psrin7 commented 6 years ago

Active health check for cluster is failing

The active health check is failing even though I am getting a response from the upstream. Not sure if I am missing anything in the configuration.

Admin and stats output:

cluster.sample-cluster.bind_errors: 0
cluster.sample-cluster.external.upstream_rq_200: 11
cluster.sample-cluster.external.upstream_rq_2xx: 11
cluster.sample-cluster.external.upstream_rq_301: 1
cluster.sample-cluster.external.upstream_rq_3xx: 1
cluster.sample-cluster.health_check.attempt: 12
cluster.sample-cluster.health_check.failure: 12
cluster.sample-cluster.health_check.healthy: 0
cluster.sample-cluster.health_check.network_failure: 12
cluster.sample-cluster.health_check.passive_failure: 0
cluster.sample-cluster.health_check.success: 0
cluster.sample-cluster.health_check.verify_cluster: 0
cluster.sample-cluster.lb_healthy_panic: 12
cluster.sample-cluster.lb_local_cluster_not_ok: 0
cluster.sample-cluster.lb_recalculate_zone_structures: 0
cluster.sample-cluster.lb_subsets_active: 0
cluster.sample-cluster.lb_subsets_created: 0
cluster.sample-cluster.lb_subsets_fallback: 0
cluster.sample-cluster.lb_subsets_removed: 0
cluster.sample-cluster.lb_subsets_selected: 0
cluster.sample-cluster.lb_zone_cluster_too_small: 0
cluster.sample-cluster.lb_zone_no_capacity_left: 0
cluster.sample-cluster.lb_zone_number_differs: 0
cluster.sample-cluster.lb_zone_routing_all_directly: 0
cluster.sample-cluster.lb_zone_routing_cross_zone: 0
cluster.sample-cluster.lb_zone_routing_sampled: 0
cluster.sample-cluster.max_host_weight: 0
cluster.sample-cluster.membership_change: 1
cluster.sample-cluster.membership_healthy: 0
cluster.sample-cluster.membership_total: 1

Config:

node:
  id: some-node
  cluster: default-cluster
  locality:
        zone: default-zone

admin:
  access_log_path: /dev/stdout
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: sample-cluster
          http_filters:
          - name: envoy.router
  clusters:
  - name: sample-cluster
    type: LOGICAL_DNS
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    dns_lookup_family: V4_ONLY
    hosts:
    - socket_address:
        protocol: TCP
        address: some-domain.com  #have changed it for privacy concern
        port_value: 80
    health_checks:
    - timeout: 2s
      interval: 5s
      interval_jitter: 1s
      unhealthy_threshold: 1
      healthy_threshold: 3
      no_traffic_interval: 60s
      event_log_path: /dev/stdout
      http_health_check:
        path: /

Logs:

[2018-07-19 20:54:09.457][19][info][main] source/server/server.cc:183] initializing epoch 0 (hot restart version=10.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363 size=2654312)
[2018-07-19 20:54:09.457][19][info][main] source/server/server.cc:185] statically linked extensions:
[2018-07-19 20:54:09.457][19][info][main] source/server/server.cc:187] access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
[2018-07-19 20:54:09.457][19][info][main] source/server/server.cc:190] filters.http: envoy.buffer,envoy.cors,envoy.ext_authz,envoy.fault,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.rbac,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2018-07-19 20:54:09.457][19][info][main] source/server/server.cc:193] filters.listener: envoy.listener.original_dst,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2018-07-19 20:54:09.457][19][info][main] source/server/server.cc:196] filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.thrift_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2018-07-19 20:54:09.458][19][info][main] source/server/server.cc:198] stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2018-07-19 20:54:09.458][19][info][main] source/server/server.cc:200] tracers: envoy.dynamic.ot,envoy.lightstep,envoy.zipkin
[2018-07-19 20:54:09.458][19][info][main] source/server/server.cc:203] transport_sockets.downstream: envoy.transport_sockets.capture,raw_buffer,tls
[2018-07-19 20:54:09.458][19][info][main] source/server/server.cc:206] transport_sockets.upstream: envoy.transport_sockets.capture,raw_buffer,tls
[2018-07-19 20:54:09.463][19][debug][main] source/server/server.cc:234] admin address: 0.0.0.0:9901
[2018-07-19 20:54:09.464][19][info][config] source/server/configuration_impl.cc:50] loading 0 static secret(s)
[2018-07-19 20:54:09.465][22][debug][grpc] source/common/grpc/google_async_client_impl.cc:39] completionThread running
[2018-07-19 20:54:09.466][19][debug][upstream] source/common/upstream/cluster_manager_impl.cc:707] adding TLS initial cluster sample-cluster
[2018-07-19 20:54:09.466][19][debug][upstream] source/common/upstream/logical_dns_cluster.cc:70] starting async DNS resolution for some-domain.com
[2018-07-19 20:54:09.466][19][debug][upstream] source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 5000 milliseconds
[2018-07-19 20:54:09.466][19][debug][upstream] source/common/upstream/cluster_manager_impl.cc:61] cm init: adding: cluster=sample-cluster primary=1 secondary=0
[2018-07-19 20:54:09.466][19][info][config] source/server/configuration_impl.cc:60] loading 1 listener(s)
[2018-07-19 20:54:09.466][19][debug][config] source/server/configuration_impl.cc:62] listener #0:
[2018-07-19 20:54:09.466][19][debug][config] source/server/listener_manager_impl.cc:528] begin add/update listener: name=listener_0 hash=16491985507912357005
[2018-07-19 20:54:09.466][19][debug][config] source/server/listener_manager_impl.cc:38] filter #0:
[2018-07-19 20:54:09.466][19][debug][config] source/server/listener_manager_impl.cc:39] name: envoy.http_connection_manager
[2018-07-19 20:54:09.466][19][debug][config] source/server/listener_manager_impl.cc:42] config: {"http_filters":[{"name":"envoy.router"}],"route_config":{"virtual_hosts":[{"name":"local_service","domains":["*"],"routes":[{"match":{"prefix":"/"},"route":{"cluster":"sample-cluster"}}]}],"name":"local_route"},"stat_prefix":"ingress_http","codec_type":null}
[2018-07-19 20:54:09.468][19][debug][config] source/extensions/filters/network/http_connection_manager/config.cc:279] http filter #0
[2018-07-19 20:54:09.468][19][debug][config] source/extensions/filters/network/http_connection_manager/config.cc:280] name: envoy.router
[2018-07-19 20:54:09.468][19][debug][config] source/extensions/filters/network/http_connection_manager/config.cc:284] config: {}
[2018-07-19 20:54:09.468][19][debug][config] source/server/listener_manager_impl.cc:414] add active listener: name=listener_0, hash=16491985507912357005, address=0.0.0.0:10000
[2018-07-19 20:54:09.468][19][info][config] source/server/configuration_impl.cc:94] loading tracing configuration
[2018-07-19 20:54:09.468][19][info][config] source/server/configuration_impl.cc:116] loading stats sink configuration
[2018-07-19 20:54:09.468][19][info][main] source/server/server.cc:410] starting main dispatch loop
[2018-07-19 20:54:09.468][19][debug][upstream] source/common/upstream/logical_dns_cluster.cc:78] async DNS resolution complete for some-domain.com
[2018-07-19 20:54:09.468][19][debug][client] source/common/http/codec_client.cc:25] [C0] connecting
[2018-07-19 20:54:09.468][19][debug][connection] source/common/network/connection_impl.cc:570] [C0] connecting to 0.0.0.0:0
[2018-07-19 20:54:09.468][19][debug][connection] source/common/network/connection_impl.cc:579] [C0] connection in progress
[2018-07-19 20:54:09.472][19][debug][connection] source/common/network/connection_impl.cc:475] [C0] delayed connection error: 111
[2018-07-19 20:54:09.472][19][debug][connection] source/common/network/connection_impl.cc:133] [C0] closing socket: 0
[2018-07-19 20:54:09.472][19][debug][client] source/common/http/codec_client.cc:81] [C0] disconnect. resetting 1 pending requests
[2018-07-19 20:54:09.472][19][debug][client] source/common/http/codec_client.cc:104] [C0] request reset
[2018-07-19 20:54:09.472][19][debug][hc] source/common/upstream/health_checker_impl.cc:170] [C0] connection/stream error health_flags=healthy
[2018-07-19 20:54:09.472][19][debug][upstream] source/common/upstream/cluster_manager_impl.cc:844] membership update for TLS cluster sample-cluster {"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"0.0.0.0","resolver_name":"","ipv4_compat":false,"port_value":0}},"cluster_name":"sample-cluster","eject_unhealthy_event":{"failure_type":"NETWORK"}}
[2018-07-19 20:54:09.472][19][debug][upstream] source/common/upstream/cluster_manager_impl.cc:89] cm init: init complete: cluster=sample-cluster primary=0 secondary=0

curl -v http://some-domain.com:80

danielhochman commented 6 years ago

Envoy expects a 200 status code in the response for the health check to succeed. See Architecture Overview: Health checking (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/health_checking#arch-overview-health-checking).

Your example curl is returning 301.
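
If the 301 is a redirect issued by the upstream (e.g. to a canonical vhost or to HTTPS), one option is to make the check request hit an endpoint that answers 200 directly. A minimal sketch using HttpHealthCheck's host field, which sets the Host header on check requests; the host value here is an assumption, not taken from the original config:

health_checks:
- timeout: 2s
  interval: 5s
  unhealthy_threshold: 1
  healthy_threshold: 3
  http_health_check:
    # Assumption: the upstream answers 200 (instead of a 301 redirect)
    # when the request carries the Host header it expects.
    host: some-domain.com
    path: /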

psrin7 commented 6 years ago

I have also tested with the health check endpoint returning 200. It made no difference.


psrin7 commented 6 years ago

What intrigues me is that the health check event shows the address as 0.0.0.0 and the port as 0. It isn't using what is defined in the cluster's socket_address. It also reports failure_type as NETWORK, which isn't much help. Thanks.

Here is the health check event log:

{
  "health_checker_type": "HTTP",
  "host": {
    "socket_address": {
      "protocol": "TCP",
      "address": "0.0.0.0",
      "resolver_name": "",
      "ipv4_compat": false,
      "port_value": 0
    }
  },
  "cluster_name": "sample-cluster",
  "eject_unhealthy_event": {
    "failure_type": "NETWORK"
  }
}

ghost commented 6 years ago

@psrin7 ,

Can you try this config, and hook it up to a service that exposes a /health URI?

node:
  id: nodexxx
  cluster: dc1

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 8001 }

static_resources:
  listeners:
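
To make the suggestion concrete, here is a minimal sketch of the kind of STATIC cluster with a /health check that such a listener could route to; the cluster name, the 127.0.0.1:8080 backend, and the field values are assumptions for illustration, not the original config:

  clusters:
  - name: local-service
    type: STATIC
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address: { address: 127.0.0.1, port_value: 8080 }  # hypothetical backend
    health_checks:
    - timeout: 2s
      interval: 5s
      unhealthy_threshold: 1
      healthy_threshold: 3
      http_health_check:
        path: /health  # the /health URI mentioned above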

psrin7 commented 6 years ago

@vgomprakash - Thanks for your input. The above configuration works. The caveat is that the discovery type (cluster.DiscoveryType) is STATIC, so you need to provide an IP address. But I need to use DNS. Since I am using DNS for the host's address, I have to use STRICT_DNS or LOGICAL_DNS. When I change the type to STRICT_DNS, active health checking works; see the sketch below. But it isn't working with LOGICAL_DNS as the type.
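
For reference, a sketch of the working variant: the same cluster as in the config at the top of the issue, with only the discovery type changed (and health_checks written as the list Envoy expects):

  clusters:
  - name: sample-cluster
    type: STRICT_DNS  # was LOGICAL_DNS; health checks pass with STRICT_DNS
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    dns_lookup_family: V4_ONLY
    hosts:
    - socket_address:
        protocol: TCP
        address: some-domain.com
        port_value: 80
    health_checks:
    - timeout: 2s
      interval: 5s
      interval_jitter: 1s
      unhealthy_threshold: 1
      healthy_threshold: 3
      no_traffic_interval: 60s
      event_log_path: /dev/stdout
      http_health_check:
        path: /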

cetanu commented 6 years ago

I can confirm that switching from logical_dns to strict_dns resulted in my active health checks working.

The behaviour I observed is that Envoy does not send any packets at all when set to logical_dns. I suspect this is because it resolves the host as 0.0.0.0, which is what shows up in the stats endpoint on the Envoy admin UI and in the debug logs. This may be a symptom of what's going on.