envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Http_health_check still doesn't work in STRICT_DNS type #8852

Closed. yixiangop closed this issue 4 years ago.

yixiangop commented 5 years ago

Hello! ^_^ I have changed the cluster type from LOGICAL_DNS to STRICT_DNS, but http_health_check still doesn't work.

The image version: envoy-alpine:v1.9.0

Config:

clusters:
  - connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    hosts:
      socket_address: {address: <service_name>, port_value: <port>}
    health_checks:
    #- unhealthy_interval: 180s
    - healthy_threshold: 3
      unhealthy_threshold: 3
      #interval_jitter: 1s
      interval: 60s
      http_health_check: {path: "/health"}
      timeout: 1s
      reuse_connection: true
      event_log_path: /healthcheck.log
      always_log_health_check_failures: true
    name: <service_rest>
    type: STRICT_DNS
  - http2_protocol_options: {}
    connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    hosts:
      socket_address: {address: <service_name>, port_value: <port>}
    name: <service_grpc>
    type: STRICT_DNS
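
For reference: hosts is a repeated field, so it normally takes a list of addresses, and it was later deprecated in favor of load_assignment. Below is a minimal sketch of the REST cluster in the newer form, keeping the same placeholders; the field names follow the current Envoy docs and have not been verified against v1.9.0:

  - name: <service_rest>
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    load_assignment:
      cluster_name: <service_rest>
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: {address: <service_name>, port_value: <port>}
    health_checks:
    - timeout: 1s
      interval: 60s
      unhealthy_threshold: 3
      healthy_threshold: 3
      http_health_check: {path: "/health"}
      event_log_path: /healthcheck.log
      always_log_health_check_failures: true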

Log contents:

{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"10.0.16.11","resolver_name":"","ipv4_compat":false,"port_value":"<port>"}},"cluster_name":"<service_rest>","health_check_failure_event":{"failure_type":"ACTIVE","first_check":false},"timestamp":"2019-11-01T05:35:30.584Z"}

Entering the Envoy container and running curl against the health check path returns 200:

/ # curl -I <service_name>:<port>/health
HTTP/1.1 200
Content-Type: <>
Transfer-Encoding: chunked
Date: Fri, 01 Nov 2019 06:14:30 GMT
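
One way to see what Envoy itself thinks of each host is the admin interface's /clusters endpoint. A sketch, assuming the admin listener on port 9002 shown in the full config later in this thread; the per-host health_flags field reports whether active health checking has marked the host failed:

/ # curl -s http://127.0.0.1:9002/clusters | grep health_flags
# output lines look roughly like
# <cluster_name>::<ip>:<port>::health_flags::/failed_active_hc   (failing active HC)
# <cluster_name>::<ip>:<port>::health_flags::healthy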

Grafana graph: [image]

From the graph, the health check seemed to work only for a split second after I deployed the service.

Metrics: envoy_cluster_health_check_healthy{instance="ip:port"}

Prometheus graph:

envoy_cluster_health_check_network_failure

[image]

envoy_cluster_health_check_passive_failure

[image]

So I wonder whether this is due to the configuration or something else. I have looked at Envoy's official documentation for the health check configuration and searched Google, but couldn't find a solution.

Now, this may be my last hope. Thank you very much!

mattklein123 commented 5 years ago

Hard to say what is happening here without more complete logs and stats. Please provide them.
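
For completeness, the per-cluster health check stats being requested here can be pulled from the admin interface. A sketch, assuming the admin port 9002 from the config in the next comment:

/ # curl -s http://127.0.0.1:9002/stats | grep health_check
# expected counters per cluster include cluster.<name>.health_check.attempt,
# .success, .failure, .network_failure, and the gauge .healthy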

yixiangop commented 5 years ago

There are many other services with similar configurations, so I have only pasted two of the route configurations.

Config:

static_resources:
  listeners:
  - address:
      socket_address: {address: 0.0.0.0, port_value: 9191}
    name: rest_listener
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          http_filters:
          - {name: envoy.router}
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - routes:
              - route:
                  cluster: configserver_rest
                  prefix_rewrite: /
                  retry_policy: {retry_on: 5xx, num_retries: 3}
                match: {prefix: /configserver/}
              - route:
                  cluster: uidgenerator_rest
                  prefix_rewrite: /
                  retry_policy: {retry_on: 5xx, num_retries: 3}
                match: {prefix: /uidgenerator/}
              name: rest_host
              domains: ['*']
            name: rest_route
          codec_type: AUTO
  - address:
      socket_address: {address: 0.0.0.0, port_value: 7676}
    name: grpc_listener
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          http_filters:
          - {name: envoy.router}
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - routes:
              - route:
                  cluster: configserver_grpc
                  prefix_rewrite: /
                  retry_policy: {retry_on: 5xx, num_retries: 3}
                match:
                  prefix: /configserver/
                  grpc: {}
              - route:
                  cluster: uidgenerator_grpc
                  prefix_rewrite: /
                  retry_policy: {retry_on: 5xx, num_retries: 3}
                match:
                  prefix: /uidgenerator/
                  grpc: {}
              name: grpc_host
              domains: ['*']
            name: grpc_route
          codec_type: AUTO
  clusters:
  - connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    hosts:
      socket_address: {address: configserver, port_value: 9191}
    health_checks:
    #- unhealthy_interval: 180s
    - healthy_threshold: 3
      unhealthy_threshold: 3
      #interval_jitter: 1s
      interval: 60s
      http_health_check: {path: "/health"}
      timeout: 1s
      reuse_connection: true
      event_log_path: /work/healthcheck.log
      always_log_health_check_failures: true
    name: configserver_rest
    type: STRICT_DNS
  - http2_protocol_options: {}
    connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    hosts:
      socket_address: {address: configserver, port_value: 7676}
    name: configserver_grpc
    type: STRICT_DNS
  - connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    hosts:
      socket_address: {address: uidgenerator, port_value: 9191}
    health_checks:
    #- unhealthy_interval: 180s
    - healthy_threshold: 3
      unhealthy_threshold: 3
      #interval_jitter: 1s
      interval: 60s
      http_health_check: {path: "/health"}
      timeout: 1s
      reuse_connection: true
      event_log_path: /work/healthcheck.log
      always_log_health_check_failures: true
    name: uidgenerator_rest
    type: STRICT_DNS
  - http2_protocol_options: {}
    connect_timeout: 1s
    lb_policy: LEAST_REQUEST
    hosts:
      socket_address: {address: uidgenerator, port_value: 7676}
    name: uidgenerator_grpc
    type: STRICT_DNS

admin:
  address:
    socket_address: {address: 0.0.0.0, port_value: 9002}
  access_log_path: /dev/null

The health check error logs are all basically the same. Envoy health check error logs:

{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"10.0.18.11","resolver_name":"","ipv4_compat":false,"port_value":9191}},"cluster_name":"configserver_rest","health_check_failure_event":{"failure_type":"ACTIVE","first_check":false},"timestamp":"2019-11-02T02:30:00.777Z"}
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"10.0.18.19","resolver_name":"","ipv4_compat":false,"port_value":9191}},"cluster_name":"uidgenerator_rest","health_check_failure_event":{"failure_type":"ACTIVE","first_check":false},"timestamp":"2019-11-02T02:30:19.857Z"}

Envoy logs:

[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:206] initializing epoch 0 (hot restart version=10.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363 size=2654312)
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:208] statically linked extensions:
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:210]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:213]   filters.http: envoy.buffer,envoy.cors,envoy.ext_authz,envoy.fault,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.rbac,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:216]   filters.listener: envoy.listener.original_dst,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:219]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:221]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:223]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.zipkin
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:226]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.capture,raw_buffer,tls
[2019-11-01 09:48:54.819][000006][info][main] [source/server/server.cc:229]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.capture,raw_buffer,tls
[2019-11-01 09:48:54.831][000006][info][main] [source/server/server.cc:271] admin address: 0.0.0.0:9002
[2019-11-01 09:48:54.840][000006][info][config] [source/server/configuration_impl.cc:50] loading 0 static secret(s)
[2019-11-01 09:48:54.840][000006][info][config] [source/server/configuration_impl.cc:56] loading 42 cluster(s)
[2019-11-01 09:48:54.853][000006][info][config] [source/server/configuration_impl.cc:67] loading 2 listener(s)
[2019-11-01 09:48:54.857][000006][info][config] [source/server/configuration_impl.cc:92] loading tracing configuration
[2019-11-01 09:48:54.857][000006][info][config] [source/server/configuration_impl.cc:112] loading stats sink configuration
[2019-11-01 09:48:54.857][000006][info][main] [source/server/server.cc:463] starting main dispatch loop
[2019-11-01 09:49:04.856][000006][info][upstream] [source/common/upstream/cluster_manager_impl.cc:136] cm init: all clusters initialized
[2019-11-01 09:49:04.856][000006][info][main] [source/server/server.cc:435] all clusters initialized. initializing init manager
[2019-11-01 09:49:04.856][000006][info][config] [source/server/listener_manager_impl.cc:961] all dependencies initialized. starting workers
[2019-11-01 10:04:04.857][000006][info][main] [source/server/drain_manager_impl.cc:63] shutting down parent after drain
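
The log above is at info level only. One way to get more detailed health checker output, assuming the admin port 9002 from the config above and that the hc component logger is available in this build, is to raise its level at runtime:

/ # curl -s -X POST 'http://127.0.0.1:9002/logging?hc=debug'
# per-attempt health check activity then appears in the main Envoy log at debug level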

Should I provide any other helpful information? Thank you! ^_^

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

yixiangop commented 4 years ago

Although Envoy's health check never worked, the URL itself was accessible. So I came up with another approach: Prometheus can capture the health of the service using gauge-type metrics that I wrote myself. Therefore, the metrics interface provided by Envoy is no longer important to me. Thank you very much!

dio commented 4 years ago

@Firewall-Tomohisa what about trying a newer version of Envoy? It seems you tried envoy-alpine:v1.9.0.

yixiangop commented 4 years ago

OK, thanks for your advice. I will try. ^_^