apache / apisix

The Cloud-Native API Gateway
https://apisix.apache.org/blog/
Apache License 2.0

bug: 100% CPU usage of worker process caused by healthcheck impl with error (Failed to release lock) #9775

Closed alptugay closed 1 year ago

alptugay commented 1 year ago

Current Behavior

Our APISIX instance has more than 2,500 routes and upstreams, all of which have active health checks enabled. Sometimes (once or twice a week) we see that one or more workers use 100% of a CPU core. At the same time we see the following error log:

2023/07/04 03:15:53 [error] 161670#161670: *27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:*.*.*.*:8500': unlocked, context: ngx.timer, client: *.*.*.*, server: 0.0.0.0:80

We have encountered this situation on multiple instances and multiple times:

At that exact time we can see the CPU core running this worker start to use 100% CPU:

Screenshot 2023-07-04 at 09 57 28

That CPU core also shows an increase in time spent:

Screenshot 2023-07-04 at 10 05 47

Sockstat usage increases as well:

Screenshot 2023-07-04 at 10 07 06 Screenshot 2023-07-04 at 10 07 16

After killing the worker process it returns to normal (by the way, we can't kill the process gracefully; we have to use "kill -9").

We normally run APISIX 3.2.1, but to work around the issue we cherry-picked v3.0.0 of lua-resty-healthcheck, because the lock mechanism seems to have changed there; that caused a massive memory leak, so we reverted.

All of our upstreams have the following timeout and healthcheck values:

 "timeout": {
    "connect": 60,
    "send": 60,
    "read": 60
  },
  "type": "roundrobin",
  "checks": {
    "active": {
      "concurrency": 10,
      "healthy": {
        "http_statuses": [
          200,
          302
        ],
        "interval": 1,
        "successes": 2
      },
      "http_path": "/",
      "https_verify_certificate": true,
      "timeout": 1,
      "type": "tcp",
      "unhealthy": {
        "http_failures": 5,
        "http_statuses": [
          429,
          404,
          500,
          501,
          502,
          503,
          504,
          505
        ],
        "interval": 1,
        "tcp_failures": 2,
        "timeouts": 3
      }
    }
  },
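
For context, this is a sketch of how an upstream carrying these checks would be applied through the Admin API (the upstream id, node address, and admin key are placeholders; the port matches the admin_listen setting in the config below):

curl -s http://127.0.0.1:9180/apisix/admin/upstreams/example-upstream \
  -H 'X-API-KEY: <redacted>' -X PUT -d '
{
  "type": "roundrobin",
  "nodes": [{"host": "10.0.0.1", "port": 8500, "weight": 1}],
  "timeout": {"connect": 60, "send": 60, "read": 60},
  "checks": {
    "active": {
      "type": "tcp",
      "http_path": "/",
      "timeout": 1,
      "concurrency": 10,
      "https_verify_certificate": true,
      "healthy": {"interval": 1, "successes": 2, "http_statuses": [200, 302]},
      "unhealthy": {"interval": 1, "tcp_failures": 2, "timeouts": 3, "http_failures": 5, "http_statuses": [429, 404, 500, 501, 502, 503, 504, 505]}
    }
  }
}'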

There are some other abnormalities as well; I don't know whether they are related, but I'll briefly share them too:

We see lots of upstream timeout errors, even though the upstreams are healthy and running. These connections seem to be related to the Kubernetes watch API (not sure):

2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/kyverno.io/v1/clusterpolicies?allowWatchBookmarks=true&resourceVersion=280957361&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/kyverno.io/v1/clusterpolicies?allowWatchBookmarks=true&resourceVersion=280957361&watch=true", host: "example.com"
2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/apps/v1/deployments?allowWatchBookmarks=true&resourceVersion=280957362&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/apps/v1/deployments?allowWatchBookmarks=true&resourceVersion=280957362&watch=true", host: "example.com"
2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/wgpolicyk8s.io/v1alpha2/policyreports?allowWatchBookmarks=true&resourceVersion=280957363&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/wgpolicyk8s.io/v1alpha2/policyreports?allowWatchBookmarks=true&resourceVersion=280957363&watch=true", host: "example.com"
2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=280957365&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=280957365&watch=true", host: "example.com"
2023/07/04 03:15:52 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/kyverno.io/v1alpha2/admissionreports?allowWatchBookmarks=true&resourceVersion=280957366&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/kyverno.io/v1alpha2/admissionreports?allowWatchBookmarks=true&resourceVersion=280957366&watch=true", host: "example.com"
2023/07/04 03:15:51 [error] 161656#161656: *27855374513 upstream timed out (110: Connection timed out) while reading upstream, client: *.*.*.*, server: _, request: "GET /apis/apiextensions.k8s.io/v1/customresourcedefinitions?allowWatchBookmarks=true&resourceVersion=280957366&watch=true HTTP/2.0", upstream: "https://*.*.*.*:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions?allowWatchBookmarks=true&resourceVersion=280957366&watch=true", host: "example.com"

We have TEngine running in parallel with APISIX; both have health checks enabled. But the connection states are very different: for example, on TEngine we see fewer connections in TIME_WAIT and more in use:

Screenshot 2023-07-04 at 10 09 22

However, in APISIX we see more in TIME_WAIT and fewer in use:

Screenshot 2023-07-04 at 10 30 21

When we disable health checks, it returns to a state similar to TEngine's.

Our config.yml


apisix:
  node_listen:                      # This style support multiple ports
    - port: 80

  enable_admin: true
  enable_dev_mode: false            # Sets nginx worker_processes to 1 if set to true
  enable_reuseport: true            # Enable nginx SO_REUSEPORT switch if set to true.
  show_upstream_status_in_response_header: false # when true all upstream status write to `X-APISIX-Upstream-Status` otherwise only 5xx code
  enable_ipv6: true

  enable_server_tokens: false        # Whether the APISIX version number should be shown in Server header.

  extra_lua_path: ""                # extend lua_package_path to load third party code
  extra_lua_cpath: ""               # extend lua_package_cpath to load third party code

  ssl_session_cache_size: 100m

  geoip_shared_dict_size: 100m

  proxy_cache_distributed_shared_dict_size: 200m
  redis_healthcheck_shared_dict_size: 50m

  proxy_cache:                      # Proxy Caching configuration
    cache_ttl: 10s                  # The default caching time in disk if the upstream does not specify the cache time
    zones:                          # The parameters of a cache
      - name: disk_cache_one        # The name of the cache, administrator can specify
        memory_size: 50m            # The size of shared memory, it's used to store the cache index for
        disk_size: 1G               # The size of disk, it's used to store the cache data (disk)
        disk_path: /tmp/disk_cache_one  # The path to store the cache data (disk)
        cache_levels: 1:2           # The hierarchy levels of a cache (disk)
      - name: memory_cache
        memory_size: 50m

  delete_uri_tail_slash: false    # delete the '/' at the end of the URI
  normalize_uri_like_servlet: false
  router:
    http: radixtree_host_uri         # radixtree_uri: match route by uri(base on radixtree)
    ssl: radixtree_sni          # radixtree_sni: match route by SNI(base on radixtree)
  stream_proxy:                  # TCP/UDP proxy
    only: false                   # use stream proxy only, don't enable HTTP stuff
    tcp:                         # TCP proxy port list
    - addr: "*.*.*.*:22"
    - addr: "*.*.*.*:22"
    - addr: "1789"
    - addr: "5000"
    - addr: "6780"
    - addr: "8000"
    - addr: "8004"
    - addr: "8041"
    - addr: "8042"
    - addr: "8774"
    - addr: "8776"
    - addr: "8780"
    - addr: "8786"
    - addr: "9001"
    - addr: "9292"
    - addr: "9311"
    - addr: "9322"
    - addr: "9511"
    - addr: "9696"
    - addr: "9876"

  resolver_timeout: 5             # resolver timeout
  enable_resolv_search_opt: true  # enable search option in resolv.conf
  ssl:
    enable: true
    listen:                       # APISIX listening port in https.
      - port: 443
        enable_http2: true
    ssl_protocols: TLSv1.2 TLSv1.3
    ssl_ciphers: ECDH+AESGCM:ECDH+AES256:ECDH+AES128:DHE+AES128:!ADH:!AECDH:!MD5
    ssl_session_cache: 100m

  enable_control: true
  control:
    ip: 127.0.0.1
    port: 9090
  disable_sync_configuration_during_start: false  # safe exit. Remove this once the feature is stable
  data_encryption:                # add `encrypt_fields = { $field },` in plugin schema to enable encryption
    enable: false                 # if not set, the default value is `false`.
    keyring:
      - qeddd145sfvddff3          # If not set, will save origin value into etcd.

nginx_config:                     # config for render the template to generate nginx.conf
  user: apisix                     # specifies the execution user of the worker process.
  error_log:
    - syslog:server=unix:/dev/rsyslog,tag=lb_error_log,nohostname warn
  error_log_level:  warn          # warn,error

  enable_cpu_affinity: true       # enable cpu affinity, this is just work well only on physical machine
  worker_processes: 15          # if you want use multiple cores in container, you can inject the number of cpu as environment variable "APISIX_WORKER_PROCESSES"
  custom_cpu_affinity: |
    1111111111111110
  worker_rlimit_nofile: 1048576     # the number of files a worker process can open, should be larger than worker_connections
  worker_shutdown_timeout: 300s   # timeout for a graceful shutdown of worker processes

  max_pending_timers: 16384       # increase it if you see "too many pending timers" error
  max_running_timers: 4096        # increase it if you see "lua_max_running_timers are not enough" error

  event:
    worker_connections: 500000

  meta:
    lua_shared_dict:
      prometheus-metrics: 15m

  stream:
    enable_access_log: true         # enable access log or not, default true
    access_log:
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_2xx3xx,nohostname,severity=info main if=$status_2xx3xx
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_non_2xx3xx,nohostname,severity=info main if=$status_non_2xx3xx
    access_log_format: '{"bytes_sent":"$bytes_sent","connection":"$connection","protocol":"$protocol","remote_addr":"$remote_addr","remote_port":"$remote_port","server_addr":"$server_addr","server_port":"$server_port","session_time":"$session_time","ssl_server_name":"$ssl_server_name","status":"$status","upstream_addr":"$upstream_addr","upstream_bytes_received":"$upstream_bytes_received","upstream_bytes_sent":"$upstream_bytes_sent","upstream_connect_time":"$upstream_connect_time","upstream_session_time":"$upstream_session_time"}'
    access_log_format_escape: json          # allows setting json or default characters escaping in variables
    lua_shared_dict:
      etcd-cluster-health-check-stream: 10m
      lrucache-lock-stream: 10m
      plugin-limit-conn-stream: 10m
      upstream-healthcheck-stream: 100m

  main_configuration_snippet: |
  http_configuration_snippet: |
    map_hash_max_size 20480;
    map_hash_bucket_size 20480;
    map $http_cf_ipcountry $ipcountry { ""                1; default           0; tr                1; }

    map $status $status_2xx3xx {
        ~^[23]  1;
        default 0;
    }
    map $status $status_non_2xx3xx {
        ~^[23]  0;
       default 1;
    }

    sendfile        on;
    tcp_nopush      on;
    tcp_nodelay     on;
    proxy_buffers 4 16k;
    proxy_busy_buffers_size 16k;
    proxy_buffer_size 16k;
  http_server_configuration_snippet: |
    set $masked_hostname "******";
    client_body_buffer_size 128k;
    client_header_buffer_size 5120k;
    large_client_header_buffers 16 5120k;
  http_server_location_configuration_snippet: |
  http_admin_configuration_snippet: |
  http_end_configuration_snippet: |
  stream_configuration_snippet: |
    map $status $status_2xx3xx {
        ~^[23]  1;
        default 0;
    }
    map $status $status_non_2xx3xx {
        ~^[23]  0;
       default 1;
    }

  http:
    enable_access_log: true         # enable access log or not, default true
    access_log:
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_2xx3xx,nohostname,severity=info main if=$status_2xx3xx
      - syslog:server=unix:/dev/rsyslog,tag=lb_access_non_2xx3xx,nohostname,severity=info main if=$status_non_2xx3xx
    access_log_format: '{"cf_ipcountry":"$http_cf_ipcountry","http_x_client_ip":"$http_x_client_ip","http_True_Client_IP":"$http_True_Client_IP","upstream_http_X_Proxy_Cache":"$upstream_http_X_Proxy_Cache","request_id":"$request_id","route_id":"$http_route_id","request_length":"$request_length","remote_addr":"$remote_addr","remote_port":"$remote_port","request":"$request","args":"$args","uri":"$uri","status":"$status","bytes_sent":"$bytes_sent","http_user_agent":"$http_user_agent","http_x_forwarded_for":"$http_x_forwarded_for","http_host":"$http_host","server_name":"$server_name","request_time":"$request_time","upstream":"$upstream_addr","upstream_connect_time":"$upstream_connect_time","upstream_status":"$upstream_status","upstream_response_time":"$upstream_response_time","upstream_cache_status":"$upstream_cache_status","ssl_protocol":"$ssl_protocol","ssl_cipher":"$ssl_cipher","scheme":"$scheme","server_port":"$server_port","request_method":"$request_method","server_protocol":"$server_protocol","http_cf_ray":"$http_cf_ray","ty_lb_waf_id":"$http_ty_lb_waf_id","ty_lb_cc":"$http_ty_lb_cc","ty_lb_asn":"$http_ty_lb_asn"}'
    access_log_format_escape: json       # allows setting json or default characters escaping in variables
    keepalive_timeout: 60s          # timeout during which a keep-alive client connection will stay open on the server side.
    client_header_timeout: 60s      # timeout for reading client request header, then 408 (Request Time-out) error is returned to the client
    client_body_timeout: 60s        # timeout for reading client request body, then 408 (Request Time-out) error is returned to the client
    client_max_body_size: 0         # The maximum allowed size of the client request body.

    send_timeout: 30s              # timeout for transmitting a response to the client.then the connection is closed
    underscores_in_headers: "on"   # default enables the use of underscores in client request header fields
    real_ip_header: X-Real-IP      # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
    real_ip_recursive: "off"       # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_recursive
    real_ip_from:                  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
      - 127.0.0.1
      - "unix:"

    proxy_ssl_server_name: true
    upstream:
      keepalive: 320                # Sets the maximum number of idle keepalive connections to upstream servers that are preserved in the cache of each worker process.
      keepalive_requests: 100000      # Sets the maximum number of requests that can be served through one keepalive connection.
      keepalive_timeout: 60s        # Sets a timeout during which an idle keepalive connection to an upstream server will stay open.
    charset: utf-8                  # Adds the specified charset to the "Content-Type" response header field, see
    variables_hash_max_size: 2048   # Sets the maximum size of the variables hash table.

    lua_shared_dict:
      internal-status: 100m
      plugin-limit-req: 100m
      plugin-limit-count: 100m
      prometheus-metrics: 1024m
      plugin-limit-conn: 100m
      upstream-healthcheck: 100m
      worker-events: 100m
      lrucache-lock: 100m
      balancer-ewma: 100m
      balancer-ewma-locks: 100m
      balancer-ewma-last-touched-at: 100m
      plugin-limit-count-redis-cluster-slot-lock: 100m
      tracing_buffer: 100m
      plugin-api-breaker: 100m
      etcd-cluster-health-check: 100m
      discovery: 100m
      jwks: 100m
      introspection: 100m
      access-tokens: 100m
      ext-plugin: 100m
      tars: 100m
      cas-auth: 100m

graphql:
  max_size: 1048576               # the maximum size limitation of graphql in bytes, default 1MiB

plugins:                          # plugin list (sorted by priority)
  - real-ip                        # priority: 23000
  - ai                             # priority: 22900
  - client-control                 # priority: 22000
  - proxy-control                  # priority: 21990
  - request-id                     # priority: 12015
  - zipkin                         # priority: 12011
  - ext-plugin-pre-req             # priority: 12000
  - fault-injection                # priority: 11000
  - mocking                        # priority: 10900
  - serverless-pre-function        # priority: 10000
  - cors                           # priority: 4000
  - ip-restriction                 # priority: 3000
  - ua-restriction                 # priority: 2999
  - referer-restriction            # priority: 2990
  - csrf                           # priority: 2980
  - uri-blocker                    # priority: 2900
  - request-validation             # priority: 2800
  - openid-connect                 # priority: 2599
  - cas-auth                       # priority: 2597
  - authz-casbin                   # priority: 2560
  - authz-casdoor                  # priority: 2559
  - wolf-rbac                      # priority: 2555
  - ldap-auth                      # priority: 2540
  - hmac-auth                      # priority: 2530
  - basic-auth                     # priority: 2520
  - jwt-auth                       # priority: 2510
  - key-auth                       # priority: 2500
  - consumer-restriction           # priority: 2400
  - forward-auth                   # priority: 2002
  - opa                            # priority: 2001
  - body-transformer               # priority: 1080
  - proxy-mirror                   # priority: 1010
  - proxy-cache-distributed        # priority: 1009
  - proxy-rewrite                  # priority: 1008
  - workflow                       # priority: 1006
  - api-breaker                    # priority: 1005
  - limit-conn                     # priority: 1003
  - limit-count                    # priority: 1002
  - limit-req                      # priority: 1001
  - gzip                           # priority: 995
  - server-info                    # priority: 990
  - multi-dc                       # priority: 967
  - traffic-split                  # priority: 966
  - redirect                       # priority: 900
  - response-rewrite               # priority: 899
  - degraphql                      # priority: 509
  - grpc-transcode                 # priority: 506
  - grpc-web                       # priority: 505
  - public-api                     # priority: 501
  - prometheus                     # priority: 500
  - datadog                        # priority: 495
  - elasticsearch-logger           # priority: 413
  - echo                           # priority: 412
  - loggly                         # priority: 411
  - http-logger                    # priority: 410
  - splunk-hec-logging             # priority: 409
  - skywalking-logger              # priority: 408
  - google-cloud-logging           # priority: 407
  - sls-logger                     # priority: 406
  - tcp-logger                     # priority: 405
  - kafka-logger                   # priority: 403
  - rocketmq-logger                # priority: 402
  - syslog                         # priority: 401
  - udp-logger                     # priority: 400
  - file-logger                    # priority: 399
  - clickhouse-logger              # priority: 398
  - tencent-cloud-cls              # priority: 397
  - inspect                        # priority: 200
  - example-plugin                 # priority: 0
  - aws-lambda                     # priority: -1899
  - azure-functions                # priority: -1900
  - openwhisk                      # priority: -1901
  - openfunction                   # priority: -1902
  - serverless-post-function       # priority: -2000
  - ext-plugin-post-req            # priority: -3000
  - ext-plugin-post-resp           # priority: -4000
  - ty-geoip                       # priority: -9000
stream_plugins: # sorted by priority
  - ip-restriction                 # priority: 3000
  - limit-conn                     # priority: 1003
  - mqtt-proxy                     # priority: 1000
  - syslog                         # priority: 401

plugin_attr:
  log-rotate:
    interval: 3600    # rotate interval (unit: second)
    max_kept: 168     # max number of log files will be kept
    max_size: -1      # max size bytes of log files to be rotated, size check would be skipped with a value less than 0
    enable_compression: true    # enable log file compression(gzip) or not, default false
  skywalking:
    service_name: APISIX
    service_instance_name: APISIX Instance Name
    endpoint_addr: http://127.0.0.1:12800
  opentelemetry:
    trace_id_source: x-request-id
    resource:
      service.name: APISIX
    collector:
      address: 127.0.0.1:4318
      request_timeout: 3
      request_headers:
        Authorization: token
    batch_span_processor:
      drop_on_queue_full: false
      max_queue_size: 1024
      batch_timeout: 2
      inactive_timeout: 1
      max_export_batch_size: 16
  prometheus:
    export_uri: /apisix/prometheus/metrics
    metric_prefix: apisix_
    enable_export_server: true
    export_addr:
      ip: 10.40.129.15
      port: 9091
  server-info:
    report_ttl: 60   # live time for server info in etcd (unit: second)
  dubbo-proxy:
    upstream_multiplex_count: 32
  request-id:
    snowflake:
      enable: false
      snowflake_epoc: 1609459200000   # the starting timestamp is expressed in milliseconds
      data_machine_bits: 12           # data machine bit, maximum 31, because Lua cannot do bit operations greater than 31
      sequence_bits: 10               # each machine generates a maximum of (1 << sequence_bits) serial numbers per millisecond
      data_machine_ttl: 30            # live time for data_machine in etcd (unit: second)
      data_machine_interval: 10       # lease renewal interval in etcd (unit: second)
  proxy-mirror:
    timeout:                          # proxy timeout in mirrored sub-request
      connect: 60s
      read: 60s
      send: 60s
  inspect:
    delay: 3            # in seconds
    hooks_file: "/usr/local/apisix/plugin_inspect_hooks.lua"

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    admin_key:
      -
        name: admin
        key: **********
        role: admin                 # admin: manage all configuration data

    enable_admin_cors: true         # Admin API support CORS response headers.
    allow_admin:                    # http://nginx.org/en/docs/http/ngx_http_access_module.html#allow
      - 127.0.0.0/24                # If we don't set any IP list, then any IP access is allowed by default.
    admin_listen:                 # use a separate port
      ip: 127.0.0.1                 # Specific IP, if not set, the default value is `0.0.0.0`.
      port: 9180                  # Specific port, which must be different from node_listen's port.

    admin_api_mtls:               # Depends on `admin_listen` and `https_admin`.
      admin_ssl_cert: ""          # Path of your self-signed server side cert.
      admin_ssl_cert_key: ""      # Path of your self-signed server side key.
      admin_ssl_ca_cert: ""       # Path of your self-signed ca cert.The CA is used to sign all admin api callers' certificates.

    admin_api_version: v3         # The version of admin api, latest version is v3.

  etcd:
    host:                           # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "http://*.*.*.*:2379"
    - "http://*.*.*.*:2379"
    - "http://*.*.*.*2379"

    prefix: /apisix                 # configuration prefix in etcd
    use_grpc: false                 # enable the experimental configuration sync via gRPC
    timeout: 30                     # 30 seconds. Use a much higher timeout (like an hour) if the `use_grpc` is true.
    startup_retry: 2                # the number of retry to etcd during the startup, default to 2
    tls:

      verify: true                  # whether to verify the etcd endpoint certificate when setup a TLS connection to etcd,

Expected Behavior

CPU usage should not be pegged at 100%.

Error Logs

2023/07/04 03:15:53 [error] 161670#161670: *27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:*.*.*.*:8500': unlocked, context: ngx.timer, client: *.*.*.*, server: 0.0.0.0:80

Steps to Reproduce

Seems to be random, but we have 2,500+ routes/services and upstreams, all with active health checks enabled.

Environment

kingluo commented 1 year ago

@alptugay Which version of lua-resty-healthcheck were you ultimately running when the error occurred? Do the upstream servers change status frequently at that time, i.e. pods restarting or shutting down, new upstreams being added, or transient network errors?

I suggest using lua-resty-healthcheck 2.2.3, increasing the size of the upstream-healthcheck shared dict (e.g. to 500m), and then checking again.
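
In config.yaml terms that means raising the shared dict used by the HTTP health checker, roughly like this (a sketch; 500m is just the example size suggested above):

nginx_config:
  http:
    lua_shared_dict:
      upstream-healthcheck: 500m    # raised from the 100m shown in the reporter's config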

Also, what is the configuration item redis_healthcheck_shared_dict_size? Have you changed the code of APISIX or lua-resty-healthcheck?

monkeyDluffy6017 commented 1 year ago

Maybe it's the same as https://github.com/apache/apisix/issues/9015

juzhiyuan commented 1 year ago

Hi @leslie-tsang,

Their team is still monitoring the status. They will update when necessary.

github-actions[bot] commented 1 year ago

Due to lack of the reporter's response, this issue has been labeled with "no response". It will be closed in 3 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

alptugay commented 1 year ago

We have encountered the issue once again. Our current findings are:

1) 4 out of 6 APISIX instances in the same environment were affected at different times on the same day. These are identical machines running the same configs.

APISIX INSTANCE 1

2023/08/08 15:30:09 [error] 480719#480719: *24417911412 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:30:10 [error] 480724#480724: *24417921777 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:30:11 [error] 480739#480739: *24417932673 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:30:12 [error] 480716#480716: *24417943934 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.92:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:39 [error] 480728#480728: *24422048257 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:40 [error] 480735#480735: *24422057094 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:41 [error] 480723#480723: *24422065842 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:42 [error] 480733#480733: *24422074126 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 15:38:43 [error] 480736#480736: *24422083755 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:32708': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80

APISIX INSTANCE 2

2023/08/08 14:23:39 [error] 608073#608073: *24408524288 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 14:23:41 [error] 608090#608090: *24408534735 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 14:23:42 [error] 608078#608078: *24408545713 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 14:23:43 [error] 608087#608087: *24408556591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.194:31125': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80

APISIX INSTANCE 3

2023/08/08 16:29:39 [error] 67061#67061: *27466062548 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:29:40 [error] 67045#67045: *27466073872 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:29:41 [error] 67063#67063: *27466084337 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:29:42 [error] 67041#67041: *27466094834 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313:target_lock:XXX.XXX.XXX.94:31725': unlocked, context: ngx.timer, client: XXX.XXX.XXX.193, server: 0.0.0.0:80
2023/08/08 16:34:33 [error] 67043#67043: *27468414132 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.191, server: 0.0.0.0:80
2023/08/08 16:34:34 [error] 67050#67050: *27468422768 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.94, server: 0.0.0.0:80
2023/08/08 16:34:35 [error] 67051#67051: *27468432136 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.94, server: 0.0.0.0:80
2023/08/08 16:34:36 [error] 67042#67042: *27468440315 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.91:30238': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:13:03 [error] 291549#291549: *27482520502 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.196, server: 0.0.0.0:80
2023/08/08 17:13:04 [error] 67057#67057: *27482527129 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
2023/08/08 17:13:05 [error] 67046#67046: *27482533105 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:13:06 [error] 291659#291659: *27482539195 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.92:30057': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
2023/08/08 17:22:33 [error] 291921#291921: *27485105465 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: 10.212.200.129, server: 0.0.0.0:443
2023/08/08 17:22:34 [error] 67058#67058: *27485109964 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:22:36 [error] 291582#291582: *27485115892 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.96, server: 0.0.0.0:80
2023/08/08 17:22:37 [error] 67072#67072: *27485120591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80

APISIX INSTANCE 4

2023/08/08 15:54:03 [error] 65587#65587: *27451715285 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 15:54:04 [error] 65602#65602: *27451726740 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 15:54:05 [error] 65606#65606: *27451738230 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 15:54:06 [error] 65596#65596: *27451750351 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.93:32496': unlocked, context: ngx.timer, client: XXX.XXX.XXX.191, server: 0.0.0.0:80

2) As can be seen in the logs, the errors were related to two upstreams: 58e0a3b27292c0d9267ce78a7dfa45ab and f51cbfa1fac84b2110ba93f947c93313.

3) We found that even though the backends of these upstreams serve plain HTTP, we had mistakenly set the scheme of the upstreams to HTTPS.
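
For reference, correcting the scheme on such an upstream through the Admin API would look roughly like this (a sketch; the admin key is redacted and the port follows the admin_listen setting in the config above):

curl -s http://127.0.0.1:9180/apisix/admin/upstreams/f51cbfa1fac84b2110ba93f947c93313 \
  -H 'X-API-KEY: <redacted>' -X PATCH -d '{"scheme": "http"}'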

4) There are also duplicate IP:port pairs in these upstream definitions (a quick way to list them is sketched after the dumps). Below you can see the content of these upstreams:

For example:

"port": 32330,
"host": "XXX.XXX.XXX.30"

Upstream: 58e0a3b27292c0d9267ce78a7dfa45ab

    {
      "value": {
        "type": "roundrobin",
        "create_time": 1684840040,
        "timeout": {
          "connect": 300,
          "read": 300,
          "send": 300
        },
        "name": "thanos-query-store-api",
        "id": "58e0a3b27292c0d9267ce78a7dfa45ab",
        "hash_on": "vars",
        "desc": "source:consul",
        "pass_host": "pass",
        "update_time": 1691489116,
        "labels": {
          "source": "consul"
        },
        "scheme": "https",
        "nodes": [
          {
            "priority": 0,
            "port": 32347,
            "weight": 1,
            "host": "XXX.XXX.XXX.33"
          },
          {
            "priority": 0,
            "port": 32337,
            "weight": 1,
            "host": "XXX.XXX.XXX.29"
          },
          {
            "priority": 0,
            "port": 32337,
            "weight": 1,
            "host": "XXX.XXX.XXX.31"
          },
          {
            "priority": 0,
            "port": 32330,
            "weight": 1,
            "host": "XXX.XXX.XXX.30"
          },
          {
            "priority": 0,
            "port": 32496,
            "weight": 1,
            "host": "XXX.XXX.XXX.93"
          },
          {
            "priority": 0,
            "port": 32347,
            "weight": 1,
            "host": "XXX.XXX.XXX.31"
          },
          {
            "priority": 0,
            "port": 31141,
            "weight": 1,
            "host": "XXX.XXX.XXX.33"
          },
          {
            "priority": 0,
            "port": 32496,
            "weight": 1,
            "host": "XXX.XXX.XXX.198"
          },
          {
            "priority": 0,
            "port": 32190,
            "weight": 1,
            "host": "XXX.XXX.XXX.195"
          },
          {
            "priority": 0,
            "port": 30057,
            "weight": 1,
            "host": "XXX.XXX.XXX.92"
          },
          {
            "priority": 0,
            "port": 30238,
            "weight": 1,
            "host": "XXX.XXX.XXX.91"
          },
          {
            "priority": 0,
            "port": 31141,
            "weight": 1,
            "host": "XXX.XXX.XXX.30"
          },
          {
            "priority": 0,
            "port": 32330,
            "weight": 1,
            "host": "XXX.XXX.XXX.31"
          },
          {
            "priority": 0,
            "port": 32190,
            "weight": 1,
            "host": "XXX.XXX.XXX.95"
          },
          {
            "priority": 0,
            "port": 32330,
            "weight": 1,
            "host": "XXX.XXX.XXX.30"
          },
          {
            "priority": 0,
            "port": 30057,
            "weight": 1,
            "host": "XXX.XXX.XXX.92"
          },
          {
            "priority": 0,
            "port": 32330,
            "weight": 1,
            "host": "XXX.XXX.XXX.30"
          },
          {
            "priority": 0,
            "port": 32330,
            "weight": 1,
            "host": "XXX.XXX.XXX.31"
          },
          {
            "priority": 0,
            "port": 30238,
            "weight": 1,
            "host": "XXX.XXX.XXX.191"
          }
        ],
        "checks": {
          "active": {
            "unhealthy": {
              "timeouts": 3,
              "http_statuses": [
                429,
                404,
                500,
                501,
                502,
                503,
                504,
                505
              ],
              "interval": 1,
              "http_failures": 5,
              "tcp_failures": 2
            },
            "type": "tcp",
            "http_path": "/",
            "healthy": {
              "successes": 2,
              "http_statuses": [
                200,
                302
              ],
              "interval": 1
            },
            "timeout": 1,
            "concurrency": 10,
            "https_verify_certificate": true
          }
        }
      },
      "modifiedIndex": 1695071,
      "createdIndex": 1165,
      "key": "/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab"
    }

Upstream: f51cbfa1fac84b2110ba93f947c93313

{
      "value": {
        "type": "roundrobin",
        "create_time": 1684840040,
        "timeout": {
          "connect": 300,
          "read": 300,
          "send": 300
        },
        "nodes": [
          {
            "priority": 0,
            "port": 30249,
            "weight": 1,
            "host": "XXX.XXX.XXX.33"
          },
          {
            "priority": 0,
            "port": 31725,
            "weight": 1,
            "host": "XXX.XXX.XXX.94"
          },
          {
            "priority": 0,
            "port": 31914,
            "weight": 1,
            "host": "XXX.XXX.XXX.96"
          },
          {
            "priority": 0,
            "port": 31125,
            "weight": 1,
            "host": "XXX.XXX.XXX.194"
          },
          {
            "priority": 0,
            "port": 31569,
            "weight": 1,
            "host": "XXX.XXX.XXX.30"
          },
          {
            "priority": 0,
            "port": 31500,
            "weight": 1,
            "host": "XXX.XXX.XXX.33"
          },
          {
            "priority": 0,
            "port": 30593,
            "weight": 1,
            "host": "XXX.XXX.XXX.30"
          },
          {
            "priority": 0,
            "port": 32708,
            "weight": 1,
            "host": "XXX.XXX.XXX.194"
          },
          {
            "priority": 0,
            "port": 31125,
            "weight": 1,
            "host": "XXX.XXX.XXX.92"
          },
          {
            "priority": 0,
            "port": 30593,
            "weight": 1,
            "host": "XXX.XXX.XXX.33"
          },
          {
            "priority": 0,
            "port": 31725,
            "weight": 1,
            "host": "XXX.XXX.XXX.93"
          },
          {
            "priority": 0,
            "port": 31569,
            "weight": 1,
            "host": "XXX.XXX.XXX.33"
          },
          {
            "priority": 0,
            "port": 31914,
            "weight": 1,
            "host": "XXX.XXX.XXX.94"
          },
          {
            "priority": 0,
            "port": 32708,
            "weight": 1,
            "host": "XXX.XXX.XXX.95"
          },
          {
            "priority": 0,
            "port": 31500,
            "weight": 1,
            "host": "XXX.XXX.XXX.31"
          },
          {
            "priority": 0,
            "port": 30249,
            "weight": 1,
            "host": "XXX.XXX.XXX.31"
          }
        ],
        "pass_host": "pass",
        "hash_on": "vars",
        "desc": "source:consul",
        "id": "f51cbfa1fac84b2110ba93f947c93313",
        "update_time": 1689321033,
        "labels": {
          "source": "consul"
        },
        "scheme": "https",
        "name": "thanos-queryfrontend-api",
        "checks": {
          "active": {
            "unhealthy": {
              "timeouts": 3,
              "interval": 1,
              "http_statuses": [
                429,
                404,
                500,
                501,
                502,
                503,
                504,
                505
              ],
              "http_failures": 5,
              "tcp_failures": 2
            },
            "type": "tcp",
            "http_path": "/",
            "healthy": {
              "successes": 2,
              "http_statuses": [
                200,
                302
              ],
              "interval": 1
            },
            "timeout": 1,
            "concurrency": 10,
            "https_verify_certificate": true
          }
        }
      },
      "modifiedIndex": 1239801,
      "createdIndex": 1181,
      "key": "/apisix/upstreams/f51cbfa1fac84b2110ba93f947c93313"
    }
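
For reference, a quick way to list the duplicated host:port pairs from one of these dumps (the file name is a placeholder for wherever the object above is saved):

jq -r '.value.nodes[] | "\(.host):\(.port)"' upstream.json | sort | uniq -d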

5) There is only one entry in the output of ss -t that relates to the upstream 58e0a3b27292c0d9267ce78a7dfa45ab:

2023/08/08 17:22:37 [error] 67072#67072: *27485120591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80

SYN-SENT 0 1 YYY.YYY.YYY.YYY:42790 XXX.XXX.XXX.29:32337

XXX.XXX.XXX.29:32337 is one of the nodes of upstream 58e0a3b27292c0d9267ce78a7dfa45ab.
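
For reference, such half-open probe connections can also be filtered directly with ss (the port is the one from the entry above):

ss -tn state syn-sent '( dport = :32337 )'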

6) Here are some logs from just before we start receiving the "failed to release lock" logs:

You can ignore the "worker process **** exited on signal 9" logs, because we have a script that kills the workers as soon as it detects "failed to release lock" logs (a sketch of such a watchdog follows the logs below).

2023/08/08 17:22:27 [error] 291898#291898: *27485100457 [lua] events.lua:273: post(): worker-events: failed posting event "mostly_healthy" by "lua-resty-healthcheck [upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab]"; no memory while connecting to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.91:30238/", host: "thanos-query-store-api.AAAA.BBBB.com"
2023/08/08 17:22:27 [error] 291898#291898: *27485100457 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.95:32190/", host: "thanos-query-store-api.AAAA.BBBB.com"
2023/08/08 17:22:28 [error] 291695#291695: *27485100523 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291632#291632: *27485100556 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67058#67058: *27485100578 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291940#291940: *27485100654 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67044#67044: *27485100658 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291520#291520: *27485100669 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291921#291921: *27485100780 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291682#291682: *27485100840 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291582#291582: *27485100855 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67077#67077: *27485100854 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 291877#291877: *27485100859 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:28 [error] 67072#67072: *27485100860 [lua] events.lua:364: poll(): worker-events: dropping event; waiting for event data timed out, id: 10019295, context: ngx.timer
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.94:31725/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.93:31725/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.194:31125/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.194:32708/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.92:31125/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.96:31914/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.94:31914/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 SSL_do_handshake() failed (SSL: error:1408F10B:SSL routines:ssl3_get_record:wrong version number) while SSL handshaking to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.95:32708/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291520#291520: *27485126279 [lua] balancer.lua:362: run(): failed to pick server: failed to find valid upstream server, all upstream servers tried while connecting to upstream, client: XXX.XXX.XXX.191, server: _, request: "GET / HTTP/1.1", upstream: "https://XXX.XXX.XXX.95:32708/", host: "thanos-queryfrontend-api.AAAA.BBBB.com"
2023/08/08 17:22:33 [error] 291921#291921: *27485105465 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: YYY.YYY.YYY.129, server: 0.0.0.0:443
2023/08/08 17:22:34 [alert] 27489#27489: worker process 291921 exited on signal 9
2023/08/08 17:22:34 [error] 67058#67058: *27485109964 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.197, server: 0.0.0.0:80
2023/08/08 17:22:35 [alert] 27489#27489: worker process 67058 exited on signal 9
2023/08/08 17:22:36 [error] 291582#291582: *27485115892 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.96, server: 0.0.0.0:80
2023/08/08 17:22:37 [alert] 27489#27489: worker process 291582 exited on signal 9
2023/08/08 17:22:37 [error] 67072#67072: *27485120591 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/upstreams/58e0a3b27292c0d9267ce78a7dfa45ab:target_lock:XXX.XXX.XXX.95:32190': unlocked, context: ngx.timer, client: XXX.XXX.XXX.198, server: 0.0.0.0:80
2023/08/08 17:22:38 [alert] 27489#27489: worker process 67072 exited on signal 9
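
For reference, a minimal sketch of the kind of watchdog script mentioned above (the log path is a placeholder; the config in this issue ships error logs to rsyslog, so it would point at wherever those lines end up):

#!/bin/sh
# Kill (-9) any worker that logs the "failed to release lock" error.
LOG=/path/to/apisix-error.log   # placeholder for the file rsyslog writes the [error] lines to
tail -Fn0 "$LOG" | while read -r line; do
  case "$line" in
    *"failed to release lock"*)
      # Error lines look like "... [error] <pid>#<pid>: ..."; extract the worker pid.
      pid=$(printf '%s\n' "$line" | sed -n 's/.*\[error\] \([0-9][0-9]*\)#.*/\1/p')
      [ -n "$pid" ] && kill -9 "$pid"
      ;;
  esac
done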

I think the SSL errors are related to our finding number 3.

Our worker-events dict size is 1024m.

7) We also have an interesting finding.

This is a normal log for the related service (after we kill the affected workers); it is expected that all upstreams return 502 because of our finding number 3:

{"@timestamp":"2023-08-08T17:56:03.484664+03:00", "_team":"traffic", "_app":"loadbalancer", "_federation":"AAA", "_region":"BBB", "_environment":"prod", "_cluster":"internal", "_hostname":"Apisix01", "_service":"loadbalancer", "tag":"lb_access_non_2xx3xx:", "severity":"6", "facility":"23", "severity_label":"info", "facility_label":"local7", "cf_ipcountry":"", "http_x_client_ip":"", "http_True_Client_IP":"", "upstream_http_X_Proxy_Cache":"", "request_id":"5364e8127a236e359699bffe80851093", "route_id":"", "request_length":"120", "remote_addr":"XXX.XXX.XXX.196", "remote_port":"10506", "request":"GET / HTTP/1.1", "args":"", "uri":"/", "status":"502", "bytes_sent":"754", "http_user_agent":"Go-http-client/1.1", "http_x_forwarded_for":"", "http_host":"thanos-queryfrontend-api.aaaa.bbbb.com", "server_name":"_", "request_time":"0.012", "upstream":"XXX.XXX.XXX.95:32708, XXX.XXX.XXX.31:31500, XXX.XXX.XXX.31:30249, XXX.XXX.XXX.33:30249, XXX.XXX.XXX.30:31569, XXX.XXX.XXX.33:31500, XXX.XXX.XXX.30:30593, XXX.XXX.XXX.33:30593, XXX.XXX.XXX.33:31569, XXX.XXX.XXX.94:31725, XXX.XXX.XXX.96:31914, XXX.XXX.XXX.194:31125, XXX.XXX.XXX.194:32708, XXX.XXX.XXX.92:31125, XXX.XXX.XXX.93:31725, XXX.XXX.XXX.94:31914", "upstream_connect_time":"-, -, -, -, -, -, -, -, -, -, -, -, -, -, -, -", "upstream_status":"502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502, 502", "upstream_response_time":"0.002, 0.000, 0.000, 0.001, 0.000, 0.000, 0.001, 0.000, 0.000, 0.002, 0.000, 0.001, 0.001, 0.001, 0.001, 0.001", "upstream_cache_status":"", "ssl_protocol":"", "ssl_cipher":"", "scheme":"http", "server_port":"80", "request_method":"GET", "server_protocol":"HTTP/1.1", "http_cf_ray":"", "ty_lb_waf_id":"", "ty_lb_cc":"", "ty_lb_asn":""}

But this is an unusual log; during the problem, the logs related to this service look like the one below:

{"@timestamp":"2023-08-08T17:37:03.482507+03:00", "_team":"traffic", "_app":"loadbalancer", "_federation":"AAA", "_region":"BBB", "_environment":"prod", "_cluster":"internal", "_hostname":"Apisix01", "_service":"loadbalancer", "tag":"lb_access_non_2xx3xx:", "severity":"6", "facility":"23", "severity_label":"info", "facility_label":"local7", "cf_ipcountry":"", "http_x_client_ip":"", "http_True_Client_IP":"", "upstream_http_X_Proxy_Cache":"", "request_id":"ef16786dd332690d1fd06228bb08788c", "route_id":"", "request_length":"120", "remote_addr":"XXX.XXX.XXX.94", "remote_port":"61553", "request":"GET / HTTP/1.1", "args":"", "uri":"/", "status":"502", "bytes_sent":"680", "http_user_agent":"Go-http-client/1.1", "http_x_forwarded_for":"", "http_host":"thanos-queryfrontend-api.aaaa.bbbb.com", "server_name":"_", "request_time":"0.014", "upstream":"XXX.XXX.XXX.96:31914, XXX.XXX.XXX.194:31125, XXX.XXX.XXX.194:32708, XXX.XXX.XXX.92:31125, XXX.XXX.XXX.93:31725, XXX.XXX.XXX.94:31914, XXX.XXX.XXX.95:32708, XXX.XXX.XXX.94:31725 : ", "upstream_connect_time":"-, -, -, -, -, -, -, - : ", "upstream_status":"502, 502, 502, 502, 502, 502, 502, 502 : ", "upstream_response_time":"0.002, 0.001, 0.005, 0.001, 0.001, 0.001, 0.002, 0.001 : ", "upstream_cache_status":"", "ssl_protocol":"", "ssl_cipher":"", "scheme":"http", "server_port":"80", "request_method":"GET", "server_protocol":"HTTP/1.1", "http_cf_ray":"", "ty_lb_waf_id":"", "ty_lb_cc":"", "ty_lb_asn":""}

Please look at the upstream, upstream_status, upstream_connect_time, and upstream_response_time fields. They look incomplete; they all end with a colon character.

juzhiyuan commented 1 year ago

Hello @alptugay,

The issue is being fixed; check https://github.com/api7/lua-resty-healthcheck/commit/f1d81af1601915c7dcbe160078d4f1d09424ba36 for more information.

Sn0rt commented 1 year ago

@alptugay have you tried this fix? https://github.com/api7/lua-resty-healthcheck/commit/f1d81af1601915c7dcbe160078d4f1d09424ba36

Revolyssup commented 1 year ago

@alptugay you can try APISIX version >= 3.4, where the lua-resty-healthcheck library is upgraded.

Revolyssup commented 1 year ago

Fixed by https://github.com/api7/lua-resty-healthcheck/commit/f1d81af1601915c7dcbe160078d4f1d09424ba36