apache / apisix

The Cloud-Native API Gateway
https://apisix.apache.org/blog/
Apache License 2.0

bug: Upstream HealthCheck Issue - Unhealthy Upstream is not temporarily excluded #11016

Open kworkbee opened 9 months ago

kworkbee commented 9 months ago

Current Behavior

Same as apache/apisix-ingress-controller#2176.

An unhealthy external service keeps receiving traffic instead of being excluded from the routing targets.

(Screenshot: Mar-06-2024 17-37-44)

Expected Behavior

Two external services, each with an ALB in front of it, are configured as upstream nodes. A node that returns a 5XX error should be temporarily excluded from routing by the health check configuration.
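The expected behavior can be sketched as a small threshold-based tracker. This is only an illustrative model, not APISIX's implementation: the `HealthTracker` and `NodeHealth` names are hypothetical, the status code lists and thresholds are copied from the ApisixUpstream manifest in Steps to Reproduce, and the fallback to all nodes when none are healthy is a choice made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class NodeHealth:
    """Per-node counters for a threshold-based health check (illustrative only)."""
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0


@dataclass
class HealthTracker:
    # Thresholds mirror httpFailures: 1 and successes: 1 from the manifest.
    failure_threshold: int = 1
    success_threshold: int = 1
    unhealthy_codes: frozenset = frozenset({500, 501, 502, 503, 504})
    healthy_codes: frozenset = frozenset({200, 404})
    nodes: dict = field(default_factory=dict)

    def report(self, node: str, status_code: int) -> None:
        """Record one observed status code and update the node's health state."""
        state = self.nodes.setdefault(node, NodeHealth())
        if status_code in self.unhealthy_codes:
            state.consecutive_failures += 1
            state.consecutive_successes = 0
            if state.consecutive_failures >= self.failure_threshold:
                state.healthy = False
        elif status_code in self.healthy_codes:
            state.consecutive_successes += 1
            state.consecutive_failures = 0
            if state.consecutive_successes >= self.success_threshold:
                state.healthy = True

    def routable(self, candidates):
        """Return the nodes that should currently receive traffic."""
        healthy = [n for n in candidates if self.nodes.get(n, NodeHealth()).healthy]
        # Sketch-only design choice: if every node is unhealthy, use all of them
        # rather than dropping requests.
        return healthy or list(candidates)
```

With this model, a single 503 reported for svc02.corp.com removes it from `routable(...)` until a 200 or 404 is reported for it again.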

Error Logs

No response

Steps to Reproduce

To reproduce the issue, one service is deployed and the other is not (only the ALBs are set up).

apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: route
  namespace: apisix
spec:
  http:
    - match:
        hosts:
          - kubernetes.corp.com
        methods:
          - POST
        paths:
          - /svc/*
      name: route
      plugins:
        - config:
            regex_uri:
              - ^\/svc\/(.+)$
              - /$1
          enable: true
          name: proxy-rewrite
      upstreams:
        - name: upstream
          weight: 100
---
apiVersion: apisix.apache.org/v2
kind: ApisixUpstream
metadata:
  name: upstream
  namespace: apisix
spec:
  externalNodes:
    - name: svc01.corp.com
      port: 443
      type: Domain
      weight: 50
    - name: svc02.corp.com
      port: 443
      type: Domain
      weight: 50
  healthCheck:
    active:
      healthy:
        httpCodes:
          - 200
          - 404
        interval: 3s
        successes: 1
      httpPath: /
      type: https
      unhealthy:
        httpCodes:
          - 500
          - 501
          - 502
          - 503
          - 504
        httpFailures: 1
        tcpFailures: 1
        interval: 3s
        timeouts: 3
    passive:
      healthy:
        httpCodes:
          - 200
          - 404
        successes: 1
      type: https
      unhealthy:
        httpCodes:
          - 500
          - 501
          - 502
          - 503
          - 504
        httpFailures: 1
        tcpFailures: 1
        timeouts: 3
  loadbalancer:
    type: roundrobin
  passHost: node
  scheme: https

Environment

Runs on an AWS EKS cluster (Kubernetes v1.25). Uses the APISIX Helm chart (1.11.0, App 3.8.0).

hanqingwu commented 8 months ago

Can you retrieve the health check information?
curl -i http://127.0.0.1:9090/v1/healthcheck
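For anyone who prefers to capture that output in a script, here is a minimal sketch that fetches the same URL and pretty-prints whatever JSON comes back. The address is taken from the curl command above; the helper names `pretty` and `fetch` are hypothetical, and the sketch deliberately makes no assumption about the fields in the response, since those may differ between APISIX versions.

```python
import json
import urllib.request

# Address copied from the curl command above; adjust it if the control API
# listens elsewhere in your deployment.
CONTROL_API_URL = "http://127.0.0.1:9090/v1/healthcheck"


def pretty(raw: bytes) -> str:
    """Re-serialize a JSON payload with indentation and sorted keys."""
    return json.dumps(json.loads(raw), indent=2, sort_keys=True)


def fetch(url: str = CONTROL_API_URL, timeout: float = 5.0) -> str:
    """Fetch the health check report and return it as formatted JSON text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return pretty(resp.read())
```

Calling `print(fetch())` from a shell inside the APISIX pod should print the same data as the curl command, in a form that is easier to paste into a comment.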

shreemaan-abhishek commented 8 months ago

Does this bug exist even if you don't use the ingress controller?

kworkbee commented 8 months ago

@hanqingwu The node that should be Unhealthy is marked Healthy.

The log shows the following (failed SSL handshake):

2024/03/11 06:46:31 [error] 50#50: *4510567 [lua] healthcheck.lua:1383: log(): [healthcheck] (upstream#/apisix/upstreams/23eb23c7) failed SSL handshake with 'X.X.X.X (X.X.X.X:443)', using server name (sni) 'svc02.corp.com': 19: self-signed certificate in certificate chain, context: ngx.timer, client: X.X.X.X, server: 0.0.0.0:9080

@shreemaan-abhishek The same symptom appears even when the ingress controller is not deployed.

shreemaan-abhishek commented 8 months ago

@kworkbee Please share repro steps for APISIX.

kworkbee commented 8 months ago

@shreemaan-abhishek I would like to set it up as shown in the following diagram. (Image)

APISIX is installed in the tools cluster with the Helm chart, and the ApisixRoute/ApisixUpstream objects are deployed as written in the description above.

I want traffic to be split 50:50 between the clusters, and when one of them fails, I want its share of the weight to shift to the remaining clusters.

However, despite the upstream health check settings, the upstream node that currently returns 503 is not automatically excluded.

The relevant entries in the APISIX log are as follows.

2024/03/18 11:46:03 [error] 49#49: *14808 [lua] healthcheck.lua:1383: log(): [healthcheck] (upstream#/apisix/upstreams/32eb11c7) failed SSL handshake with 'X.X.X.X (X.X.X.X:443)', using server name (sni) 'svc01.corp.com': 19: self-signed certificate in certificate chain, context: ngx.timer, client: X.X.X.X, server: 0.0.0.0:9080
2024/03/18 11:46:06 [warn] 49#49: *14846 [lua] balancer.lua:82: fetch_health_nodes(): failed to get health check target status, addr: X.X.X.X:443, host: nil, err: target not found, client: X.X.X.X, server: _, request: "POST /feature-flags/flagd.evaluation.v1.Service/ResolveBoolean HTTP/1.1", host: "kubernetes.corp.com"

kworkbee commented 8 months ago

The log contains the error "19: self-signed certificate in certificate chain". Does that matter?
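For context, 19 is OpenSSL's verification error code X509_V_ERR_SELF_SIGNED_CERT_IN_CHAIN. The log therefore suggests that the health check's TLS handshake is rejected during certificate verification, before any HTTP status code from the node can be evaluated. How APISIX counts such a handshake failure is not established in this thread. The sketch below is not APISIX code. It is a standalone probe, with hypothetical helper names, that could be used to confirm from another machine whether a node's certificate chain verifies against the local trust store.

```python
import socket
import ssl


def make_probe_context(verify_certificate: bool) -> ssl.SSLContext:
    """Build a client TLS context with certificate verification on or off."""
    ctx = ssl.create_default_context()
    if not verify_certificate:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe_tls(host: str, port: int = 443, verify_certificate: bool = True,
              timeout: float = 3.0):
    """Return None if the handshake succeeds, else an error code or message."""
    ctx = make_probe_context(verify_certificate)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return None
    except ssl.SSLCertVerificationError as exc:
        # OpenSSL verify code, e.g. 19 for a self-signed certificate in the chain.
        return exc.verify_code
    except OSError as exc:
        # Connection refused, timeout, or another TLS error.
        return str(exc)
```

If `probe_tls("svc01.corp.com")` returns 19 while `probe_tls("svc01.corp.com", verify_certificate=False)` returns None, the certificate chain rather than the service itself is what the probe is failing on.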

kworkbee commented 8 months ago

@shreemaan-abhishek Can you please take a look?