apache / apisix-ingress-controller

APISIX Ingress Controller for Kubernetes
https://apisix.apache.org/
Apache License 2.0

bug: request timed out when upstream pod is scaling down #2030

Closed zhuoyang closed 1 year ago

zhuoyang commented 1 year ago

Current Behavior

I am facing request timeouts when upstream pods scale down. I tested this by running k6 against a Go hello-world service exposed via APISIX: when I trigger a scale down from 3 replicas to 1 replica, "request timeout" errors start showing up roughly a minute later. I also cross-tested this with ingress-nginx, and the same behaviour does not happen.

Expected Behavior

Request timeouts should not occur.

Error Logs

This might be related, but I am not sure.

k6 logs:

WARN[0227] Request Failed                                error="Get \"https://REDACTED/staging-apigw-loadtest/\": request timeout"
WARN[0228] Request Failed                                error="Get \"https://REDACTED/staging-apigw-loadtest/\": request timeout"
WARN[0228] Request Failed                                error="Get \"https://REDACTED/staging-apigw-loadtest/\": request timeout"
(the same "request timeout" warning repeats many more times for the remainder of the run)

APISIX Logs:

2023/11/02 08:57:40 [error] 49#49: *308877 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.2.7.175, server: _, request: "GET /staging-apigw-loadtest/ HTTP/2.0", upstream: "http://10.2.47.135:8080/", host:  REDACTED
2023/11/02 08:57:41 [error] 51#51: *309863 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.2.15.107, server: _, request: "GET /staging-apigw-loadtest/ HTTP/2.0", upstream: "http://10.2.47.135:8080/", host: REDACTED
2023/11/02 08:57:41 [error] 51#51: *309123 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.2.18.115, server: _, request: "GET /staging-apigw-loadtest/ HTTP/2.0", upstream: "http://10.2.47.135:8080/", host: REDACTED
2023/11/02 08:57:41 [error] 52#52: *308838 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.2.7.175, server: _, request: "GET /staging-apigw-loadtest/ HTTP/2.0", upstream: "http://10.2.47.135:8080/", host:  REDACTED
2023/11/02 08:57:41 [error] 51#51: *308757 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.2.7.175, server: _, request: "GET /staging-apigw-loadtest/ HTTP/2.0", upstream: "http://10.2.47.135:8080/", host:  REDACTED

Steps to Reproduce

  1. Create a Deployment that hosts a simple endpoint returning hello world
  2. Expose the Deployment with a Service
  3. Create an ApisixRoute that points to the Service
  4. Scale the Deployment up to 3 replicas
  5. Run k6 against the endpoint
  6. Scale the Deployment down to 1 replica
  7. Wait for 1 minute
  8. Request timed out warnings show up in k6

Environment

juzhiyuan commented 1 year ago

P.S. This issue involves the etcd-less mode.

shreemaan-abhishek commented 1 year ago

@zhuoyang could you please share how you installed apisix-ingress? I want to make sure there are no differences in the installation process.

zhuoyang commented 1 year ago

I am using the helm chart with config.etcdserver.enabled: true
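
(For reference, in values.yaml form that setting corresponds to roughly the following sketch; this is an assumption about the chart's value layout based on the --set flag, not a full configuration.)

```yaml
# values.yaml passed to the apisix-ingress-controller Helm chart (sketch)
config:
  etcdserver:
    # run the controller with its embedded etcd server (etcd-less / composite mode)
    enabled: true
```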

zhuoyang commented 1 year ago

After more investigation, I believe this is a race condition where the pod terminates faster than APISIX removes it from the upstream. I managed to fix the issue by setting a preStop hook that sleeps for 15 seconds (see the sketch below); if the application takes more time to terminate, the issue shouldn't occur. I am not sure if this is something the APISIX Ingress controller can fix, but it is worth mentioning that ingress-nginx doesn't have the same issue.
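
A minimal sketch of that workaround, assuming a hypothetical hello-world Deployment whose image ships a sleep binary (only the lifecycle block is the relevant part):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world            # hypothetical name for the upstream workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: hello-world:latest   # placeholder image
        ports:
        - containerPort: 8080
        lifecycle:
          preStop:
            exec:
              # keep the pod serving a little longer so the controller can
              # remove it from the APISIX upstream before SIGTERM is delivered
              command: ["sleep", "15"]
```

The sleep only needs to cover the window between the endpoint update and the upstream update in APISIX; note that the preStop hook counts against terminationGracePeriodSeconds (30s by default), which must stay larger than the sleep plus the application's own shutdown time.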

shreemaan-abhishek commented 1 year ago

I am using the helm chart with config.etcdserver.enabled: true

But @juzhiyuan mentioned that you are using the etcd-less installation method. Isn't that the standalone mode? Or perhaps I don't have enough context; I might have to catch up with the docs/code.

shreemaan-abhishek commented 1 year ago

the pod terminates faster than APISIX removing the pod from upstream

This is very likely to occur as the updates in the gateway might have some latency in being reflected to the ingress controller.

zhuoyang commented 1 year ago

Isn't that the standalone mode?

It is not the standalone mode; it is the composite architecture.

lingsamuel commented 1 year ago

I can't reproduce it. Here are my reproduction steps (kn is shorthand for kubectl -n $current_ns):

  1. install apisix-ingress-controller with options:

    --set image.repository="apache/apisix-ingress-controller" \
    --set image.tag="1.7.0" \
    --set config.etcdserver.enabled=true
  2. deploy and expose workload by

    kn create deployment httpbin --image=kennethreitz/httpbin --port 80
    kn expose deploy/httpbin --port 80
  3. create ApisixRoute

    apiVersion: apisix.apache.org/v2
    kind: ApisixRoute
    metadata:
      name: httpbin-route
    spec:
      http:
      - name: rule1
        match:
          hosts:
          - httpbin.com
          paths:
          - /*
        backends:
        - serviceName: httpbin
          servicePort: 80
  4. scale httpbin deployment replica to 3

  5. simply start a k6 pod to run script

    kn run k6 --image grafana/k6 --image-pull-policy IfNotPresent --command -- sh -c "trap : TERM INT; sleep 99d & wait"
  6. inside k6 pod, execute a demo script from its tutorial

    
    import { check } from "k6";
    import http from "k6/http";

    // define configuration
    export const options = {
      // define thresholds
      thresholds: {
        http_req_failed: ['rate<0.01'], // http errors should be less than 1%
        http_req_duration: ["p(99)<1000"], // 99% of requests should be below 1s
      },
      // define scenarios
      scenarios: {
        // arbitrary name of scenario
        average_load: {
          executor: "ramping-vus",
          stages: [
            // ramp up to average load of 20 virtual users
            { duration: "5s", target: 20 },
            // maintain load
            { duration: "120s", target: 20 },
            // ramp down to zero
            { duration: "5s", target: 0 },
          ],
        },
      },
    };

    export default function () {
      // define URL and request body
      const url = "http://apisix-apisix-ingress-controller-apisix-gateway/post";
      const payload = JSON.stringify({
        username: "test_case",
        password: "1234",
      });

      const params = {
        headers: {
          "Content-Type": "application/json",
          "Host": "httpbin.com",
        },
      };

      // send a post request and save response as a variable
      const res = http.post(url, payload, params);

      // check that response is 200
      check(res, {
        "response code was 200": (res) => res.status == 200,
      });
    }


run it by
```bash
k6 run api-test.js --http-debug --log-output=stdout > log.log
```

  7. scale httpbin deployment replica to 1
  8. wait for k6 script end
  9. get the log (it is 200M+), grep it, and check whether there are any non-200 responses (screenshot)

Only a single 502 response. I think it's acceptable. No timeout requests found.

zhuoyang commented 1 year ago

Could you try running with higher, constant VUs? I think it might happen only within a split second, so higher VUs might have a better chance of reproducing it.

import http from "k6/http";
import { sleep } from 'k6';

export const options = {
        stages: [
            { duration: '0s', target: 1000},
            { duration: '300s', target: 1000}
        ]
};
export default function apisix() {
    http.get("http://apisix-apisix-ingress-controller-apisix-gateway/");
    sleep(Math.random() * 5);
}
lingsamuel commented 1 year ago

(screenshot attached)

same

zhuoyang commented 1 year ago

Could you try to grep Request Failed instead?

lingsamuel commented 1 year ago

(screenshot attached) No such lines.

lingsamuel commented 1 year ago

You can try to set ResolveGranularity to service to avoid endpoint resolve.
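
For what it's worth, a sketch of that on the httpbin route used earlier in this thread, assuming the backend-level resolveGranularity field (the default is endpoint):

```yaml
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: httpbin-route
spec:
  http:
  - name: rule1
    match:
      hosts:
      - httpbin.com
      paths:
      - /*
    backends:
    - serviceName: httpbin
      servicePort: 80
      # resolve to the Service ClusterIP instead of individual Pod endpoints,
      # so kube-proxy absorbs pod churn rather than the APISIX upstream
      resolveGranularity: service
```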

zhuoyang commented 1 year ago

You can try to set ResolveGranularity to service to avoid endpoint resolve.

That is good to know. I believe this issue only happens when the application terminates faster than the endpoint update; in my case it is just a simple web server written in Go. For now I'll stick with the preStop hook option. You may close this issue if you believe there's nothing to be improved on the APISIX Ingress controller side.

Sn0rt commented 1 year ago

After previous phone discussions, we discovered that nginx actually has the same problem.

When a pod is scaled down, its endpoint disappears from the apisix-ingress-controller's view, and it takes time to remove the node for that endpoint from the APISIX upstream. Although this period of time is short, it does exist. This is why a 502 was also seen in the previous stress test.

Thank you for your solution; we will write it into the best practices of the APISIX Ingress controller.

Can we close this issue?

zhuoyang commented 1 year ago

yeah sure, thanks!