Kong / kubernetes-ingress-controller

:gorilla: Kong for Kubernetes: The official Ingress Controller for Kubernetes.
https://docs.konghq.com/kubernetes-ingress-controller/
Apache License 2.0

Azure | KIC and Kong | Chokes Under Load #860

Closed: narayaar closed this issue 3 years ago

narayaar commented 4 years ago

KIC Version: 0.7.1
Kong CE or EE: Kong CE v2.0.4
K8S Version: 1.17.11
Docker CE Version: 19.03.11

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.6", GitCommit:"d32e40e20d167e103faf894261614c5b45c44198", GitTreeState:"clean", BuildDate:"2020-05-20T13:16:24Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.11", GitCommit:"ea5f00d93211b7c80247bf607cfa422ad6fb5347", GitTreeState:"clean", BuildDate:"2020-08-13T15:11:47Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Environment

What happened

[error] 33#0: in lua semaphore gc wait queue is not empty while the semaphore 00007F54212A1CB8 is being destroyed
[error] 33#0: 3426060 upstream prematurely closed connection while reading response header from upstream
[crit] 39#0: 240711 connect() to XXXXXXX:443 failed (99: Address not available) while connecting to upstream
[error] 43#0: 10030416 peer closed connection in SSL handshake (104: Connection reset by peer) while SSL handshaking to upstream
[error] 33#0: 6478079 [lua] connector.lua:353: unable to clean expired rows from PostgreSQL database (receivemessage: failed to get type: timeout), context: ngx.timer

Expected behavior

Load tests to the external backend, via the Kong Proxy NodePort, should ideally generate the same results.

Steps To Reproduce

  1. Spawn 2 Azure Standard_D16s_v3 CentOS 7 Instances.
  2. Create a k8s cluster via the Rancher UI and assign both instances to it. I've assigned a separate instance for control plane/etcd, but that should not be a problem.
  3. Deploy Kong on both instances.
  4. Create the routes/services/upstreams/plugins in another namespace (see the sketch after this list).
  5. Run the load tests.
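To make step 4 concrete, here is a minimal sketch of how the external backend is exposed through Kong. The namespace (loadtest), backend hostname (backend.example.com), and resource names are hypothetical, and the konghq.com/* annotation names vary across KIC versions:

```sh
# Hypothetical namespace for the test resources.
kubectl create namespace loadtest

# ExternalName Service + Ingress so Kong proxies /loadtest to the external HTTPS backend.
kubectl apply -n loadtest -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: external-backend
  annotations:
    konghq.com/protocol: "https"        # illustrative: ask Kong to use TLS towards the upstream
spec:
  type: ExternalName
  externalName: backend.example.com     # hypothetical external backend
---
apiVersion: networking.k8s.io/v1beta1   # Ingress API available on k8s 1.17
kind: Ingress
metadata:
  name: external-backend
  annotations:
    kubernetes.io/ingress.class: "kong"
spec:
  rules:
  - http:
      paths:
      - path: /loadtest
        backend:
          serviceName: external-backend
          servicePort: 443
EOF
```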

Extra Information

Some context: we initially started seeing the 502s (POST being replaced with GET), backed by the infamous "upstream prematurely closed connection" error message, in a production Kong cluster a month or so ago, when a customer complained that they were getting 502s. On further enquiry, we found that the customer was running jobs generating 120000 rps over 15-minute windows, with 1-2 GB total upload (4 times a day, 3-4 days a week). When this was happening, the Kong instances were choking, affecting communication with all pods routing traffic through Kong (timeouts etc.). This customer is new, and none of the other customers generate this load, so we have no baseline to compare against. Rate limiting has been suggested, but the customer maintains that this load is not excessive and should be easily manageable. Since the service they were calling points to an external backend via Kong, we rebuilt the microservice to point to the external backend directly, and so far, no complaints.

A test environment was therefore created to see how far Kong could go with different load test parameters, and this is where we started identifying issues. Deployments vs. DaemonSets: no avail. Changing VM sizes, from Standard_D16s_v3 all the way up to Standard_F32s_v2: no avail. These exercises stump me, especially considering Kong should perform very well in this environment.

I'm currently grasping at straws, so any help pointing me in the right direction would be highly appreciated. I have the test environment ready. Thank you.

hbagdi commented 4 years ago

If you can upgrade to 2.1.3 (or 2.1.4 if available, which is being released as I type this), that would help. I understand that this is in production, so it might not be feasible to alleviate the problem quickly.

Could you share the number of nginx workers you have?

@dndx @bungle @javierguerragiraldez might have more insights in the error messages that kong is reporting.

narayaar commented 4 years ago

Thanks for getting back, @hbagdi. I've gone ahead and upgraded Kong CE to 2.1.4. The number of workers is currently 16, since the VM size is Standard_D16s_v3. The VM size can be changed if the need arises.
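For reference, a quick way to confirm and pin the worker count (pod, deployment, container and namespace names here are hypothetical; Kong's nginx_worker_processes defaults to "auto", i.e. one worker per vCPU, which is where the 16 comes from on a 16-vCPU node):

```sh
# Inspect the generated nginx.conf inside a running proxy pod (Kong's default prefix is /usr/local/kong).
kubectl exec -n kong <kong-proxy-pod> -c proxy -- grep worker_processes /usr/local/kong/nginx.conf

# Pin the value explicitly through the environment on the proxy workload.
kubectl set env deployment/ingress-kong -n kong KONG_NGINX_WORKER_PROCESSES=16
```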

Let me run some load tests and get back to this group.

narayaar commented 4 years ago

Hello @hbagdi: My apologies for the delay in getting back.

Kong CE 2.1.4 seems to be working well; 502s (POSTs logged as GETs) are reported far less frequently, and a number of other Lua-related errors no longer crop up.

After running multiple load tests with different instance sizes to identify what works best, this is what I settled on:

  1. Sizes: Standard_F16s_v2 (16 vCPUs / 32 GB memory) for both k8s worker nodes running the KIC/Kong pods.
  2. Kong-relevant tunable: nginx_main_worker_rlimit_nofile = 1048576 (see the sketch after this list).
  3. Load test run: 5000 rps.
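For item 2, a sketch of how that tunable is fed to Kong in Kubernetes (workload and namespace names are hypothetical): Kong maps KONG_NGINX_MAIN_* environment variables to directives injected into the main block of the generated nginx.conf, and a higher worker_rlimit_nofile is usually paired with a higher worker_connections:

```sh
# Raise the per-worker fd limit and the per-worker connection ceiling on the proxy workload.
kubectl set env daemonset/kong-proxy -n kong \
  KONG_NGINX_MAIN_WORKER_RLIMIT_NOFILE=1048576 \
  KONG_NGINX_EVENTS_WORKER_CONNECTIONS=65536
```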

Flow: Vegeta client -> Kong data plane container (NodePort / HTTPS) -> external service (HTTPS).
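A minimal Vegeta invocation matching that flow (the node IP, NodePort and path are hypothetical; -insecure skips certificate verification against the proxy):

```sh
echo "GET https://10.0.0.4:32443/loadtest" | \
  vegeta attack -rate=5000 -duration=60s -insecure | \
  vegeta report
```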

It has been observed that both k8s worker nodes can handle a combined load of 10000 rps (either 10000 rps to a single worker node, or a split across both nodes not exceeding 10000 rps in total). If the rps goes past 10000, multiple 502/504 responses are observed, in addition to "connection reset by peer" errors on the client. Based on the instance size and the tunable, each node should easily support more than 10000 rps.

I feel the next step is to look at either changing the CNI or enforcing a set of TCP-related kernel tunables (a sketch of the sysctls I have in mind follows). I will report back once I compare performance against the other CNIs supported by Rancher 2.
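The kind of TCP tunables I have in mind, as a sketch (exact values still need testing); the "(99: Address not available)" errors above look consistent with ephemeral source-port exhaustion on the nodes, which is what the first two settings target:

```sh
sysctl -w net.ipv4.ip_local_port_range="1024 65535"  # widen the ephemeral port range for outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1                    # allow reuse of TIME_WAIT sockets for new outbound connections
sysctl -w net.core.somaxconn=65535                   # larger accept/listen backlog
sysctl -w net.core.netdev_max_backlog=65535          # larger per-CPU packet backlog
```

These would need to be persisted under /etc/sysctl.d/ on the worker nodes to survive reboots.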

In the interim, do you have any idea where the limitation would be?

humayunjamal commented 3 years ago

@narayaar any update on this? Did you find a solution?

narayaar commented 3 years ago

Hello @humayunjamal: Yes. On our side, we worked on the following areas:

  1. Reusing plugins to reduce overhead (see the sketch after this list).
  2. Kernel tunables on the k8s workers and Kong data plane pods.
  3. Working with Azure on the WAF v2 compute units (bugs + scaling).
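As a rough illustration of item 1 (the plugin choice, namespace and resource names are hypothetical, and annotation names vary across KIC versions): one shared KongPlugin resource attached to several Ingresses, instead of a near-identical copy per route:

```sh
# A single shared plugin definition (correlation-id is just an example plugin).
kubectl apply -n loadtest -f - <<'EOF'
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: shared-correlation-id
plugin: correlation-id
config:
  header_name: X-Request-Id
  echo_downstream: true
EOF

# Attach the same plugin to an existing Ingress by annotation (repeat per Ingress).
kubectl annotate ingress external-backend -n loadtest \
  konghq.com/plugins=shared-correlation-id --overwrite
```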

In addition, we are planning to upgrade both KIC and Kong CE to the latest available versions. Things should be much better with all the bugs that have been fixed since.

humayunjamal commented 3 years ago

@narayaar Appreciate your swift response, mate. Thanks :)

zffocussss commented 3 years ago

Is there any update?