Kong / kong


KONG performance break down while updating entity via admin API #7543

Closed · 23henne closed this issue 11 months ago

23henne commented 3 years ago

Summary

We are currently facing a production issue. Once we apply a change via PATCH to one particular route entity, the Kong nodes can no longer operate at a normal level: they do not accept requests, or at least process them very slowly. See some statistics attached.

[screenshot]

Latency also explodes during such a period.

[screenshot]

How can we find out whether we are suffering from bad performance during cache invalidation?

We run a six-node cluster using Cassandra as the datastore. We cannot see any problems at the database level. As per the documentation (at least this is how I understand it), only the particular cache key is invalidated when an entity is updated, not the whole cache. Why do I face issues on all services/routes when changing only one particular entity?

We use version 2.2.1. I don't have any clue how to identify the root cause :frowning:

Thanks in advance Henning

bungle commented 3 years ago

@23henne do you have worker_consistency set to strict or eventual? https://github.com/Kong/kong/blob/master/kong.conf.default#L1309

23henne commented 3 years ago

Good catch, it defaults to strict. So we will try eventual to improve performance. This sounds very promising!!!

bungle commented 3 years ago

@23henne, yes, let's see if we can drop strict in 3.x. Let us know if that helped.

23henne commented 3 years ago

@bungle Unfortunately, it neither fixed nor mitigated the issue :(

Questions

Thanks Henning

dndx commented 3 years ago

Hello @23henne.

23henne commented 3 years ago

Thanks @dndx! We really don't have a clue. We cannot see any bottleneck at the DB level. One question, as we use version 2.2.1: I can see a PR concerning worker consistency coming with 2.3.3. Might this be exactly our problem?

locao commented 3 years ago

Hi @23henne!

#6833 is related to upstream entities; it should not matter if you are not patching an upstream. Anyway, we recommend upgrading to the most recent version.

23henne commented 3 years ago

Hello @locao! We upgraded to 2.4.1 last week. Let's say 'perceived performance' is better: we can fire 10 PATCH requests in a row, the first ones are very fast (<300 ms), while later requests (6, 7, 8) might take longer (up to 9 seconds). We will add response times to the admin API access logs to have reliable values. During our tests, latencies for customer requests were also not as high as with the former version, but maybe that's just coincidence.

Coming back to your questions:

Do you see any [error] messages in the error.log? Yes, I can see timeouts during health checks on upstream targets and DNS resolution failures.

2021/07/06 10:57:36 [error] 67872#0: *9959743 [lua] healthcheck.lua:1104: log(): [healthcheck] (dcbc966e-8f84-47ec-8b85-92a157f5ee98:ws_xxx) failed to receive status line from 'x.x.x.x (x.x.x.x:xxxxx)': timeout, context: ngx.timer, 67872#0: *9959743 [lua] healthcheck.lua:1104: log(): [healthcheck] (dcbc966e-8f84-47ec-8b85-92a157f5ee98:ws_xxx) failed to receive status line from 'x.x.x.x (x.x.x.x:xxxxx)': timeout, context: ngx.timer

2021/07/06 10:57:40 [error] 200960#0: *9233671 [lua] balancer.lua:1061: execute(): DNS resolution failed: dns client error: 101 empty record received. Tried: ["(short)xxxxx:(na) - cache-miss","xxxxx:1 - cache-miss/in progress (sync)/try 1 error: failed to receive reply from UDP server xxxxx:53: timeout/scheduled/querying/try 2 error: timeout/dns lookup pool exceeded retries (1) ....

But we know such targets were available.

Can you share more details on your PATCH request? It's patching an existing route resource, changing response_buffering to false. That's it.
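
For reference, the change can be reproduced against the Admin API with a single call along these lines (the route name and Admin API address are placeholders, not our real values):

curl -i -X PATCH http://localhost:8001/routes/example-route \
  --data response_buffering=false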

What is the behavior during the low performance period? Do clients get connection refused errors, timeouts, 500s? I can see a high number of HTTP 408, 499, 502 and 503 responses.

To be honest, if I didn't know better, it would look to me like we are hitting some kind of TCP or resource limit. It is striking that we have huge CPU consumption during this time frame. I thought it was caused by update propagation, which is why playing with the worker consistency setting seemed so promising.

[screenshot]

It is really strange. Maybe it's worth mentioning that we handle 2k RPS across our 6 instances.

dndx commented 3 years ago

Hello @23henne ,

Could you check your DNS resolver's performance? If the DNS resolver is slow, it may cause new requests to be blocked on DNS resolution after the PATCH.
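
A quick way to sanity-check resolver latency from a Kong node might be the following (hostname and resolver address are placeholders):

dig upstream.example.internal @10.0.0.53 | grep "Query time"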

locao commented 3 years ago

Hi @23henne, could you try using target IP addresses instead of host names, so we can pinpoint if this is related to DNS or not?

Out of curiosity, unrelated to this issue: are you using HTTPS for the active health checks on the upstreams?

23henne commented 3 years ago

@locao, are we talking about upstreams? We don't use hostnames in there! But we also have services whose targets do not refer to Kong upstreams but to some other external destination addressed by a domain name.

We use active health checks, but over HTTP, not HTTPS.

This is what happens if I perform 3000 requests like this

[26/Jul/2021:05:07:38 +0200] "PUT /services/pcc-proxy/routes/route-xxx-1714 HTTP/1.1" 200 572 "-" "Java-SDK"
[26/Jul/2021:05:07:38 +0200] "GET /routes/3cac7159-aa31-4955-a866-f9e5basd7ee/plugins HTTP/1.1" 200 23 "-" "Java-SDK"
[26/Jul/2021:05:07:38 +0200] "POST /routes/3cac7159-aa31-4955-a866-f9e5basd7ee/plugins HTTP/1.1" 201 477 "-" "Java-SDK"

via the Kong Admin API within 50 minutes.

[screenshot]

[screenshot]

And this applies to all cluster nodes. I have no clue.

23henne commented 3 years ago

Same thing this morning: we migrated three services to new targets (using Kong decK in this case).

[27/Jul/2021:07:52:25 +0200] "PUT /services/afef86df-fc60-4315-be88-87a0eb71ec6f HTTP/1.1" 200 408 "-" "Go-http-client/1.1"
[27/Jul/2021:07:52:25 +0200] "PUT /services/deb08b99-613e-4778-8c28-77b33674f87d HTTP/1.1" 200 406 "-" "Go-http-client/1.1"
[27/Jul/2021:07:52:25 +0200] "PUT /services/9323fada-fde3-4e9f-bb7a-0d358739b72b HTTP/1.1" 200 428 "-" "Go-http-client/1.1"
[27/Jul/2021:07:52:25 +0200] "PUT /services/9d896f92-4c14-46b6-bf86-66697187c772 HTTP/1.1" 200 415 "-" "Go-http-client/1.1"
[27/Jul/2021:07:52:25 +0200] "PUT /plugins/3da583e2-c5ea-4a0f-9842-15d6c1a26352 HTTP/1.1" 200 248 "-" "Go-http-client/1.1"

Latency explodes; request time per route increases to nearly one minute (for all services, not only the ones I touched before). [screenshot]

CPU maxes out on all nodes of the cluster. [screenshot]

dndx commented 3 years ago

Hello @23henne,

Could you share with us the debug-level logs from Kong when the latency spike happens?

23henne commented 3 years ago

Unfortunately, we are not able to reproduce it on one of our preprod environments. We enabled debug logging on one of our, let's call it, 'slave' machines (which are not used for admin API communication) while the problem occurred. As it is production data, it will take some time to review and anonymize.

What we did see while facing the problem is that the nginx worker processes were completely busy (i.e. 100% CPU usage). [screenshot]

Even though four worker processes should be enough, we tried increasing to 16 (we have 28 cores available on this machine). As a consequence, 16 processes were busy instead of four 😄 I would really like to know what they were doing at that moment.

dndx commented 3 years ago

Hello @23henne,

Yeah, please share the debug logs with us once they are ready. If you don't feel comfortable uploading them to GitHub, you can also email the logs to me at:

datong.sun-konghq.com (replace - with @)

Just mention the issue # in the email.

Also, your system CPU usage isn't that high (84% idle); I wonder if increasing the number of Nginx workers on the machine could help reduce the CPU usage of individual workers.

23henne commented 3 years ago

Hello @dndx,

Thank you, hopefully we can provide the log files shortly.

As I said, we already increased the worker processes to 16 per node, and in that case too all of them were more or less busy. Don't you think that should have been enough? Shame on me, but I didn't run an strace during the issue; maybe it would have helped.

While going through the documentation we found the following paragraph concerning the mem_cache_size parameter.

We recommend defining the mem_cache_size configuration as large as possible, while still providing adequate resources to the operating system and any other processes running adjacent to Kong. This configuration allows Kong to take maximum advantage of in-memory cache, and reduce the number of trips to the database. Additionally, each Kong worker process maintains its own memory allocations, and must be accounted for when provisioning memory. By default one worker process runs per number of available CPU cores. In general, we recommend allowing for ~ 500MB of memory allocated per worker process. Thus, on a machine with 4 CPU cores and 8 GB of RAM available, we recommend allocating between 4-6 GB to cache via the mem_cache_size directive, depending on what other processes are running alongside Kong.

To be honest, this confuses me. It defaults to 128m, but you recommend ~500 MB per worker or as large as possible.

Maybe you can help us find a proper sizing for our demands.
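
Reading the quoted paragraph literally, the arithmetic would look roughly like this; the numbers below are purely illustrative and are not taken from our actual machines:

# Illustrative machine: 16 workers, 32 GB RAM (assumed figures)
#   ~500 MB Lua VM memory per worker    -> 16 * 0.5 GB = 8 GB
#   headroom for OS and other processes -> e.g. 4 GB
#   remainder available for the shared in-memory cache -> ~20 GB
mem_cache_size = 20480m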

This is what our current configuration looks like. We were also wondering whether it would be worth playing with the update frequency parameters.

prefix = /abc/apps/kong-pu/

log_level = warn

proxy_access_log = /abc/logs/kong-pu/access.log kong_proxy if=$keeplog
proxy_error_log = /abc/logs/kong-pu/error.log

admin_access_log = /abc/logs/kong-pu/admin_access.log
admin_error_log = /abc/logs/kong-pu/error.log

plugins = bundled,kong-plugin-redirect,kong-page-mgmt-redirect,error-transformer-handler
anonymous_reports = off
proxy_listen = 10.0.0.11:10480 reuseport backlog=16384, [fddf:a139:1004:500:11:1111:1111:1111]:10480 reuseport backlog=16384, 10.0.0.11:10443 reuseport backlog=16384, [fddf:a139:1004:500:11:1111:1111:1111]:10443 reuseport backlog=16384
admin_listen = 10.0.0.11:18001 reuseport backlog=16384, [fddf:a139:1004:500:11:1111:1111:1111]:18001 reuseport backlog=16384, 10.0.0.11:18444 http2 ssl reuseport backlog=16384, [fddf:a139:1004:500:11:1111:1111:1111]:18444 http2 ssl reuseport backlog=16384
status_listen = 10.0.0.11:10481, [fddf:a139:1004:500:11:1111:1111:1111]:10481
nginx_worker_processes = 16
nginx_events_worker_connections = 1024
nginx_main_worker_rlimit_nofile = 32768
headers = off
nginx_admin_include = /abc/conf/kong-pu/nginx-admin.conf
nginx_http_lua_shared_dict = prometheus_metrics 20m
database = cassandra
cassandra_contact_points = 10.0.0.83, 10.0.0.84, 10.0.0.85, 10.0.0.86
cassandra_port = 9042
cassandra_keyspace = kong_pu
cassandra_username = kong_pu
cassandra_password = topsecret1234
db_update_frequency = 5
db_update_propagation = 2
worker_consistency = eventual
lua_package_path = ./?.lua;./?/init.lua;/pkg/moip/env/kong/plugins/kong-plugin-redirect/?.lua;/pkg/moip/env/kong/plugins/kong-page-mgmt-redirect/?.lua;/pkg/moip/env/kong/plugins/kong-plugin-error-transformer-handler/?.lua;;

Thank you so much Henning

bungle commented 3 years ago

As I said we already increased worker processes to 16 per node and also in this case all were more or less busy.

It is most probably a router rebuild that is happening there. The router is rebuilt on each worker when you change settings (more workers means more work to be done). If you change things in a batch, e.g. multiple changes, the router may be rebuilt multiple times. With eventual consistency the rebuilding happens in the background, but it is still work that has to be done.
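
If the rebuilds are triggered by a batch of changes, the relevant knobs would be something like the following; the frequency value is illustrative, and worker_state_update_frequency only takes effect with eventual consistency:

worker_consistency = eventual
# with eventual consistency, check for pending rebuilds at most once per interval (seconds)
worker_state_update_frequency = 10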

23henne commented 3 years ago

Sure, but it should never block the complete environment for several minutes.

23henne commented 3 years ago

We set mem_cache_size = 8192m but still faced an outage on production today.

This is what was issued via the Admin API.

[11/Aug/2021:08:00:13 +0200] "PUT /services/some-service/routes/some-route-2786 HTTP/1.1" 200 599 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "GET /routes/1de06a5d-3779-4208-91a7-7bce524e86c7/plugins HTTP/1.1" 200 587 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /routes/1de06a5d-3779-4208-91a7-7bce524e86c7/plugins/b726245f-66d4-46da-9f68-d567b8a7e6cf HTTP/1.1" 200 564 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /services/some-service/routes/some-route-2787 HTTP/1.1" 200 599 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "GET /routes/d9e0fe1d-0e33-4276-8cc3-89c7feb6d210/plugins HTTP/1.1" 200 587 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /routes/d9e0fe1d-0e33-4276-8cc3-89c7feb6d210/plugins/e5c84ac7-5bd2-4626-8420-4e7ee2b7a8c4 HTTP/1.1" 200 564 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /services/some-service/routes/some-route-2790 HTTP/1.1" 200 601 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "GET /routes/e1477a2e-6279-4130-bffb-d01326c869bc/plugins HTTP/1.1" 200 589 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /routes/e1477a2e-6279-4130-bffb-d01326c869bc/plugins/08174b71-13fb-4d5b-a5c3-47fca35a7683 HTTP/1.1" 200 566 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /services/some-service/routes/some-route-2788 HTTP/1.1" 200 599 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "GET /routes/f80dcada-4bb6-4e06-a3dd-9f50cf7274ad/plugins HTTP/1.1" 200 587 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /routes/f80dcada-4bb6-4e06-a3dd-9f50cf7274ad/plugins/b1e0e5b2-47ac-4155-b948-1ed80fbbbf4b HTTP/1.1" 200 564 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /services/some-service/routes/some-route-2789 HTTP/1.1" 200 600 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "GET /routes/55915ac9-c788-4ba2-a5d6-6a44f68cbc2a/plugins HTTP/1.1" 200 588 "-" "Java-SDK"
[11/Aug/2021:08:00:13 +0200] "PUT /routes/55915ac9-c788-4ba2-a5d6-6a44f68cbc2a/plugins/6965553d-d492-4deb-9ffe-b0cc303d4fb2 HTTP/1.1" 200 565 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "PUT /services/some-service/routes/some-route-2791 HTTP/1.1" 200 602 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "GET /routes/5612edf4-9554-4da2-95bf-862a944fb1a5/plugins HTTP/1.1" 200 590 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "PUT /routes/5612edf4-9554-4da2-95bf-862a944fb1a5/plugins/3647df43-123f-4bf6-98cd-5d49380c77e7 HTTP/1.1" 200 567 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "PUT /services/some-service/routes/some-route-3301 HTTP/1.1" 200 519 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "GET /routes/d2dcaf9f-2fae-4492-b49d-4af7a5af0085/plugins HTTP/1.1" 200 23 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "POST /routes/d2dcaf9f-2fae-4492-b49d-4af7a5af0085/plugins HTTP/1.1" 201 445 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "PUT /services/some-service/routes/some-route-299 HTTP/1.1" 200 510 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "GET /routes/26061cf8-79e2-44b5-8af1-7aed340841fb/plugins HTTP/1.1" 200 23 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "POST /routes/26061cf8-79e2-44b5-8af1-7aed340841fb/plugins HTTP/1.1" 201 397 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "GET /routes?tags=some-tag HTTP/1.1" 200 54062 "-" "Java-SDK"
[11/Aug/2021:08:00:14 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQwYjQ2MzIzNS00N2E1LTQwNWItYTU1NC02ZGFlOTM1NDk3NmTwf%2F%2F%2Fm%2FB%2F%2F%2F%2Bb&tags=some-tag HTTP/1.1" 200 53703 "-" "Java-SDK"
[11/Aug/2021:08:00:15 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQxNGU2MGY0My0yZmQwLTQ3OWUtOTQ3Ni03MDdhZDE4OWI1Y2Pwf%2F%2F%2FN%2FB%2F%2F%2F83&tags=some-tag HTTP/1.1" 200 53918 "-" "Java-SDK"
[11/Aug/2021:08:00:15 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQxZTE0Y2NhMi00MmIyLTQyMjEtOWU2Ni1mZGE4OTMxZWEwZGHwf%2F%2F%2B0%2FB%2F%2F%2F7T&tags=some-tag HTTP/1.1" 200 53908 "-" "Java-SDK"
[11/Aug/2021:08:00:15 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQyN2I2MDRjZi05MzA4LTQ4ODQtOGE4Zi01YjU4ZDc2NDE4Nzbwf%2F%2F%2Bb%2FB%2F%2F%2F5v&tags=some-tag HTTP/1.1" 200 54346 "-" "Java-SDK"
[11/Aug/2021:08:00:16 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQyZjc5ZTA4Zi03MTM2LTQ2OTItYjg1Yi05NDk1NTBhNzc1OWTwf%2F%2F%2BC%2FB%2F%2F%2F4L&tags=some-tag HTTP/1.1" 200 53566 "-" "Java-SDK"
[11/Aug/2021:08:00:16 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQzOTJmZDE0OC0zMDFjLTQ2YmItODYwNi04NmEzMmM2N2UzOTnwf%2F%2F9p%2FB%2F%2F%2F2n&tags=some-tag HTTP/1.1" 200 53576 "-" "Java-SDK"
[11/Aug/2021:08:00:17 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ0MjEwNzhhOC02OTMwLTQ0ZGEtYmMzYy1hMjdhMzk2MzY4ZmXwf%2F%2F9Q%2FB%2F%2F%2F1D&tags=some-tag HTTP/1.1" 200 53731 "-" "Java-SDK"
[11/Aug/2021:08:00:17 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ0YWNjOGVjMi1jMmFiLTQ5M2MtOTU4NC03NDUzZjFhOWE0ZmHwf%2F%2F83%2FB%2F%2F%2Fzf&tags=some-tag HTTP/1.1" 200 53932 "-" "Java-SDK"
[11/Aug/2021:08:00:17 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ1NDE3OWZlMC01MDFjLTQzNWMtODIwZC00NGE4ZDM3NzQxM2bwf%2F%2F8e%2FB%2F%2F%2Fx7&tags=some-tag HTTP/1.1" 200 53911 "-" "Java-SDK"
[11/Aug/2021:08:00:18 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ1YzliNjU2Ny1mNjU2LTQ2MmYtOGQ3Mi00ZGU1ZGI1ODFmNTbwf%2F%2F8F%2FB%2F%2F%2FwX&tags=some-tag HTTP/1.1" 200 53984 "-" "Java-SDK"
[11/Aug/2021:08:00:18 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ2NjY0NTI0Ni1hNWQzLTRlNWYtOTRiMi1kMjI2Y2Q0ZTFlMTTwf%2F%2F7s%2FB%2F%2F%2Fuz&tags=some-tag HTTP/1.1" 200 53745 "-" "Java-SDK"
[11/Aug/2021:08:00:18 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ2ZjQ0ZTgwNC1lZTdjLTRlNTEtODdlMS1lYzAxMmFjOWRmMmbwf%2F%2F7T%2FB%2F%2F%2FtP&tags=some-tag HTTP/1.1" 200 53849 "-" "Java-SDK"
[11/Aug/2021:08:00:19 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ3N2JiNGQ4OS1iMDZkLTRiZTktOWY1MS01NmIzYzU1Y2MyNTbwf%2F%2F66%2FB%2F%2F%2Frr&tags=some-tag HTTP/1.1" 200 53931 "-" "Java-SDK"
[11/Aug/2021:08:00:19 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ4MGMyZDEwYS0zN2M4LTRlNGMtYWI3MC05NDgxYjJmMTY4N2Lwf%2F%2F6h%2FB%2F%2F%2FqH&tags=some-tag HTTP/1.1" 200 53852 "-" "Java-SDK"
[11/Aug/2021:08:00:20 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ4OTQ1MzQyMS05YThmLTQwNDEtOTBlOS05MGM1YTEyMDhlYWHwf%2F%2F6I%2FB%2F%2F%2Foj&tags=some-tag HTTP/1.1" 200 53534 "-" "Java-SDK"
[11/Aug/2021:08:00:20 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ5MzcxMTExZC01YTczLTRkOTUtYTQ1ZC0wNTE4NzkxMWMzYTLwf%2F%2F5v%2FB%2F%2F%2Fm%2F&tags=some-tag HTTP/1.1" 200 53720 "-" "Java-SDK"
[11/Aug/2021:08:00:20 +0200] "GET /routes?offset=AC0ABnJvdXRlcyQ5ZDUzNTdkZi1jNzMyLTQ5NDYtYmYwOC04MmQ1N2E0ODE2OTHwf%2F%2F5W%2FB%2F%2F%2Flb&tags=some-tag HTTP/1.1" 200 54009 "-" "Java-SDK"
[11/Aug/2021:08:00:20 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRhNTkzYWE5OC05N2MwLTQ5NWEtYWU0My05MjliZjg3MjAyNjHwf%2F%2F49%2FB%2F%2F%2Fj3&tags=some-tag HTTP/1.1" 200 53762 "-" "Java-SDK"
[11/Aug/2021:08:00:21 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRhYzc5YTMwZS0wZGM2LTRmZTQtYWMzOC1mZjgyYzkwODRjYWTwf%2F%2F4k%2FB%2F%2F%2FiT&tags=some-tag HTTP/1.1" 200 54056 "-" "Java-SDK"
[11/Aug/2021:08:00:21 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRiNTYyY2E4YS0zMzU3LTRkMjktODFmZS0wODZhYTEyZTFiMzLwf%2F%2F4L%2FB%2F%2F%2Fgv&tags=some-tag HTTP/1.1" 200 54100 "-" "Java-SDK"
[11/Aug/2021:08:00:21 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRiZWEzNjI0OC0xZGZmLTQwZTgtYWRlNS02MGUzMDEwODc0NTbwf%2F%2F3y%2FB%2F%2F%2FfL&tags=some-tag HTTP/1.1" 200 54014 "-" "Java-SDK"
[11/Aug/2021:08:00:22 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRjNmU5ZDFiOS03ZjhkLTRlZGUtOTk4OS04MjdhODg0ZmI0NTHwf%2F%2F3Z%2FB%2F%2F%2Fdn&tags=some-tag HTTP/1.1" 200 54198 "-" "Java-SDK"
[11/Aug/2021:08:00:23 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRkMDA3YTNkYy0xODBmLTRiZTYtOGRkZC1lNjdiY2M1ZWFmYTfwf%2F%2F3A%2FB%2F%2F%2FcD&tags=some-tag HTTP/1.1" 200 53993 "-" "Java-SDK"
[11/Aug/2021:08:00:23 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRkOGJkMWEzYy02OTBmLTQyZWEtYmE3OC01YzcwZTUxMTNkMTDwf%2F%2F2n%2FB%2F%2F%2Faf&tags=some-tag HTTP/1.1" 200 53795 "-" "Java-SDK"
[11/Aug/2021:08:00:24 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRkZmUzODk4Mi02ZDhkLTQ0NTUtOTBjZi03NjAzYTI5ODI4Njjwf%2F%2F2O%2FB%2F%2F%2FY7&tags=some-tag HTTP/1.1" 200 53851 "-" "Java-SDK"
[11/Aug/2021:08:00:24 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRlYTcwYzdlZS1lNDI2LTQ5ZDctYWRjNC03NjhhZTVkNGM3Y2Pwf%2F%2F11%2FB%2F%2F%2FXX&tags=some-tag HTTP/1.1" 200 53985 "-" "Java-SDK"
[11/Aug/2021:08:00:24 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRmMzY0ZDA3OC04ZGYwLTRmNzMtYjI3NS1iZDg5NTY5MmU0MWbwf%2F%2F1c%2FB%2F%2F%2FVz&tags=some-tag HTTP/1.1" 200 53999 "-" "Java-SDK"
[11/Aug/2021:08:00:25 +0200] "GET /routes?offset=AC0ABnJvdXRlcyRmYjk5YjQ1Zi04MDI5LTQ1ZWItOGUzZi1jZTAwZjcxMTcyYzjwf%2F%2F1D%2FB%2F%2F%2FUP&tags=some-tag HTTP/1.1" 200 28059 "-" "Java-SDK"
[11/Aug/2021:08:00:27 +0200] "DELETE /routes/4d34120f-237c-4a39-8d89-0adc6fcd7efa HTTP/1.1" 204 0 "-" "Java-SDK"
[11/Aug/2021:08:00:30 +0200] "DELETE /routes/8b986421-1f30-4614-86f0-dd103d27073a HTTP/1.1" 204 0 "-" "Java-SDK"
[11/Aug/2021:08:00:30 +0200] "DELETE /routes/8bd636fa-4826-429f-bcc0-e11815683c8a HTTP/1.1" 204 0 "-" "Java-SDK"
23henne commented 3 years ago

I think it's normal behaviour to see thousands of entries like the ones below while running strace on an nginx worker process, isn't it?

accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x2101a6f3, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x2101a6f3, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x2101a6f3, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x2101a6f3, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x2100c5e3, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x1dbf5553, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x1c5be343, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x4bf9313, 16709)             = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x44420463, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x1fc7ed33, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
read(149, 0x4c883783, 16709)            = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(26, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
writev(49, [{iov_base="igation__phoe20{position:relativ"..., iov_len=512}, {iov_base="r.navigation--scrolled .navigati"..., iov_len=4096}, {iov_base="vigation__1st-level-subnav:after"..., iov_len=4096}, {iov_base="media (max-width:63.9375em){.nav"..., iov_len=4096}, {iov_base="eded;padding:0;margin:11px 66px "..., iov_len=4096}, {iov_base=".firefox .navigation__switch-bar"..., iov_len=2458}], 6) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(56, 0x7fff1b8bcd10, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
dndx commented 3 years ago

@23henne because Nginx uses non-blocking I/O, EAGAIN on accept is actually normal. It just means the server is busy handling a lot of connections.

One way to rule out the router rebuild as the culprit is to add debug logs that print timestamps around the calls to https://github.com/Kong/kong/blob/master/kong/runloop/handler.lua#L671. If the router rebuild is slow, whether due to a slow DB or a slow rebuild itself, it should be very obvious.
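
A minimal sketch of such instrumentation, assuming the rebuild happens via a build_router() call at the linked location (the exact function name and call site depend on the Kong version, so treat this as pseudocode to adapt):

-- Sketch only: wrap the existing build_router() call in handler.lua with timestamps
ngx.update_time()
local rebuild_start = ngx.now()

local ok, err = build_router(version)  -- the existing call at the linked line

ngx.update_time()
ngx.log(ngx.WARN, "router rebuild took ", ngx.now() - rebuild_start, "s",
        " (ok=", tostring(ok), ", err=", tostring(err), ")")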

If possible, you can also run the lj-lua-stacks tool from https://github.com/kong/stapxx#lj-lua-stacks on the offending instance, following its directions, to capture a CPU flame graph of the LuaJIT VM; it will also tell you where the CPU time is being spent.
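
From memory of the stapxx README, the invocation looks roughly like the following; the flags and PID are placeholders and should be checked against the README before use:

# Sample the LuaJIT VM of one busy nginx worker for a few seconds (PID is a placeholder)
./samples/lj-lua-stacks.sxx --arg time=5 --skip-badvars -x 12345 > a.bt
# Render the samples with Brendan Gregg's FlameGraph tools
./stackcollapse-stap.pl a.bt > a.cbt
./flamegraph.pl a.cbt > a.svg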

23henne commented 3 years ago

@dndx We will implement the logging enhancement by next week. Besides that, did you have a chance to take a look at the DEBUG logs I sent you via mail? What is also not clear to us: is there any kind of inter-node communication that bypasses the database?

marcinkierus commented 3 years ago

Finally, we were able to reproduce the problem in our test environments and find the root cause. It is related to the local cache in router.lua, and more precisely to the way it is built.

After an update or a Kong restart, this cache is built from scratch and each incoming request has to go through all route definitions to find the one that matches it. The matching algorithm evaluates path definitions that are regular expressions first, and the prefix definitions (without wildcards) after that.

So for each incoming URI that is not yet in the router's local cache and does not match any regular-expression definition, but does match a prefix definition, Kong has to evaluate all the regular expressions anyway, which costs a lot of CPU time.

If we have many regex path definitions (in our case more than 3000), this takes a lot of time. In combination with high incoming traffic it causes a rapid increase in CPU usage.

The situation is made worse by the fact that each nginx worker has a separate local cache to build.

An additional side effect we discovered is that sending many requests which generate 404 errors may lead to a DoS, because for each such request Kong has to go through all route definitions every time.
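
To illustrate the cost described above (a standalone sketch, not Kong's actual router code): any request that misses the router's local cache and is ultimately served by a prefix rule, or 404s, still pays for a full scan over every regex path first.

-- Standalone illustration of the O(n) regex scan; not Kong's router implementation.
-- Assumes an OpenResty environment (ngx.re is available).
local regex_paths = {}
for i = 1, 3000 do
  regex_paths[i] = [[^/service-]] .. i .. [[/items/\d+$]]
end

local function match_uri(uri)
  -- regex-defined paths are evaluated first, one by one
  for _, pattern in ipairs(regex_paths) do
    if ngx.re.match(uri, pattern, "jo") then
      return pattern
    end
  end
  -- only afterwards can a plain prefix definition match
  if uri:find("/products", 1, true) == 1 then
    return "/products (prefix)"
  end
  return nil  -- 404: the whole scan above was wasted work
end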

dndx commented 3 years ago

@marcinkierus Glad you were able to find the issue. Regex routes are in general very expensive, because they are hard to parallelize. Even software like Nginx runs them one by one, which is essentially an O(n) solution.

There are possibly some ways to combine multiple regexes into a few bigger automata that can be run simultaneously, but that would require a custom regex engine to work.

Out of curiosity, could you share with us what your 3000 regex routes look like? Are there any similarities between them that might enable a more efficient search algorithm (e.g. a radix tree)?

maro2289 commented 3 years ago

@dndx I'd like to use an artificial example to explain our use case, since our ruleset changes frequently and a one-time optimization might not be helpful in the long run:

There are multiple base paths we'd like to route to a service serving a specific part of our web portal, e.g. /products. This example service is generally responsible for serving HTML pages about products / product configurations. Now we are required to be able to route any specific path below /products to a different location (redirect) or another service if needed. Important part: it should be an exact match. Example: /products/redirect should be redirected, but /products/redirect/normal-service should be covered by our original rule for /products.

To achieve this we are currently forced to use regexes for exact matches, but if we define our high-traffic rules (like /products) as prefixes we observe the performance problems explained above by @marcinkierus, since those prefixes are processed at the end of the ruleset. To reduce the problem we currently define high-traffic routes that don't need exact matches as regexes with priority 1, e.g. /static(/|$).

In general the number of exact-match rules might grow, and an alternative, less CPU-hungry approach to exact matches would be helpful (see the sketch at the end of this comment).

(I hope I was able to clarify our general situation)
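
For illustration, the layout described above could be created via the Admin API roughly like this; the service name, route names, paths and Admin API address are examples, not our real configuration:

# Exact match forced via a regex anchor; higher regex_priority so it is evaluated first
curl -i -X POST http://localhost:8001/services/example-service/routes \
  --data name=products-redirect-exact \
  --data 'paths[]=/products/redirect$' \
  --data regex_priority=10

# Broad prefix rule for everything else under /products
curl -i -X POST http://localhost:8001/services/example-service/routes \
  --data name=products-prefix \
  --data 'paths[]=/products'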

maro2289 commented 3 years ago

@dndx After further investigation it turned out that our problem might be related to https://github.com/Kong/kong/issues/7482

First tests show that after increasing lua_regex_cache_max_entries to a value greater than the number of all our existing regex rules, latency and CPU usage look much better while updating entities. Since I consider this more of a workaround than a solution, I am wondering if this is a limitation of Kong in terms of regex usage. The default value of lua_regex_cache_max_entries is 1024, and our problems started around the time the number of rules exceeded this value.
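
For anyone else hitting this, the directive can be raised through Kong's Nginx directive injection in kong.conf; the value below is illustrative and should simply exceed the number of regex paths in use:

# injected into the http {} block as: lua_regex_cache_max_entries 4096;
nginx_http_lua_regex_cache_max_entries = 4096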

kanbo commented 2 years ago

@23henne Have you solved this problem? I have a similar one: we update kong.yml every second, and HTTP 503 errors appear during load tests. Kong version 2.8.1, DB-less mode.

Shiinii commented 2 years ago

@kanbo please see the latest reply from @maro2289 for a possible workaround for the described issue.

LL1024LL commented 1 year ago

We have a similar problem. After we edit a plugin config, the related route shows high latency on the first request. We fixed it with this related PR:

StarlightIbuki commented 1 year ago

@ADD-SP Could you take a look and update?

github-actions[bot] commented 11 months ago

This issue is marked as stale because it has been open for 14 days with no activity.

github-actions[bot] commented 11 months ago

Dear contributor,

We are automatically closing this issue because it has not seen any activity for three weeks. We're sorry that your issue could not be resolved. If any new information comes up that could help resolving it, please feel free to reopen it.

Your contribution is greatly appreciated!

Please have a look at our pledge to the community for more information.

Sincerely, Your Kong Gateway team