Discussed in https://github.com/Kong/kong/discussions/11709
@kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?
@hanshuebner Approx: Routes - 900, Services - 900, Consumers - 100, Plugins - 2900
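In case it helps others gather the same numbers, here is a rough way to count entities via the Admin API (a sketch: it assumes the Admin API is reachable on localhost:8001 and that jq is installed; listings are paginated, so counts above the requested page size get truncated):

```sh
# Rough entity counts from the Kong Admin API. Pagination is ignored, so
# anything beyond ?size=1000 per entity type will be undercounted.
for t in routes services consumers plugins; do
  echo -n "$t: "
  curl -s "http://localhost:8001/$t?size=1000" | jq '.data | length'
done
```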
@kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?
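One way to do that is to capture port 53 from inside the pod. A sketch, assuming tcpdump is available in the proxy container (deployment, namespace, pod and container names are placeholders), with an ephemeral debug container as a fallback:

```sh
# Capture DNS queries/answers leaving the Kong proxy pod.
kubectl exec -n kong deploy/kong-proxy -c proxy -- \
  tcpdump -ni any port 53 -c 200

# If tcpdump is not in the image, attach an ephemeral debug container that
# shares the pod's network namespace:
kubectl debug -n kong -it <kong-pod> --image=nicolaka/netshoot -- \
  tcpdump -ni any port 53
```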
@hanshuebner @nowNick - Yes, DNS traffic increased, and we tried each of the changes below, but none of them stopped the Kong proxy from going unresponsive (how settings like these are applied to the deployment is sketched after the list):
Changed ndots from 5 to 2
dns_valid_ttl: 53
dns_valid_ttl: 30, dns_no_sync: on
dns_valid_ttl: 53, dns_no_sync: on
Reverted dns_valid_ttl and dns_no_sync
dns_order: LAST,A,CNAME
Reverted dns_order
dns_cache_size: 100000
Reverted dns_cache_size
dns_stale_ttl: 120
Reverted dns_stale_ttl
dns_stale_ttl: 127
Reverted all DNS changes
upstream_keepalive_max_requests: 200, upstream_keepalive_pool_size: 120, upstream_keepalive_idle_timeout: 30, lua_socket_pool_size: 60
dns_not_found_ttl: 300
Reverted dns_not_found_ttl
dns_error_ttl: 30
Reverted dns_error_ttl
lua_socket_pool_size: 127
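For anyone following along, here is a minimal sketch of how settings like these are applied to a Kubernetes deployment of Kong: every kong.conf directive maps to an environment variable with a KONG_ prefix, while ndots is a pod-level DNS option (spec.dnsConfig.options) rather than a Kong setting. Deployment and namespace names below are placeholders:

```sh
# Each kong.conf directive becomes KONG_<DIRECTIVE> (uppercased) on the container.
kubectl set env -n kong deployment/kong-proxy \
  KONG_DNS_VALID_TTL=30 \
  KONG_DNS_NO_SYNC=on \
  KONG_DNS_ORDER="LAST,A,CNAME"

# Revert by removing the overrides (a trailing '-' unsets a variable):
kubectl set env -n kong deployment/kong-proxy \
  KONG_DNS_VALID_TTL- KONG_DNS_NO_SYNC- KONG_DNS_ORDER-
```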
@kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?
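To answer both questions from a capture, something like the following can tally the queried names (a sketch: it assumes tcpdump and timeout exist in the container, and it only matches A queries, so adjust the pattern for AAAA/SRV):

```sh
# Tally which names are queried and how often during a short capture window.
# tcpdump's text output format varies between versions, so treat as a sketch.
kubectl exec -n kong deploy/kong-proxy -c proxy -- \
  timeout 60 tcpdump -nl -c 5000 udp port 53 2>/dev/null \
  | sed -n 's/.* A? \([^ ]*\).*/\1/p' \
  | sort | uniq -c | sort -rn | head -20
```

A handful of names repeated very frequently usually points at TTL/caching behaviour, while a flood of distinct names (for example search-domain expansions caused by a high ndots value) points at how the names are being resolved.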
We are able to reproduce the issue in our non-prod environment as well, with a couple of service endpoints that use a mix of internal and external hostnames (FQDNs).
My feeling is that when one or a few of the problematic routes/services are invoked, the worker processes run out of timers or some other capacity, which makes them unresponsive: in that state the /status page, and every other service, does not respond for ~45 seconds. To test this theory we increased the liveness probe period to a very long value so that the kubelet would not restart the pod, and saw that after 4-5 minutes the worker processes recovered on their own. So can you please help me figure out whether I am running into some resource constraint, and what it could be?
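For reference, a minimal sketch of what that probe change can look like, using a strategic merge patch (deployment, namespace, and container names are assumptions about your install):

```sh
# Lengthen the liveness probe so the kubelet stops restarting the pod while
# you watch it recover on its own. Names below are placeholders.
kubectl patch deployment kong-proxy -n kong --patch '
spec:
  template:
    spec:
      containers:
      - name: proxy
        livenessProbe:
          periodSeconds: 300
          failureThreshold: 10
'
```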
Based on our Grafana charts and real-time top I don't see high CPU or memory usage - CPU maxes out at 500m and memory at 512Mi. Not sure why it doesn't grow beyond those values.
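Those flat ceilings are worth comparing against the container's resource requests/limits, since a container at its CPU limit is throttled rather than allowed to grow. A sketch (pod and namespace names are placeholders):

```sh
# Show configured requests/limits next to live usage. If the limits are
# 500m / 512Mi, the ceilings in the charts are just those limits being
# enforced (CPU throttling, memory capping).
kubectl describe pod <kong-pod> -n kong | grep -A 3 -E 'Limits|Requests'
kubectl top pod <kong-pod> -n kong --containers
```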
The DNS traffic and the traffic hitting the Kong proxy look normal, e.g.:
Kong-Nginx timers - I see a lot of timers in the pending state.
Where do you see those?
In the Nginx timers panel of our Grafana dashboard.
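If you want to read those counters without Grafana, they should also be available from Kong's Prometheus metrics endpoint. A sketch, assuming the status listener is on port 8100 (the Helm chart's usual default), the prometheus plugin is enabled, and the metric is named kong_nginx_timers as in recent Kong versions:

```sh
# Forward the status port locally and grep the timer gauges.
# Port 8100, the /metrics path, and the metric name are assumptions;
# adjust them to your status_listen configuration.
kubectl port-forward -n kong deploy/kong-proxy 8100:8100 >/dev/null &
sleep 2
curl -s http://localhost:8100/metrics | grep -i nginx_timers
```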
When Kong hangs, do you see a lot of open network connections?
Yeah, about 200 connections in TIME_WAIT.
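For comparison, a quick way to break the connections down by TCP state (a sketch; assumes ss is present in the proxy image, with the same placeholder names as above):

```sh
# Count sockets per TCP state inside the proxy container.
kubectl exec -n kong deploy/kong-proxy -c proxy -- ss -tan \
  | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
```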
I see HTTP latency of about 100s for a few calls, and sometimes 100% CPU usage for the Kong proxy container alone. I am able to reproduce the same issue on version 3.2.2 as well. Any help on how I can debug this further?
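When the container pins a CPU, one first check is whether a single nginx worker process is spinning (a sketch; assumes top is available in the image and the usual placeholder names):

```sh
# Per-process CPU inside the proxy container. One nginx worker stuck at ~100%
# usually points at a busy-looping request handler (e.g. a plugin or a retry
# loop) rather than overall traffic load.
kubectl exec -n kong deploy/kong-proxy -c proxy -- top -b -n 1 \
  | grep -E 'PID|nginx'
```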
We found the issue: timeouts on one of the external APIs, combined with our custom plugins, caused the worker processes to go to 100% CPU. Updating those fixed our issue.
@kranthirachakonda Can you please share how you identified the root cause? I think I'm facing the same issue here. I'm new to Kong, so it would be great if you could share some tricks for identifying it.