Kong / kong

🦍 The Cloud-Native API Gateway and AI Gateway.

Kong Gateway API with 104 Connection reset by peer message to OpenShift IngressController #13521

Closed · jrosal06 closed this 1 week ago

jrosal06 commented 3 weeks ago

We have installed Kong Gateway API in an OpenShift cluster. After the upgrade from OpenShift 4.12 (Kubernetes 1.25) to OpenShift 4.14 (Kubernetes 1.27) that was carried out on this cluster, we observe the following errors being logged:

2024/07/18 18:14:31 [info] 1634#0: *14053057 client 10.127.4.2 closed keepalive connection (104: Connection reset by peer)

2024/07/21 18:10:48 [info] 2606#0: *38884119 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too (104: Connection reset by peer) while connecting to upstream, client: 10.127.4.2, server: kong, request: "GET /provisioning-services/api/health HTTP/1.1", upstream: "http://10.123.135.29:1084/provisioning-services/api/health", host: "xxxxxxxxx"

As part of that upgrade, the ingress controller was also updated from HAProxy 2.4 to HAProxy 2.6.

The confusing part is that in the 4.14 cluster we do not see a FIN, ACK TCP packet before the connection is closed (RST), whereas we do in the 4.12 cluster. The application is the same version (Kong 2.5) in both clusters, but we only see this atypical behavior in the 4.14 cluster. There must be some change at the HAProxy level that we are not identifying. It is essential for us to understand why the behavior differs between OpenShift versions.

Before (OpenShift 4.12, HAProxy 2.4):

"No.","Time","Source","Destination","Protocol","Length","Info"
"516","11:38:21,386258","10.126.2.2","10.125.7.54","TCP","80","58248 > 8080 [SYN] Seq=4129447575 Win=26580 Len=0 MSS=8860 SACK_PERM TSval=3126154924 TSecr=0 WS=128"
"517","11:38:21,386510","10.125.7.54","10.126.2.2","TCP","80","8080 > 58248 [SYN, ACK] Seq=3249228591 Ack=4129447576 Win=26544 Len=0 MSS=8860 SACK_PERM TSval=1266477712 TSecr=3126154924 WS=128"
"518","11:38:21,386768","10.126.2.2","10.125.7.54","TCP","72","58248 > 8080 [ACK] Seq=4129447576 Ack=3249228592 Win=26624 Len=0 TSval=3126154924 TSecr=1266477712"
"519","11:38:21,386813","10.126.2.2","10.125.7.54","HTTP","387","GET /provisioning-services/api/health HTTP/1.1 "
"520","11:38:21,386849","10.125.7.54","10.126.2.2","TCP","72","8080 > 58248 [ACK] Seq=3249228592 Ack=4129447891 Win=27648 Len=0 TSval=1266477712 TSecr=3126154924"
"736","11:38:22,455572","10.125.7.54","10.126.2.2","HTTP/JSON","637","HTTP/1.1 200 OK , JSON (application/json)"
"737","11:38:22,455873","10.126.2.2","10.125.7.54","TCP","72","58248 > 8080 [ACK] Seq=4129447891 Ack=3249229157 Win=27776 Len=0 TSval=3126155993 TSecr=1266478781"
"738","11:38:22,456098","10.126.2.2","10.125.7.54","TCP","72","58248 > 8080 [FIN, ACK] Seq=4129447891 Ack=3249229157 Win=27776 Len=0 TSval=3126155994 TSecr=1266478781"
"739","11:38:22,456144","10.125.7.54","10.126.2.2","TCP","72","8080 > 58248 [FIN, ACK] Seq=3249229157 Ack=4129447892 Win=27648 Len=0 TSval=1266478781 TSecr=3126155994"
"740","11:38:22,456266","10.126.2.2","10.125.7.54","TCP","72","58248 > 8080 [RST, ACK] Seq=4129447892 Ack=3249229157 Win=27776 Len=0 TSval=3126155994 TSecr=1266478781"
"741","11:38:22,456569","10.126.2.2","10.125.7.54","TCP","60","58248 > 8080 [RST] Seq=4129447892 Win=0 Len=0"

After (OpenShift 4.14, HAProxy 2.6):

"No.","Time","Source","Destination","Protocol","Length","Info"
"607","11:40:43,122819","10.126.2.2","10.124.7.57","TCP","80","47892 > 8080 [SYN] Seq=2905734366 Win=62020 Len=0 MSS=8860 SACK_PERM TSval=4260484801 TSecr=0 WS=128"
"608","11:40:43,122852","10.124.7.57","10.126.2.2","TCP","80","8080 > 47892 [SYN, ACK] Seq=2819640922 Ack=2905734367 Win=61936 Len=0 MSS=8860 SACK_PERM TSval=679465162 TSecr=4260484801 WS=128"
"609","11:40:43,123024","10.126.2.2","10.124.7.57","TCP","72","47892 > 8080 [ACK] Seq=2905734367 Ack=2819640923 Win=62080 Len=0 TSval=4260484801 TSecr=679465162"
"610","11:40:43,123128","10.126.2.2","10.124.7.57","HTTP","387","GET /provisioning-services/api/health HTTP/1.1 "
"611","11:40:43,123136","10.124.7.57","10.126.2.2","TCP","72","8080 > 47892 [ACK] Seq=2819640923 Ack=2905734682 Win=61696 Len=0 TSval=679465162 TSecr=4260484801"
"857","11:40:44,173392","10.124.7.57","10.126.2.2","HTTP/JSON","637","HTTP/1.1 200 OK , JSON (application/json)"
"858","11:40:44,173589","10.126.2.2","10.124.7.57","TCP","72","47892 > 8080 [ACK] Seq=2905734682 Ack=2819641488 Win=61568 Len=0 TSval=4260485852 TSecr=679466213"
"966","11:40:48,092830","10.126.2.2","10.124.7.57","TCP","72","47892 > 8080 [RST, ACK] Seq=2905734682 Ack=2819641488 Win=61568 Len=0 TSval=4260489771 TSecr=679466213"

chobits commented 3 weeks ago

Hi @jrosal06. Off topic, but before posting raw tcpdump output I think you need to explain it in some detail. I found your capture quite hard to follow:

  1. The data stream isn't formatted, which makes it straining to read.
  2. It is hard to tell which IP is the client and which is the server.
  3. The complicated FIN/FIN/RST/RST sequence in the "before" capture is not easy to follow for readers who are not TCP/IP experts.

Instead of just posting the raw tcpdump data, could you provide some explanation of the key entries? For example, point out a specific RST packet and explain that it should have been a FIN, but that one side (possibly the client or the server) sent an RST first instead.


From your description and the last entry of the "after" capture, "966","11:40:48,092830","10.126.2.2","10.124.7.57","TCP","72","47892 > 8080 [RST, ACK] Seq=2905734682 Ack=2819641488 Win=61568 Len=0 TSval=4260489771 TSecr=679466213",

I guess your question is why the client (10.126.2.2, HAProxy) actively sends an RST to Kong.

From this information alone we can't figure out the cause of the phenomenon; Kong's error log only confirms that it is happening. So I think we should look into the differences between the two HAProxy versions. In the Kong community we can probably only offer guesses, since the issue seems to be related to the load-balancer layer (HAProxy). From the Kong side we can't determine the cause; we can only confirm that the problem is occurring.
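
To make the FIN-versus-RST distinction concrete, here is a minimal Python sketch (unrelated to Kong or HAProxy themselves; the host and port are just placeholders taken from your capture) of the two ways a client can close a TCP connection. The second variant, an abortive close via SO_LINGER, is what the peer reports as `104: Connection reset by peer`.

```python
import socket
import struct

# Placeholder peer taken from the capture above; any reachable TCP server will do.
HOST, PORT = "10.124.7.57", 8080

def close_gracefully():
    """Plain close(): the kernel performs an orderly shutdown and sends FIN."""
    s = socket.create_connection((HOST, PORT))
    s.close()  # peer sees FIN, followed by the usual FIN/ACK exchange

def close_with_rst():
    """Abortive close: SO_LINGER with l_onoff=1, l_linger=0 makes close() send RST."""
    s = socket.create_connection((HOST, PORT))
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    s.close()  # peer sees RST and logs "104: Connection reset by peer"
```

In your "after" capture the client behaves like the second variant: after acknowledging the response, the next packet it sends is an RST, with no FIN at all.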

jrosal06 commented 3 weeks ago

Hello, sorry for the inconvenience.

Basically, 10.126.2.2 (the node where HAProxy runs) is not sending a FIN packet to 10.124.7.57 (the Kong API Gateway pod in OpenShift). We interpret the absence of the FIN packet as the cause of the (104: Connection reset by peer) message: the connection is set up normally with the TCP 3-way handshake, but the application then only ever receives an RST packet when it is torn down.

All of this is in the same OpenShift cluster, and it has been happening since we upgraded the cluster from version 4.12 (Kubernetes 1.25, HAProxy 2.4) to version 4.14 (Kubernetes 1.27, HAProxy 2.6). The behavior only occurs in OpenShift 4.14; in 4.12 the TCP connection is closed normally (FIN/FIN/RST/RST).

We would like your help to find out whether, on Kubernetes clusters of version 1.27 or later (OpenShift 4.14), some parameter should be adjusted in Kong to mitigate and clear this alarm (104: Connection reset by peer).

Thank you.

chobits commented 3 weeks ago

> We would like your help to find out whether, on Kubernetes clusters of version 1.27 or later (OpenShift 4.14), some parameter should be adjusted in Kong to mitigate and clear this alarm (104: Connection reset by peer).

The `104: Connection reset by peer` message is caused by client behavior and might not be resolvable from the Kong side.

Kong logs this at the info level, but if you want to reduce the occurrence of this log entry, you can raise Kong's log level to warn or higher. This will suppress such informational messages.
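
Concretely, the level is controlled by the `log_level` directive in `kong.conf` (or the `KONG_LOG_LEVEL` environment variable); at `warn` and above these info-level entries are no longer written. If instead you want to keep them and simply quantify how often the resets happen, a rough sketch such as the following counts them per hour. The log path below is the default Kong prefix location and is an assumption; adjust it for your deployment.

```python
import re
from collections import Counter

# Assumed default Kong prefix error log; adjust if you log elsewhere (e.g. to stderr).
LOG_PATH = "/usr/local/kong/logs/error.log"

# Matches lines such as:
# 2024/07/18 18:14:31 [info] ... (104: Connection reset by peer)
PATTERN = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}):\d{2}:\d{2} \[\w+\].*104: Connection reset by peer"
)

per_hour = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = PATTERN.match(line)
        if match:
            per_hour[match.group(1)] += 1  # bucket by "YYYY/MM/DD HH"

for hour, count in sorted(per_hour.items()):
    print(f"{hour}:00  {count} resets")
```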

jrosal06 commented 3 weeks ago

It is striking that in Kubernetes 1.25 (OpenShift 4.12) this 104: Connection reset by peer error does not appear, while we do see it in the Kubernetes 1.27 cluster (OpenShift 4.14). Could something have changed in the headers that Kong sends to the client (the OpenShift infra node with HAProxy 2.6.13)? Is it possible that fine-tuning has to be added to Kong's parameters?

chobits commented 2 weeks ago

Hi @jrosal06,

After re-reviewing your tcpdump info and your Kong log messages, I found some new interesting things which might give you hints for further troubleshooting.

From your tcpdump capture (the second, "after" capture in your issue description), we know that the flow looks as follows:

1. 11:40:43, client opens a connection to the server (TCP 3-way handshake)
2. 11:40:43, client sends a request to Kong, e.g. `GET /provisioning-services/api/health HTTP/1.1`
3. 11:40:44, Kong replies with a response to the client, e.g. `HTTP/1.1 200 OK , JSON (application/json)`
4. (roughly 4 seconds of idle time)
5. 11:40:48, client sends an RST to Kong to shut down the connection ungracefully

So from this data flow and Kong's log `epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too (104: Connection reset by peer) while connecting to upstream`, it looks like Kong tried to connect to the upstream but could not establish the connection. But I don't know why: if Kong did not establish the upstream connection, how could the client still get the response `HTTP/1.1 200 OK , JSON (application/json)`? Could you capture the whole exchange to check whether the client received the complete response header plus body, or only the HTTP header?

I also recommend resending the health check request from different nodes across the entire path to observe their behavior. For client -(1)-> haproxy -(2)-> kong -(3)-> upstream, you could send the health check request at points (1), (2) and (3) and observe the behaviour at each. With this, we could detect whether anything abnormal is happening between nodes (1), (2) and (3).
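
A rough sketch of such a probe is below: it sends the health check request once, reads the response, then waits on the idle keep-alive connection and reports whether the peer eventually closes it with a FIN (recv() returns empty) or an RST (ConnectionResetError, errno 104). The host, port, path and Host header are taken from this thread and are only examples; point them at each hop in turn.

```python
import socket

HOST, PORT = "10.124.7.57", 8080          # Kong pod from the capture; change per hop
PATH = "/provisioning-services/api/health"
HOSTNAME = "apigw.apps.ocp4.cloud.com"

request = (
    f"GET {PATH} HTTP/1.1\r\n"
    f"Host: {HOSTNAME}\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
).encode()

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    sock.sendall(request)
    sock.settimeout(30)                    # wait long enough to see how the peer closes
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:                  # empty read => peer closed gracefully with FIN
                print("peer closed with FIN")
                break
            print(f"received {len(chunk)} bytes")
    except ConnectionResetError:           # errno 104 => peer closed with RST
        print("peer closed with RST (104: Connection reset by peer)")
    except socket.timeout:
        print("connection still open after 30s (no FIN/RST seen)")
```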

> Is it possible that fine-tuning has to be added to Kong's parameters?

Since Kong was not modified in your case, but HAProxy was, I strongly suspect that something unexpected is happening in the modified HAProxy.

jrosal06 commented 2 weeks ago

Hello @chobits

We can see the full HTTP response (headers and body) reaching the client 10.126.2.2:

+-----+-----------------+-------------+-------------+-----------+--------+------------------------------------------------------------------------------------------------------------------------------------+
| No. |       Time      |    Source   | Destination |  Protocol | Length |                                                                Info                                                                |
+-----+-----------------+-------------+-------------+-----------+--------+------------------------------------------------------------------------------------------------------------------------------------+
|  1  | 17:28:42,935081 |  10.126.2.2 | 10.124.6.52 |    TCP    |   80   |               58518  >  8080 [SYN] Seq=275294496 Win=62020 Len=0 MSS=8860 SACK_PERM TSval=4082323837 TSecr=0 WS=128                |
|  2  | 17:28:42,935105 | 10.124.6.52 |  10.126.2.2 |    TCP    |   80   | 8080  >  58518 [SYN, ACK] Seq=2627929200 Ack=275294497 Win=61936 Len=0 MSS=8860 SACK_PERM TSval=1806877333 TSecr=4082323837 WS=128 |
|  3  | 17:28:42,935461 |  10.126.2.2 | 10.124.6.52 |    TCP    |   72   |                58518  >  8080 [ACK] Seq=275294497 Ack=2627929201 Win=62080 Len=0 TSval=4082323838 TSecr=1806877333                 |
|  4  | 17:28:42,935482 |  10.126.2.2 | 10.124.6.52 |    HTTP   |  387   |                                           GET /provisioning-services/api/health HTTP/1.1                                           |
|  5  | 17:28:42,935488 | 10.124.6.52 |  10.126.2.2 |    TCP    |   72   |                8080  >  58518 [ACK] Seq=2627929201 Ack=275294812 Win=61696 Len=0 TSval=1806877333 TSecr=4082323838                 |
|  6  | 17:28:44,027449 | 10.124.6.52 |  10.126.2.2 | HTTP/JSON |  690   |                                             HTTP/1.1 200 OK , JSON (application/json)                                              |
|  7  | 17:28:44,027744 |  10.126.2.2 | 10.124.6.52 |    TCP    |   72   |                58518  >  8080 [ACK] Seq=275294812 Ack=2627929819 Win=61568 Len=0 TSval=4082324930 TSecr=1806878425                 |
|  8  | 17:28:44,028030 |  10.126.2.2 | 10.124.6.52 |    TCP    |   72   |              58518  >  8080 [RST, ACK] Seq=275294812 Ack=2627929819 Win=61568 Len=0 TSval=4082324931 TSecr=1806878425              |
+-----+-----------------+-------------+-------------+-----------+--------+------------------------------------------------------------------------------------------------------------------------------------+

HTTP stream

GET /provisioning-services/api/health HTTP/1.1
host: apigw.apps.ocp4.cloud.com
x-forwarded-host: apigw.apps.ocp4.cloud.com
x-forwarded-port: 443
x-forwarded-proto: https
forwarded: for=186.55.58.252;host=apigw.apps.ocp4.cloud.com;proto=https
x-forwarded-for: 186.55.58.252

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 312
Connection: keep-alive
Server: gunicorn
Date: Mon, 27 Aug 2024 13:28:44 GMT
X-Frame-Options: SAMEORIGIN
X-Kong-Upstream-Latency: 1091
X-Kong-Proxy-Latency: 1
Via: kong/3.7.1
X-Kong-Request-Id: 98123f71ea9c2140d098e656c9d10f60

{"codigo": 0, "mensaje": "\u00c9xito", "status": "OK", "statusData": [{"name": "fdm", "value": true}, {"name": "rim", "value": true}, {"name": "gis", "value": true}, {"name": "db", "value": true}, {"name": "rsm", "value": true}, {"name": "celeryBroker", "value": true}, {"name": "activeWorkers", "value": true}]}