Sfonxs opened this issue 3 years ago
We're facing the same issue with a Python web application and the same ingress/service configuration; changing the request timeout has a similar effect for us. I want to add that the channel is being closed with the code:
1006 - Abnormal Closure - Indicates that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected.
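For reference, the close codes mentioned here are defined in RFC 6455. A small illustrative helper (not from the original report) that maps the common codes:

```python
# Common WebSocket close codes per RFC 6455 section 7.4.1.
# 1006 is the code reported in this thread: the connection dropped
# without the server ever sending a close frame.
CLOSE_CODES = {
    1000: "Normal Closure",
    1001: "Going Away",
    1002: "Protocol Error",
    1006: "Abnormal Closure (no close frame sent)",
    1011: "Internal Error",
}

def describe_close_code(code: int) -> str:
    """Return a human-readable description for a WebSocket close code."""
    return CLOSE_CODES.get(code, f"Unknown/unregistered code {code}")
```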
@urucoder Are you using the classic Application Gateway or do you use AGIC?
@baptistepattyn We use AGIC
I have been in contact with support and together we found the root cause of the issue: an unrelated pod in an unrelated namespace was in a crash loop, and every time it tried to start up, AGIC picked up the new pod IP and pushed the changes to the Application Gateway. The Application Gateway then disconnects all open WebSocket connections on any update to any backend pool.
Fixing the crash loop fixed the disconnect problem.
With this info we were also able to confirm that starting/stopping any pod with a linked ingress controller configuration, anywhere in the cluster, will disconnect all WebSockets on the Application Gateway. This behavior is very inconvenient and unexpected, and I still see this as an issue.
@Sfonxs - was the AGIC set to watch all namespaces or was the namespace that had the pod in a crash loop outside of the AGIC scope?
Thanks for reaching out @mscatyao. AGIC was set up to watch all namespaces, as all namespaces have at least some pods that need AGIC ingress rules. In this case the crashing pod also had a matching AGIC ingress.
When the pod crashes and starts up again, does the Pod IP change every time or remain the same?
The pod IP stays the same, it is the container within the pod that kept crashing
I tried to get this flaw in Application Gateway fixed via a support ticket, but they confirmed it's designed behaviour and recommended opening a user feedback item:
https://feedback.azure.com/d365community/idea/a2dcfc40-ba9f-ec11-a81c-000d3adfb8f5
So don't use AGIC if you have many WebSocket clients that should not have to reconnect on the slightest change to the ingress rules.
Thanks for finding this out. So this also means all WebSocket connections are reset when we scale pods? That makes this setup unusable for dynamic scaling?
I created a Stack Overflow question describing my repro here
As this was still an ongoing issue that caused big problems for our WebSocket stability, we had to step away from AGIC completely. We now define the whole AppGw in a Bicep template and route to our LoadBalancer services in the AKS cluster.
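For context, routing an externally managed AppGw to the cluster this way only needs plain LoadBalancer services on the AKS side. A minimal sketch (names and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: websocket-app   # hypothetical name
spec:
  type: LoadBalancer    # exposed directly; AGIC is no longer involved
  selector:
    app: websocket-app  # must match the pod labels
  ports:
    - port: 80
      targetPort: 5000  # hypothetical container port
```

The AppGw backend pool then points at the load balancer's IP, so pod churn inside the cluster no longer triggers Application Gateway updates.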
So in conclusion, if you use WebSockets at all, do not use AGIC, as it will eventually give you stability problems. Every time HPA triggers or a new container is deployed, all WebSockets in the whole cluster are disconnected.
So for me this issue can be closed as we no longer use this service...
Any update on this being fixed?
We are also encountering the same issue. Any update on when this should be fixed?
Describe the bug We have an AKS setup with AGIC. There is a pod running an ASP.NET 5 application. In front of this pod there is a service:
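(The original service manifest did not survive in this thread; a minimal sketch of such a service, with hypothetical names and ports:)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: websocket-app   # hypothetical name
spec:
  selector:
    app: websocket-app  # must match the pod labels
  ports:
    - port: 80
      targetPort: 5000  # hypothetical ASP.NET container port
```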
In front of this service there is the following ingress configuration:
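(The original ingress manifest is also missing; an AGIC ingress typically looks like this, with hypothetical names and host:)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: websocket-app   # hypothetical name
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
spec:
  rules:
    - host: example.contoso.com  # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: websocket-app
                port:
                  number: 80
```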
Running this setup, we can see that the ASP.NET application reports a consistent disconnect of WebSockets every 45 seconds. When adding the following annotation:
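(The original annotation snippet was not preserved; AGIC's request-timeout setting is an ingress annotation of this shape, with a hypothetical value:)

```yaml
metadata:
  annotations:
    appgw.ingress.kubernetes.io/request-timeout: "15"  # hypothetical value, in seconds
```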
The disconnect interval changes to 15 seconds. When changing the request-timeout to:
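(The changed value was not preserved either; the annotation form is the same, with a different hypothetical value:)

```yaml
metadata:
  annotations:
    appgw.ingress.kubernetes.io/request-timeout: "45"  # hypothetical value, in seconds
```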
The disconnect interval is back to 45 seconds.
When using `kubectl proxy` to go directly to the K8S service, the WebSocket does not disconnect, which seems to indicate the issue is related to AGIC.
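(For reference, bypassing AGIC via the API server proxy looks like this; the namespace and service name are hypothetical:)

```shell
# Start a local proxy to the Kubernetes API server (defaults to localhost:8001)
kubectl proxy &

# Reach the service through the API server proxy,
# bypassing the Application Gateway entirely
curl http://localhost:8001/api/v1/namespaces/default/services/websocket-app:80/proxy/
```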
These disconnects also happen when the websocket is actively used, with the same interval.
Additional incident On our other AKS cluster (which hosts our production environment), with the same AGIC setup and the same containers/pods, these disconnects do not happen. The WebSockets just stay open. However, during a specific incident from 20 September, 9AM to 21 September, 4PM, where we did not change any of our deployments, these disconnects started occurring. They later stopped on their own and the cause remains unknown. (The screenshot comes from our own custom logging in a Log Analytics Workspace.)
To Reproduce Set up a pod that can handle WebSocket connections and notice that the WebSockets are closed at a fixed interval, regardless of traffic going over them.
Ingress Controller details