Azure / application-gateway-kubernetes-ingress

This is an ingress controller that can be run on Azure Kubernetes Service (AKS) to allow an Azure Application Gateway to act as the ingress for an AKS cluster.
https://azure.github.io/application-gateway-kubernetes-ingress
MIT License

WebSockets disconnecting even while in use #1277

Open Sfonxs opened 3 years ago

Sfonxs commented 3 years ago

Describe the bug
We have an AKS setup with AGIC. There is a pod running an ASP.NET 5 application. In front of this pod there is a service:

apiVersion: v1
kind: Service
metadata:
  name: tunnel-service
spec:
  ports:
  - port: 80
    targetPort: public
  selector:
    app: tunnelservice
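
For context, a minimal sketch of the workload side that this service selects, assuming the container exposes a named port called public (the Deployment name, image, and container port below are illustrative, not taken from the original report):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tunnelservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tunnelservice
  template:
    metadata:
      labels:
        app: tunnelservice
    spec:
      containers:
      - name: tunnelservice
        image: example.azurecr.io/tunnelservice:latest  # hypothetical image
        ports:
        - name: public          # matched by the service's targetPort
          containerPort: 8080   # assumed port for the ASP.NET app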

In front of this service there is the following ingress configuration:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
    name: 1tunnel-ingress
    namespace: dev
    annotations:
        kubernetes.io/ingress.class: azure/application-gateway
        appgw.ingress.kubernetes.io/appgw-ssl-certificate: "***-wildcard"
        appgw.ingress.kubernetes.io/ssl-redirect: "true"
        appgw.ingress.kubernetes.io/connection-draining: "true"
        appgw.ingress.kubernetes.io/connection-draining-timeout: "30"
spec:
    rules:
    - host: tunnel-dev.***
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
                name: tunnel-service
                port:
                    number: 80

Running this setup, we can see that the ASP.NET application reports a consistent disconnect of WebSockets every 45 seconds. When adding the following annotation:

        appgw.ingress.kubernetes.io/request-timeout: "10"

The disconnect interval changes to 15 seconds. When changing the request-timeout to:

        appgw.ingress.kubernetes.io/request-timeout: "120"

The disconnect interval is back to 45 seconds.
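
For reference, the AGIC annotation documentation describes request-timeout as the Application Gateway request timeout in seconds; it sits in the ingress metadata alongside the other annotations shown earlier, e.g.:

metadata:
    annotations:
        appgw.ingress.kubernetes.io/request-timeout: "120"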

When using "kubectl proxy" to go directly to the K8S service, the websocket does not disconnect, which seems to indicate the issue is related to AGIC.

These disconnects also happen when the websocket is actively used, with the same interval.

Additional incident
On our other AKS cluster (which hosts our production environment), with the same AGIC setup and the same containers/pods, these disconnects do not happen; the websockets just stay open. However, during a specific incident from 20 September, 9 AM to 21 September, 4 PM, during which we did not change any of our deployments, these disconnects started occurring. These disconnects fixed themselves, and the cause remains unknown. (The screenshot came from our own custom logging in a Log Analytics Workspace.)

To Reproduce
Set up a pod that can handle websocket connections and notice that the websockets are closed at a fixed interval, regardless of traffic going over them.

Ingress Controller details

Name:         ingress-appgw-deployment-66b997c8cd-c29r7
Namespace:    kube-system
Priority:     0
Node:         aks-agentpool-26037871-vmss000000/10.5.0.4
Start Time:   Fri, 24 Sep 2021 08:42:00 +0200
Labels:       app=ingress-appgw
              kubernetes.azure.com/managedby=aks
              pod-template-hash=66b997c8cd
Annotations:  checksum/config: 810370c82f65bc701ac95e1bb0a9f01ceedced926bbcc0e5e7163fe2a156b005
              cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubernetes.azure.com/metrics-scrape: true
              prometheus.io/path: /metrics
              prometheus.io/port: 8123
              prometheus.io/scrape: true
              resource-id:
                ***
Status:       Running
IP:           ***
IPs:
  IP:           ***
Controlled By:  ReplicaSet/ingress-appgw-deployment-66b997c8cd
Containers:
  ingress-appgw-container:
    Container ID:   containerd://13db8546dc4827f6edd64162c93a47325a39145bcd567a91cc747aece3506fe3
    Image:          mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:1.4.0
    Image ID:       sha256:533f2cbe57fa92d27be5939f8ef8dc50537d6e1240502c8c727ac4020545dd34
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 24 Sep 2021 08:42:08 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     700m
      memory:  100Mi
    Requests:
      cpu:      100m
      memory:   20Mi
    Liveness:   http-get http://:8123/health/alive delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8123/health/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      ingress-appgw-cm  ConfigMap  Optional: false
    Environment:
***
    Mounts:
      /etc/kubernetes/azure.json from cloud-provider-config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tzwhf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  cloud-provider-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/azure.json
    HostPathType:  File
  kube-api-access-tzwhf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
urucoder commented 2 years ago

We're facing the same issue with a Python web application and the same ingress/service configuration; changing the request timeout also has a similar effect for us. I want to add that the channel is being closed with the code:

1006 - Abnormal Closure - Indicates that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected.

baptistepattyn commented 2 years ago

@urucoder Are you using the classic Application Gateway or do you use AGIC?

urucoder commented 2 years ago

@baptistepattyn We use AGIC

Sfonxs commented 2 years ago

I have been in contact with support, and together we found the root cause of the issue: some unrelated pod in an unrelated namespace was in a crash loop, and every time it tried to start up, AGIC picked up the new pod IP and pushed the changes to the Application Gateway. The Application Gateway, however, disconnects all open websocket connections on any update to any backend pool.

Fixing the crash loop fixed the disconnect problem.

With this info we were also able to confirm that starting/stopping any pod with a linked ingress configuration, anywhere in the cluster, will disconnect all websockets going through the application gateway. This behavior is very inconvenient and unexpected, and I still see it as an issue.

mscatyao commented 2 years ago

@Sfonxs - was the AGIC set to watch all namespaces or was the namespace that had the pod in a crash loop outside of the AGIC scope?

Sfonxs commented 2 years ago

Thanks for reaching out @mscatyao. AGIC was set up to watch all namespaces, as all namespaces have at least some pods that need the AGIC ingress. In this case the crashing pod also had a matching AGIC ingress configuration.
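
For setups installed via the Helm chart (rather than the AKS add-on), the watch scope can be narrowed so that unrelated namespaces do not trigger gateway updates; a sketch of the relevant ingress-azure chart values, with an illustrative namespace:

# values.yaml for the ingress-azure Helm chart (sketch)
kubernetes:
  watchNamespace: dev   # single namespace or a comma-separated list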

mscatyao commented 2 years ago

When the pod crashes and starts up again, does the Pod IP change every time or remain the same?

Sfonxs commented 2 years ago

The pod IP stays the same; it is the container within the pod that kept crashing.

FrancescoRestelli commented 2 years ago

I tried to get this flaw in Application Gateway fixed via a support ticket, but they confirmed it's designed behaviour and recommended opening a user feedback item:

https://feedback.azure.com/d365community/idea/a2dcfc40-ba9f-ec11-a81c-000d3adfb8f5

So don't use AGIC if you have many websocket clients that should not reconnect on the slightest change to the ingress rules.

marxxxx commented 2 years ago

> I tried to get this flaw in Application Gateway fixed via a support ticket, but they confirmed it's designed behaviour and recommended opening a user feedback item:
>
> https://feedback.azure.com/d365community/idea/a2dcfc40-ba9f-ec11-a81c-000d3adfb8f5
>
> So don't use AGIC if you have many websocket clients that should not reconnect on the slightest change to the ingress rules.

Thanks for finding this out. So this also means all websocket connections are reset when we scale pods? That would make this setup unusable for dynamic scaling?

I created a Stack Overflow question describing my repro here.
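
For concreteness, a sketch of the kind of autoscaler whose scale events would, per the finding above, update the Application Gateway backend pool and drop every open websocket (names and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tunnelservice
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tunnelservice   # hypothetical deployment behind the ingress
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # each scale event changes the backend pool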

ghost commented 2 years ago

As this was still an ongoing issue that caused big problems for our websocket stability, we had to step away from AGIC completely. We now define the whole AppGw in a Bicep template and route to LoadBalancer services in the AKS cluster.
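
A minimal sketch of the kind of Service such a setup can route to instead, assuming an internal Azure load balancer in front of the pods (the annotation is the documented AKS one; ports and selector mirror the service above):

apiVersion: v1
kind: Service
metadata:
  name: tunnel-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"  # assumes an internal LB is wanted
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: public
  selector:
    app: tunnelservice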

So in conclusion: if you use websockets at all, do not use AGIC, as it will eventually give you stability problems. Every time HPA triggers or a new container is deployed, all websockets in the whole cluster are disconnected.

So for me this issue can be closed as we no longer use this service...

shudson302 commented 1 year ago

Any update on this being fixed?

nabeel-jc commented 6 months ago

We are also encountering the same issue. Any update on when this should be fixed?