Application Gateway connected to AGIC is so slow it can take 40 minutes to update pod ip adresses

NiklasArbin commented 3 years ago

We run a AKS with AGIC and Application Gateway.

During deploy the Application Gateway drops all old pods from the backend pools but does not send traffic the the new pods directly. It takes 10min to 45min for the application gateway to update to the new state. As a result we loose 100% of traffic and failover to our secondary AKS is initiaded by traffic manager.

After ~45 minutes the Application Gatway found the new pods this time.

During this downtime I saw the correct IPs (in the backend pool) for the pods in our AKS. But they did not show up at all in 'Backend Health' view in application gateway.

The problem is that the application gateway gets stuck in updating state or is atleast very very slow at updating.

To Reproduce Run agic with production load and deploy new pods

Any Azure support tickets associated with this issue. 121020225002461 and 121011125001033

akshaysngupta commented 3 years ago

@NiklasArbin I took a quick look and found that update was delayed as there was a platform upgrade in progress.

NiklasArbin commented 3 years ago

@NiklasArbin I took a quick look and found that update was delayed as there was a platform upgrade in progress.

Ok, great to have some deeper understanding of what's happening. How are we as Customers supposed to deal with that the platform might be unavailable for updates at (for us) random times, leaving our api endpoints in a 502 state? Bonus is that'll have to pay for both the unreachable cluster, and also for the backup cluster now scaling up to handle 100% traffic.

Some questions that arise:

If the platform is upgraded should not the instances be switched and removed from the customer while upgrading?
Should you introduce service binding in agic as an option instead of pointing directly to the pods?
Should we abandon agic and try something else? What options are available instead? Nginx? Nginx+agic.

There seems to be something fundamentally broken in the system design here. Can I or my team be of assistance here?

NiklasArbin commented 3 years ago

@akshaysngupta So I've done some more research and received answers about the Application Gateway update performance from Azure Support. My conclusion is that the foundation of the AGIC design is broken, because the Application Gateway does not work the way AGIC needs it to work.

Application Gateway will usually update ip addresses within 60 seconds, but there is no such guarantee. From Azure support, a 40 minute update is working as intended.

The AGIC is built upon the assumption that the Application gateway guarantees fast updates. This assumption is false. Obviously the AppGw should have better performance and guarantees, but that is simply not the case.

Possible solutions that are within the AGIC teams reach:

Allow service routing, do not bind directly to pod ip.
Allow AGIC to configure AppGw to point to an Internal Load Balancer
Allow AGIC to configure AppGW with custom ip adresses

But you need to change the documentation until this is resolved. In current state with pod ip binding AGIC is not Production Ready.

Best Regards Niklas

thatguycalledrob commented 3 years ago

It is the same for us, restarting pods (AKA totally normal & standard k8s deployment update behaviour) causes the backend pools to become unhealthy for 20-40 mins, even though the pods are back up and responding to the health checks in <5s.

Its pretty unbelievable that its routing to the PodIP and NOT the virtual service IP configured by k8s. It looks like AGIC is treating the pods as ephemeral VM's which inconveniently exist within the k8s network, rather than a black box which sit behind a service which exists in a k8s cluster. There is a reason that we specify a serviceName and servicePort in the ingress, and that's not as a handy-dandy "look for pods under this name" flag.

@akshaysngupta - I have sent @awkwardindustries the details around our issue and deployment. I can also give you a demo of the issue over teams if this would help.

They is a pretty large showstopper for us. We are deploying onto an infrastructure which is out of the software development teams control, and we have been told that we must use Appgw. Yet it seems totally unfit for purpose (within the context of k8s).

NiklasArbin commented 3 years ago

@thatguycalledrob I wanted to let the AGIC team respond to this, but since it's a bit quiet here i'll fill you in on what we've done.

After consulting with our Microsoft account manager and his technical team, their response was that we should not use AGIC in production also based from other customers experience. So AGIC is a not go for us.

However the workaround is pretty simple, just expose a service of type LoadBalancer with the internal annotation, and then connect another AppGw to that internal load balancer, and deploys are now smooth as silk.

jibinbabu commented 3 years ago

Is application gateway + nginx ingress a feasible way for AKS if AGIC is broken? or simply nginx-ingress controller with default azure lb should do? Appreciate any comments

thatguycalledrob commented 3 years ago

@NiklasArbin - The really frustrating thing for me, is that the decision to use AppGW/AGIC is driven by MSFT engineers and architects who we are collaborating with. Therefore I am unable to NOT use AGIC, even though I have been advised that there are "quite a few known issues, which multiple clients have flagged up". I wish I could just use the standard external Azure LB & completely drop AGIC, but this is proving difficult politically.

Here is a fun fact - the shared Kubernetes cluster (designed + built by MSFT) which we are supposed to be deploying our application into currently has autoscaling disabled because AGIC can't ensure HA when autoscaling! Its almost as if the solution has been designed to work on VM's, and been inconveniently deployed into a kubernetes cluster and VPC...

Anyway, thanks for the tip - I will take a look at placing an internal LB in front of my services. It feels like I am emulating a Service IP using a load balancer at that point, but I suppose that this is more akin to how k8s is supposed to work.

Link to the relevant docs for those who reach this issue in the future: https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/aks/load-balancer-standard.md

@jibinbabu - unless you're in need of a WAF (or one of the other AppGW features), it sounds like you should go for the standard Azure LB. I suppose that you could run nginx and point AppGW at it, but you will still run into the issue of 502's when autoscaling.

chunliu commented 3 years ago

@NiklasArbin and @thatguycalledrob I am wondering if the tips and tricks mentioned in this document could help addressing the issues you are facing? https://azure.github.io/application-gateway-kubernetes-ingress/how-tos/minimize-downtime-during-deployments/

NiklasArbin commented 3 years ago

@NiklasArbin and @thatguycalledrob I am wondering if the tips and tricks mentioned in this document could help addressing the issues you are facing? https://azure.github.io/application-gateway-kubernetes-ingress/how-tos/minimize-downtime-during-deployments/

Sorry no, we already implemented all the suggestions from that document. Problem is that the problem is very fundamental in the way the AGIC is built and how the AppGW works, are not compatible. The AGIC design requires AppGW to make guarantees that it simply doesn't. The design is in itself broken.

I have made suggestions for a design that would be in line with what the AppGW can guarantee, but the team have not responded. There seems to be more than technical issues that this team struggles with.

ghost commented 3 years ago

This is pretty severe, since simple IP table changes result in a couple of minutes of down time - in production. The Kubernetes Service responsible for distribution of traffic to the pods in a Deployment is taken out of the equation, instead AppGw communicate directly with the pods and it's considered a feature with "one less hop", but it's broken. It seems like a fundamental design flaw and the above tips & tricks is not a work-around sufficient to eliminate the problems.

SteveCurran commented 3 years ago

@NiklasArbin I came across this issue today because we are being told by Microsoft that AKS 1.20+ will have DSR enabled by default. Unfortunately, they are stating that DSR will not work with any ingress controller except AGIC. We have multi-tenancy in our clusters and having to use an internal load balancer on each customer's service proved costly, that is why we used nginx ingress to control these costs. #2236

SteveCurran commented 3 years ago

We also tested with service type of internal load balancer, and with AKS 1.20+ you will receive periodic 502 errors. So AGIC is Microsoft's only option going forward in 1.20+

NiklasArbin commented 3 years ago

@NiklasArbin I came across this issue today because we are being told by Microsoft that AKS 1.20+ will have DSR enabled by default. Unfortunately, they are stating that DSR will not work with any ingress controller except AGIC. We have multi-tenancy in our clusters and having to use an internal load balancer on each customer's service proved costly, that is why we used nginx ingress to control these costs. #2236

@SteveCurran , but this feature is for Windows containers only right? From what I understand the Azure LB already by default operates in a "floating ip" mode under the hood.

NiklasArbin commented 3 years ago

Also @SteveCurran the AGIC implementation is still broken and should not be used in production. The team has yet to respond and keeps ignoring issues regarding this problem in the repo.

SteveCurran commented 3 years ago

@NiklasArbin DSR is enable for both linux and windows in 1.20+ when starting the kube-proxy.

mscatyao commented 3 years ago

We're actively working on reducing update times right now and should have and update by end of June. Do you all have support tickets open on these issues? We can take a look at your gateways through the tickets and validate what the problem is behind your gateway(s). Feel free to tag me in the future for any updates as well.

danielnagyyy commented 3 years ago

@mscatyao Do you have an update on this? Is the update out now? We've experienced a ~10min outage about 2 weeks ago, not sure if it was fixed in the meantime as you've mentioned end of June.

akshaysngupta commented 3 years ago

@danielnagyyy please feel free to create a support ticket. We can take a look at the gateway.

aaronspruit commented 2 years ago

We're actively working on reducing update times right now and should have and update by end of June. Do you all have support tickets open on these issues? We can take a look at your gateways through the tickets and validate what the problem is behind your gateway(s). Feel free to tag me in the future for any updates as well.

Even with an SLA around AppGW updates, this is still not addressing the root cause. An update to AppGW is still a RESTful request that needs to happen off-box, which inherently could fail or take extended amounts of time...causing health probe failures and outages to occur. To prevent this potential outage with each deployment (and as mentioned in the 3rd post) the AppGW should use the service as a backend, as that is not changing during every deployment. This is fundamentally why you specify a service in your ingress definition, and not the individual pod/deployment information - and why other ingress controllers such as NGINX user the service definition also. The service abstraction defines this in a long-lived object / IP.

spr0ut commented 2 years ago

Just to add to this discussion - I think it's pretty telling that ~~none~~ only one of Microsoft's AKS baseline architectures use AGIC:

Basic - Traefik
Multi region - Traefik
PCI DSS - Nginx
Multi tenant - AGIC

After much failed testing with AGIC we are now using Nginx ingress with AppGw as the WAF component only.

kamilzzz commented 2 years ago

Even with an SLA around AppGW updates, this is still not addressing the root cause. An update to AppGW is still a RESTful request that needs to happen off-box, which inherently could fail or take extended amounts of time...causing health probe failures and outages to occur. To prevent this potential outage with each deployment (and as mentioned in the 3rd post) the AppGW should use the service as a backend, as that is not changing during every deployment. This is fundamentally why you specify a service in your ingress definition, and not the individual pod/deployment information - and why other ingress controllers such as NGINX user the service definition also. The service abstraction defines this in a long-lived object / IP.

Agree with everything but the last part regarding other ingress controllers using service definitions. For example, community driven ingress-nginx does not use service by default. It routes traffic directly to the pods. From my understanding, one reason is that with Service as a target, ingress controllers won't be able to offer some of their basic features - like different load balancing methods or session affinity.

ohadschn commented 2 years ago

From my understanding, one reason is that with Service as a target, ingress controllers won't be able to offer some of their basic features - like different load balancing methods or session affinity.

These features would be limited to the first service in the chain anyway. Any further service to service communication inside the cluster would presumably work via k8s service load balancing. At any rate, I think it would be beneficial to at least allow AGIC consumers to choose the simpler k8s load balancing approach if they don't need said features (or at least don't find them important enough to outweigh the cons mentioned above): https://github.com/Azure/application-gateway-kubernetes-ingress/issues/1427

kamilzzz commented 2 years ago

These features would be limited to the first service in the chain anyway. Any further service to service communication inside the cluster would presumably work via k8s service load balancing.

Sure. Ingress controller would ingress to the first service via endpoints and what happens next is completely different story and there are other options to handle next services in the chain (service meshes or maybe kube-proxy with IPVS). I just wanted to point out that it's not extraordinary that Application Gateway Ingress Controller has chosen to use endpoints directly instead of services. However, it obviously has its flaws with the current design.

NiklasArbin commented 2 years ago

Another option if latency of the internal load balancer is an issue is to use headless kubernetes services. and directly access the pods.

ohadschn commented 2 years ago

Another option if latency of the internal load balancer is an issue is to use headless kubernetes services. and directly access the pods.

I don't follow, the issue is the latency of AGIC recognizing changes in the k8s Endpoint API. We actually do want the k8s load balancer to be used (instead of the AGIC one) so that it's less sensitive to these changes (as service cluster IPs are stable).

NiklasArbin commented 2 years ago

Another option if latency of the internal load balancer is an issue is to use headless kubernetes services. and directly access the pods.

I don't follow, the issue is the latency of AGIC recognizing changes in the k8s Endpoint API. We actually do want the k8s load balancer to be used (instead of the AGIC one) so that it's less sensitive to these changes (as service cluster IPs are stable).

You're right, headless would require AGIC to be able to do DNS discovery load balancing.

jblsk commented 1 year ago

Any updates on this issue? We bumped into the topic in 2023.

SilverioMiranda commented 1 year ago

Any updates on this issue? We bumped into the topic in 2023.

Nops, probably in the next 5 years they will fix it

NiklasArbin commented 10 months ago

How is this completed?

SilverioMiranda commented 9 months ago

How is this completed?

icegif-1185

Azure / application-gateway-kubernetes-ingress

Application Gateway connected to AGIC is so slow it can take 40 minutes to update pod ip adresses #1124