Azure / application-gateway-kubernetes-ingress

This is an ingress controller that can be run on Azure Kubernetes Service (AKS) to allow an Azure Application Gateway to act as the ingress for an AKS cluster.
https://azure.github.io/application-gateway-kubernetes-ingress
MIT License

AGIC 1.7.0 running into segmentation fault when using workload identity #1532

Closed HelenaSeidel closed 1 year ago

HelenaSeidel commented 1 year ago

Describe the bug

k8s version: 1.25.6
AGIC version: 1.7.0

I should mention that we upgraded k8s from 1.25.4, although I don't really believe that this is related.

We had AGIC running with workload identity once; however, it now runs into a segmentation fault shortly after startup.

UPDATE: we identified why it was working before; see the repro steps below.

I0414 07:44:24.141494       1 utils.go:114] Using verbosity level 3 from environment variable APPGW_VERBOSITY_LEVEL
I0414 07:44:24.176221       1 supported_apiversion.go:70] server version is: 1.25.6
I0414 07:44:24.187027       1 environment.go:294] KUBERNETES_WATCHNAMESPACE is not set. Watching all available namespaces.
I0414 07:44:24.187049       1 main.go:118] Using User Agent Suffix='***' when communicating with ARM
I0414 07:44:24.187126       1 main.go:137] Application Gateway Details: Subscription="***" Resource Group="***" Name="****"
I0414 07:44:24.187137       1 auth.go:58] Creating authorizer using Default Azure Credentials
I0414 07:44:24.187145       1 httpserver.go:57] Starting API Server on :8123
I0414 07:44:24.423737       1 main.go:184] Ingress Controller will observe all namespaces.
I0414 07:44:24.486990       1 context.go:171] k8s context run started
I0414 07:44:24.487037       1 context.go:238] Waiting for initial cache sync
I0414 07:44:24.487117       1 reflector.go:219] Starting reflector *v1.Pod (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487413       1 reflector.go:219] Starting reflector *v1.Endpoints (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487427       1 reflector.go:255] Listing and watching *v1.Endpoints from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487431       1 reflector.go:255] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487119       1 reflector.go:219] Starting reflector *v1.Secret (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487728       1 reflector.go:255] Listing and watching *v1.Secret from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488102       1 reflector.go:219] Starting reflector *v1.Ingress (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488112       1 reflector.go:255] Listing and watching *v1.Ingress from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488247       1 reflector.go:219] Starting reflector *v1beta1.AzureApplicationGatewayRewrite (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488264       1 reflector.go:255] Listing and watching *v1beta1.AzureApplicationGatewayRewrite from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488461       1 reflector.go:219] Starting reflector *v1.IngressClass (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488474       1 reflector.go:255] Listing and watching *v1.IngressClass from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.487136       1 reflector.go:219] Starting reflector *v1.Service (30s) from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.488982       1 reflector.go:255] Listing and watching *v1.Service from pkg/mod/k8s.io/client-go@v0.20.0-beta.1/tools/cache/reflector.go:167
I0414 07:44:24.587595       1 context.go:251] Initial cache sync done
I0414 07:44:24.587633       1 context.go:252] k8s context run finished
I0414 07:44:24.587758       1 worker.go:39] Worker started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1375dcf]

goroutine 166 [running]:
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).newListener(0xc000842ea0, 0x0?, {0x50, {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...)
        /azure/pkg/appgw/frontend_listeners.go:155 +0x6f
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).getListeners(0xc000842ea0, 0xc0006de200)
        /azure/pkg/appgw/frontend_listeners.go:39 +0x2f3
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).Listeners(0xc000842ea0, 0xc0006de200?)
        /azure/pkg/appgw/http_listeners.go:11 +0x58
github.com/Azure/application-gateway-kubernetes-ingress/pkg/appgw.(*appGwConfigBuilder).Build(0xc000842ea0, 0x32b2?)
        /azure/pkg/appgw/configbuilder.go:119 +0x338
github.com/Azure/application-gateway-kubernetes-ingress/pkg/controller.AppGwIngressController.MutateAppGateway({{0x194b4e0, 0xc00014a000}, {{0xc00004a156, 0x24}, {0xc0000460d5, 0x10}, {0xc00004c00b, 0x14}}, 0xc0002db9b0, 0xc000528180, ...}, ...)
        /azure/pkg/controller/mutate_app_gateway.go:128 +0x7b3
github.com/Azure/application-gateway-kubernetes-ingress/pkg/controller.(*AppGwIngressController).ProcessEvent(0xc0001a35e0, {0xc00065af20?, {0x16d5d40?, 0xc0007a4140?}})
        /azure/pkg/controller/controller.go:134 +0x32c
github.com/Azure/application-gateway-kubernetes-ingress/pkg/worker.(*Worker).Run(0xc0001b03a0, 0xc0002ed380, 0xc000306de0)
        /azure/pkg/worker/worker.go:61 +0x405
created by github.com/Azure/application-gateway-kubernetes-ingress/pkg/controller.(*AppGwIngressController).Start
        /azure/pkg/controller/controller.go:83 +0x205

To Reproduce

Steps to reproduce the behavior: start AGIC 1.7.0 with workload identity.

UPDATE: we reproduced an old scenario in which the AGW had been configured by AGIC 1.6.0 and AGIC was then upgraded to 1.7.0. It is working now (not sure for how long, though). This means AGIC 1.7.0 has issues when running against an empty AGW.

Ingress Controller details

HelenaSeidel commented 1 year ago

this has been mentioned in #1364 by @giuliocalzolari and we can confirm this behavior

HelenaSeidel commented 1 year ago

To reproduce this properly, AGIC 1.7.0 has to run against an empty AGW. If the AGW is preconfigured, i.e. an older AGIC version ran against it before, AGIC 1.7.0 seems to work properly, including creating new configuration ¯\_(ツ)_/¯

DarChaos21 commented 1 year ago

We seem to have the same issue. When we use version 1.6 with service principal, the deployment works. Then when we upgrade to version 1.7 with managed identity, it also works. But when we use version 1.7 from scratch, we have the same error.

seizste commented 1 year ago

Can confirm this problem on empty Application Gateways as well with v1.7.0.

Our first attempt was to use Workload Identity with a completely fresh installation, but we could not get it working. After that, we tried a fresh installation of v1.7.0 on an empty Application Gateway with Managed Identity and got the same error that @HelenaSeidel described above.

Same configuration with Managed Identity and Version 1.6.0 worked on an empty Application Gateway.

cloebig commented 1 year ago

We can also confirm that with a new, empty AGW we run into the segmentation violation. Is there a complete example of using AGIC with Workload Identity somewhere?

cloebig commented 1 year ago

Hello,

Is the segmentation fault fixed now? And is there a new version with the fix? I only see 1.7.0 from 27 March 2023.

I already asked here: https://github.com/Azure/application-gateway-kubernetes-ingress/pull/1538

HelenaSeidel commented 1 year ago

we can confirm, it is fixed, thank you @akshaysngupta

karlschriek commented 1 year ago

The fix isn't present in any release yet, though. Is there an ETA for a 1.7.1 or 1.8.0 release that will include it?

johnnyaug commented 1 year ago

Using the nightly build has fixed it for me. I would appreciate an ETA on a release.

HelenaSeidel commented 1 year ago

they have overridden the old 1.7.0 tag... digests changed... re-pulling the image will do it... it's not best practice but that is what happened ¯\_(ツ)_/¯

cloebig commented 1 year ago

Is the Helm chart (ingress-azure) also updated?

karlschriek commented 1 year ago

> they have overridden the old 1.7.0 tag... digests changed... re-pulling the image will do it... it's not best practice but that is what happened ¯\_(ツ)_/¯

When did they do that? I deployed 1.7.0 earlier today and the error was still present. I then created my own image from the commit in #1538 and it worked...

@cloebig the Helm chart is still on 1.6.0, but you could bump the image to, for example, 1.7.0 by setting the image.tag value (you can see here that it is configurable in the values file).

AndreiBarbu95 commented 1 year ago

Hello @HelenaSeidel @DarChaos21 @seizste! I tried to update from 1.6.0 to 1.7.0 a few minutes ago via:

    helm upgrade \
      ingress-azure \
      application-gateway-kubernetes-ingress/ingress-azure \
      --version 1.7.0

But I still get "panic: runtime error: invalid memory address or nil pointer dereference". This also happens on a fresh installation. Not sure if it is related, but I use AGIC with Helm and a service principal. I have a few questions, if you can help, please:

1 - How were you able to get 1.7.0 working, and how did you transition to workload identity?

2 - I can see you mentioned that you configured AGIC using Managed Identity. Was that via Helm or the add-on? If via Helm, could you please let me know how? I don't see a managed identity option here: https://github.com/Azure/application-gateway-kubernetes-ingress/blob/master/docs/helm-values-documenation.md

Thank you!

TimDurward commented 1 year ago

I'm getting a panic on 1.7.0 too; it seems to work if I run 1.6.0, though.

HelenaSeidel commented 1 year ago

Sorry, this dropped off my radar. There was a new release 3 weeks ago, which should resolve the issues caused by the fix having been shipped in a new image under the same old tag. If there are any further issues, I suggest opening a new dedicated issue, because this one has been fixed.