Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Application Gateway for Containers ALB Controller not working for ARM Architecture #4516

Closed AndrewJR350 closed 3 weeks ago

AndrewJR350 commented 1 month ago

Describe the bug

The Application Gateway for Containers' ALB Controller is not functioning on ARM architecture when following the provided documentation.

To Reproduce

Steps to reproduce the behavior:

  1. Followed the commands in the documentation to install the Application Gateway using Helm.
  2. The ALB pods failed to start and crashed repeatedly. The error indicates that the format is not supported.

Expected behavior

The ALB Controller installation should start and register itself as healthy.

JackStromberg commented 1 month ago

Hello @AndrewJR350,

Do you have a link to the document you are following? The helm installation command should have no dependency on ARM.

What errors are you seeing on pod start?

AndrewJR350 commented 1 month ago

Hi @JackStromberg

I was following this documentation. The Helm installation created the pod, and I was able to list the ALB Controller pods using the command kubectl get pods -n azure-alb-system.

However, the pod never started. It pulled the image successfully, but when I checked the logs and the state of the pod, it was in a crash loop with the following error: exec user process caused "exec format error"

I checked the manifest of the image installed via Helm, and it doesn't mention anything platform-specific. I also could not find any source code for this image.

JackStromberg commented 1 month ago

What is your AKS cluster version?

What is the output when describing the bootstrap pod? kubectl describe pod alb-controller-bootstrap-<unique-id> -n azure-alb-system

a7ul commented 1 month ago

Hi @JackStromberg, this is the error we see on the pod:

 init-alb-controller-crds exec /usr/bin/sh: exec format error

This is the result of the command you requested:

kubectl describe pod alb-controller-bootstrap-778f96cfb4-mhdlw -n azure-alb-system
Name:                 alb-controller-bootstrap-778f96cfb4-mhdlw
Namespace:            azure-alb-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      alb-controller-sa
Node:                 aks-userpool-22578432-vmss000000/10.0.0.113
Start Time:           Wed, 18 Sep 2024 23:22:57 +0200
Labels:               app=alb-controller-bootstrap
                      pod-template-hash=778f96cfb4
Annotations:          kubernetes.azure.com/set-kube-service-host-fqdn: true
                      prometheus.io/port: 9002
                      prometheus.io/scrape: true
Status:               Pending
IP:                   10.0.0.132
IPs:
  IP:           10.0.0.132
Controlled By:  ReplicaSet/alb-controller-bootstrap-778f96cfb4
Init Containers:
  init-alb-controller-crds:
    Container ID:  containerd://464c73fa9d9eb27dbb3c4f0ee5dce9682dbe45bda9fc5a75c7378bc1ee6e23d5
    Image:         mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3
    Image ID:      mcr.microsoft.com/application-lb/images/alb-controller-crds@sha256:71dc7b7cc810a8eefb5d2fc12253a2aac42785483277101b93d90e83761aa218
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      kubectl apply -f /alb-controller-crds/agc-crds; kubectl apply -f /alb-controller-crds/gateway-api-crds;

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 18 Sep 2024 23:28:49 +0200
      Finished:     Wed, 18 Sep 2024 23:28:49 +0200
    Ready:          False
    Restart Count:  6
    Environment:
      KUBERNETES_SERVICE_HOST:       appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io
      KUBERNETES_PORT:               tcp://appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io:443
      KUBERNETES_PORT_443_TCP_ADDR:  appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cl84l (ro)
Containers:
  alb-controller-bootstrap:
    Container ID:
    Image:         mcr.microsoft.com/application-lb/images/alb-controller-bootstrap:1.2.3
    Image ID:
    Port:          9005/TCP
    Host Port:     0/TCP
    Command:
      /alb-controller-bootstrap
    Args:
      --log-level
      info
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  128Mi
    Requests:
      cpu:      100m
      memory:   128Mi
    Liveness:   http-get http://:9005/healthz delay=5s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:9005/healthz delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_HOST:       appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io
      KUBERNETES_PORT:               tcp://appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io:443
      KUBERNETES_PORT_443_TCP_ADDR:  appkube-dns-7bp3b9jc.hcp.eastus.azmk8s.io
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cl84l (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-cl84l:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  10m                  default-scheduler  Successfully assigned azure-alb-system/alb-controller-bootstrap-778f96cfb4-mhdlw to aks-userpool-22578432-vmss000000
  Normal   Pulling    10m                  kubelet            Pulling image "mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3"
  Normal   Pulled     10m                  kubelet            Successfully pulled image "mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3" in 5.681s (5.681s including waiting). Image size: 22749663 bytes.
  Normal   Created    8m28s (x5 over 10m)  kubelet            Created container init-alb-controller-crds
  Normal   Started    8m28s (x5 over 10m)  kubelet            Started container init-alb-controller-crds
  Normal   Pulled     8m28s (x4 over 10m)  kubelet            Container image "mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3" already present on machine
  Warning  BackOff    1s (x47 over 10m)    kubelet            Back-off restarting failed container init-alb-controller-crds in pod alb-controller-bootstrap-778f96cfb4-mhdlw_azure-alb-system(d704c403-e510-44d8-ac2b-ead7efc43511)

Upon deeper inspection, I think the issue is with the image the controller is using. The controller uses mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3, which doesn't have an ARM equivalent.

Internally, step 3 of the image build runs:

  RUN /bin/sh -c wget https://storage.googleapis.com/kubernetes-release/release/v1.30.1/bin/linux/amd64/kubectl -O /bin/kubectl && chmod +x /bin/kubectl

Here the kubectl architecture is hard-coded to amd64, so the binary breaks when it is executed on an ARM cluster.
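
For illustration only, here is a minimal sketch of how a build step could pick the kubectl binary that matches the target architecture instead of hard-coding amd64. The helper names k8s_arch and kubectl_url are hypothetical, and the dl.k8s.io path is the download location documented upstream by Kubernetes, which is an assumption about what a fixed build might use, not what the closed-source image build actually does.

```shell
#!/bin/sh
# Hypothetical helper: map a `uname -m` machine name to the architecture
# name used in Kubernetes release download paths.
k8s_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *) echo "unsupported machine type: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: build the kubectl download URL for a given
# release version and machine type (assumes the dl.k8s.io layout).
kubectl_url() {
    version="$1"
    arch="$(k8s_arch "$2")" || return 1
    echo "https://dl.k8s.io/release/${version}/bin/linux/${arch}/kubectl"
}
```

A build step could then call kubectl_url "v1.30.1" "$(uname -m)" and pass the result to wget, so the same Dockerfile would yield an arm64 binary when built for ARM.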

Hope this was helpful

a7ul commented 1 month ago

Running docker inspect on the controller's image paints a similar picture:

docker inspect  mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3
[
    {
        "Id": "sha256:71dc7b7cc810a8eefb5d2fc12253a2aac42785483277101b93d90e83761aa218",
        "RepoTags": [
            "mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3"
        ],
        "RepoDigests": [
            "mcr.microsoft.com/application-lb/images/alb-controller-crds@sha256:71dc7b7cc810a8eefb5d2fc12253a2aac42785483277101b93d90e83761aa218"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2024-08-30T16:36:00.029033041Z",
        "DockerVersion": "27.2.0",
        "Author": "",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": null,
            "ArgsEscaped": true,
            "Image": "",
            "Volumes": null,
            "WorkingDir": "/",
            "Entrypoint": [
                "/bin/kubectl"
            ],
            "OnBuild": null,
            "Labels": {
                "com.visualstudio.msazure.image.build.buildnumber": "1.2.3",
                "com.visualstudio.msazure.image.build.builduri": "vstfs:///Build/Build/102031627",
                "com.visualstudio.msazure.image.build.definitionname": "Networking-Kubic-Official",
                "com.visualstudio.msazure.image.build.repository.name": "Networking-Kubic",
                "com.visualstudio.msazure.image.build.repository.uri": "https://msazure.visualstudio.com/One/_git/Networking-Kubic",
                "com.visualstudio.msazure.image.build.sourcebranchname": "1.2",
                "com.visualstudio.msazure.image.build.sourceversion": "3c7c80cee2980b1099dd12b97d7bc24acf9490bb",
                "com.visualstudio.msazure.image.system.teamfoundationcollectionuri": "https://msazure.visualstudio.com/",
                "com.visualstudio.msazure.image.system.teamproject": "One",
                "image.base.digest": "sha256:e01de8a38d8a6ea1bc7212b4875b084bbb12dc3b7d93c570231a61887e04e5c8",
                "image.base.ref.name": "mcr.microsoft.com/cbl-mariner/busybox:1.35"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 22749663,
        "GraphDriver": {
            "Data": null,
            "Name": "overlayfs"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:fa7546a223a4f6cf426563f87ff81033a891a2a4048a99e987026ff7753440fe",
                "sha256:9f2106c783cbea22d069ff58967fa5891579bfe08f15af51f1d9d0e283c23715",
                "sha256:095655321eed6ff423681ae396f527e1c7af63296a56a73e7444002919bacad7",
                "sha256:d8a13698cfbcda3399b049841cd19985e40bf32e1dba1833aa221a11f63abd3b"
            ]
        },
        "Metadata": {
            "LastTagTime": "2024-09-18T21:24:56.720106714Z"
        }
    }
]
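
The comparison between the image architecture and the node architecture can also be scripted. This is a minimal sketch, assuming Docker is available locally, the image has already been pulled, and kubectl points at the cluster. The helper names image_arch, node_arch, and arch_matches are hypothetical; the kubernetes.io/arch node label and the --format option of docker image inspect are standard.

```shell
#!/bin/sh
# Print the architecture recorded in a locally pulled image.
image_arch() {
    docker image inspect --format '{{.Architecture}}' "$1"
}

# Print the value of the well-known kubernetes.io/arch label on a node.
node_arch() {
    kubectl get node "$1" -o jsonpath='{.metadata.labels.kubernetes\.io/arch}'
}

# Succeed only when the two architecture strings are identical.
arch_matches() {
    [ "$1" = "$2" ]
}

# Example usage (not executed here; names taken from the output above):
#   img="mcr.microsoft.com/application-lb/images/alb-controller-crds:1.2.3"
#   node="aks-userpool-22578432-vmss000000"
#   arch_matches "$(image_arch "$img")" "$(node_arch "$node")" \
#       || echo "image and node architectures differ"
```

Since the inspect output above reports amd64, the check would fail on any node labeled arm64, which is consistent with the exec format error.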
JackStromberg commented 1 month ago

Sorry, I misread ARM as Azure Resource Manager rather than ARM in the context of compute.

Unfortunately, AGC does not support ARM-based compute today. We will update our docs, and I have added this to our backlog as a future feature item.

Thank you for bubbling up!

a7ul commented 1 month ago

Thanks @JackStromberg. Is the source code for the controller publicly available? If possible, we could lend a hand and contribute to it.

JackStromberg commented 3 weeks ago

@a7ul, apologies for the slow response. I appreciate the offer, but ALB Controller is closed source, and we have no plans to make it open source at this time. The ask for supporting ARM architecture is certainly valid, and I've labeled it as an item needed for AGIC to AGC parity.

Thank you again for bubbling up!

please-close

pdeva commented 1 day ago

+1 for ARM support. This is the only workload on our cluster that requires us to spin up an x86 node.
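
Until ARM is supported, the workaround amounts to keeping an amd64 node pool and making sure the ALB Controller pods land on it. The describe output earlier in the thread shows Node-Selectors: <none>, so nothing stops the scheduler from placing the pod on an ARM node. Below is a rough sketch of pinning a Deployment to amd64 nodes with a node selector. It assumes the controller runs as a Deployment named alb-controller-bootstrap in azure-alb-system (inferred from the pod name, not confirmed), and a later Helm upgrade may overwrite a manual patch. arch_selector_patch and pin_to_arch are hypothetical helper names.

```shell
#!/bin/sh
# Build a merge patch that sets a kubernetes.io/arch node selector
# on a Deployment's pod template.
arch_selector_patch() {
    printf '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/arch":"%s"}}}}}' "$1"
}

# Apply the patch to a Deployment.
# Arguments: deployment name, namespace, architecture.
pin_to_arch() {
    kubectl patch deployment "$1" -n "$2" --type merge \
        -p "$(arch_selector_patch "$3")"
}

# Example usage (not executed here; the Deployment name is an assumption):
#   pin_to_arch alb-controller-bootstrap azure-alb-system amd64
```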