
# Welcome to the Azure Kubernetes Service on Azure Stack HCI repo

This is where the AKS-HCI team will track features and issues with AKS-HCI. We will monitor this repo in order to engage with our community and discuss questions, customer scenarios, or feature requests. Check out our projects tab to see the roadmap for AKS-HCI!

[BUG] Creating a new AKS Hybrid workload cluster with -EnableAzureRBAC fails during azure-arc-onboarding due to k8s version #348

Closed - eponerine closed this issue 9 months ago

eponerine commented 1 year ago

**Describe the bug**

While trying to deploy AKS Hybrid with the -enableAzureRBAC flag and its prerequisites, the last step of New-AksHciCluster hangs for 1800 seconds.


If you get the logs of that pod (obfuscating some GUIDs), you'll see it's hung up on Error from server (NotFound): namespaces "azure-arc" not found:

```
C:\Windows\System32> kubectl logs azure-arc-onboarding-p6jpw -n azure-arc-onboarding

Cluster "azure-arc-kubernetes-bootstrap" set.
User "azure-arc-kubernetes-bootstrap" set.
Context "azure-arc-kubernetes-bootstrap" created.
Switched to context "azure-arc-kubernetes-bootstrap".
Set kubecontext successfully
az cloud set -n AzureCloud
az login --service-principal --username 4dcd36f3-20ee-xxxx-xxxx-xxxxxxxxxxxxxxxx --password xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --tenant 40ed1e38-a16e-4622-xxxx-xxxxxxxxxxxxxxxxx
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "40ed1e38-a16e-4622-xxxx-xxxxxxxxxxxxxxxxx",
    "id": "58debcdd-1886-404c-b68f-xxxxxxxxxxxxxxxxx",
    "isDefault": true,
    "managedByTenants": [],
    "name": "Dev - AKS Hybrid",
    "state": "Enabled",
    "tenantId": "40ed1e38-a16e-4622-xxxx-xxxxxxxxxxxxxxxxx",
    "user": {
      "name": "4dcd36f3-20ee-xxxx-xxxx-xxxxxxxxxxxxxxxx",
      "type": "servicePrincipal"
    }
  }
]
az account set -s 58debcdd-1886-404c-b68f-xxxxxxxxxxxxxxxxx
Error: release: not found
Helm release absent; Need to connect
WARNING: The properties DISTRIBUTION, INFRASTRUCTURE, LOCATION will be ignored for update commands
Set env variables for COMMAND and ARGS
Error from server (NotFound): namespaces "azure-arc" not found
az connectedk8s connect -g AKS-HyBridEngg -n erniecluster02  --distribution aks_workload --infrastructure azure_stack_hci --location eastus --onboarding-timeout 1800
```

That namespace exists, so I'm unsure what the error is going on about:

```
C:\Windows\System32> kubectl get namespace
NAME                   STATUS   AGE
azure-arc              Active   19m
azure-arc-onboarding   Active   20m
default                Active   25m
kube-node-lease        Active   25m
kube-public            Active   25m
kube-system            Active   25m

C:\Windows\System32> kubectl get pods -n azure-arc
NAMESPACE     NAME                                             READY   STATUS    RESTARTS   AGE
azure-arc     cluster-metadata-operator-5d76c86878-d655l       2/2     Running   0          27m
azure-arc     clusterconnect-agent-b8bdcf5b-xbz99              3/3     Running   0          27m
azure-arc     clusteridentityoperator-574fd956f8-pfxtf         2/2     Running   0          27m
azure-arc     config-agent-b5447c5d8-r5d7j                     2/2     Running   0          27m
azure-arc     controller-manager-6766955447-h9rbw              2/2     Running   0          27m
azure-arc     extension-events-collector-5b8499ff74-5xtsp      2/2     Running   0          27m
azure-arc     extension-manager-56957c9b5-5sfdd                3/3     Running   0          27m
azure-arc     flux-logs-agent-55d79f94cd-wbrvg                 1/1     Running   0          27m
azure-arc     guard-755cc8dd58-vbpjd                           0/2     Pending   0          27m
azure-arc     kube-aad-proxy-589c8dc56b-q97dp                  2/2     Running   0          27m
azure-arc     metrics-agent-d87c65f5f-qkw9s                    2/2     Running   0          27m
azure-arc     resource-sync-agent-64fcff7969-vqjzv             2/2     Running   0          27m
```

But the pod guard-755cc8dd58-vbpjd sticks out, as it is stuck at 0/2 readiness, so let's describe it:

```
Node-Selectors:              kubernetes.io/os=linux
                             node-role.kubernetes.io/control-plane=
                             node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m44s  default-scheduler  0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
```

Well, that is a problem! Kubernetes renamed the master node role to control-plane a while ago. Node-Selector labels are "inclusive", meaning all labels must match (a logical AND, if you will): https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

The way this deployment is configured will never work, as a node won't carry both the control-plane and master labels at the same time (unless jumping between major version differences of k8s???).
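
To make the scheduling constraint concrete, here is roughly what the relevant part of the guard pod spec looks like, reconstructed from the describe output above (a sketch, not the verbatim manifest):

```yaml
# Sketch reconstructed from the Node-Selectors / Tolerations shown above.
# Every nodeSelector key must match (logical AND), so a 1.25+ control-plane node,
# which no longer carries the legacy master label, can never satisfy this selector.
spec:
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    node-role.kubernetes.io/master: ""        # label removed in Kubernetes 1.25
  tolerations:
    - key: node-role.kubernetes.io/master     # tolerates only the old taint,
      effect: NoSchedule                      # not node-role.kubernetes.io/control-plane
```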

I cannot personally confirm if this will progress now, because I think the job times out before I can manually edit things to "work", but this should be easy to reproduce.

**To Reproduce**

  1. Follow the steps and pre-reqs here to get your environment configured for Azure RBAC: https://learn.microsoft.com/en-us/azure/aks/hybrid/azure-rbac-aks-hybrid#create-the-server-app-and-client-app
  2. Deploy your cluster following the steps in the link above (Step 3, option A); a minimal sketch of the deployment command is shown after this list
  3. Watch it "hang" on the onboarding step
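
For reference, a minimal sketch of such a deployment command, with illustrative values for the cluster name and Kubernetes version (it assumes the Azure RBAC server/client apps and registration from step 1 are already in place):

```powershell
# Sketch only - parameter values are illustrative and your environment will differ.
# Assumes the Azure RBAC prerequisites from step 1 are already configured.
New-AksHciCluster -name erniecluster02 `
                  -kubernetesVersion "v1.25.5" `
                  -enableAzureRBAC
```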

**Expected behavior**

Azure Arc registration completes when the cluster is deployed with the -enableAzureRBAC flag and its configuration.

**Environment (please complete the following information):**

**Collect log files**

eponerine commented 1 year ago

This is definitely related to the version of Kubernetes the cluster is running.

Tested the deployment on 1.23 and there are no scheduling issues. Looking at the nodes, we can see that they are labeled both control-plane and master!

Someone has to update the azure-arc-onboarding stuff (namely, the guard deployment).

```
PS C:\Windows\system32> kubectl get nodes -o wide
NAME              STATUS     ROLES                  AGE    VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
moc-l595w5d9s4u   NotReady   <none>                 1s     v1.23.12   172.24.144.68   <none>        CBL-Mariner/Linux   5.15.102.1-3.cm2   containerd://1.6.18
moc-ljwkuwsej4z   Ready      control-plane,master   5m8s   v1.23.12   172.24.144.67   <none>        CBL-Mariner/Linux   5.15.102.1-3.cm2   containerd://1.6.18
moc-lqdv2ou1uqw   Ready      control-plane,master   10m    v1.23.12   172.24.144.66   <none>        CBL-Mariner/Linux   5.15.102.1-3.cm2   containerd://1.6.18
```

Elektronenvolt commented 1 year ago

Hi @eponerine

I'm using Azure AD RBAC as well, same deployment option. The clusters are at Kubernetes 1.24.11, the control planes are labeled 'control-plane,master', and everything works fine. It starts to fail with Kubernetes 1.25 because Arc onboarding will not finish - the old master label is missing on those nodes - correct?
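
For reference, a quick way to check which role labels the nodes actually carry is a plain kubectl query (generic, not specific to AKS Hybrid):

```powershell
# Show each node with the values of the two role labels as extra columns.
kubectl get nodes -L node-role.kubernetes.io/control-plane -L node-role.kubernetes.io/master
```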

eponerine commented 1 year ago

Yup, bingo.

The product group (PG) is aware of this. At least, I spammed them via email a few weeks ago.

My suggestion is to open a ticket through the Azure Portal about this issue so it gets engineering eyes on it.

Elektronenvolt commented 1 year ago

Ok, thanks. I'll stay at Kubernetes 1.24.* until this is fixed.

Elektronenvolt commented 11 months ago

Tested it again with the latest AKS-hybrid-2307 release. Not solved yet.

Using -kubernetesVersion "v1.24.11" on all clusters with Azure RBAC enabled.
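
If you want to see which Kubernetes versions the installed AKS hybrid release offers before pinning one, the AksHci PowerShell module has a query for that (a sketch; the output format may differ between releases):

```powershell
# List the Kubernetes versions available in the installed AKS hybrid release,
# so a 1.24.* build can be chosen while the guard scheduling issue is unresolved.
Get-AksHciKubernetesVersion
```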

eponerine commented 11 months ago

@Elektronenvolt - can you open a ticket through the Azure Portal describing this issue? And once done, pass the ticket number along to me?

I never got around to opening one, and the PG said they need one to move forward. My environment is currently at capacity, so I can't test this right now.

Elektronenvolt commented 11 months ago

@eponerine - created a support case and added you to a conversation with the PG.

eponerine commented 11 months ago

@Elektronenvolt - thanks. I pinged the same folks (plus more) on my other email thread referencing this ticket. Hopefully there is movement.

Elektronenvolt commented 11 months ago

I got a workaround from support and tested it now - the following works with Kubernetes 1.25.*:

  1. Create a new cluster with -EnableAzureRBAC, wait for the error message and connect by certificate

  2. Get the guard deployment: kubectl get deployment guard -n azure-arc -o yaml > guard.yaml

  3. Edit the guard deployment - remove the nodeSelector entry node-role.kubernetes.io/master: "" and add a toleration for key: node-role.kubernetes.io/control-plane (see the YAML sketch after this list)

  4. Delete the guard deployment: kubectl delete deployment guard -n azure-arc

  5. Deploy it from the modified yaml: kubectl create -f .\guard.yaml

  6. Generate an RBAC config and connect: Get-AksHciCredential -name <cluster> -aadauth
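
For reference, the scheduling-related part of the edited guard.yaml ends up looking roughly like this (a sketch based on the steps above, not the verbatim file from support):

```yaml
# Relevant fragment of the patched guard deployment (guard.yaml).
# The master nodeSelector entry is removed and a control-plane toleration is added.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
```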

Elektronenvolt commented 9 months ago

@eponerine - this seems to be fixed in Azure Arc. I created a cluster with Kubernetes 1.26.3 an hour ago and the Azure AD RBAC onboarding worked fine. I've exported the guard deployment and compared it to a modified deployment:

Left side: from the new cluster. Right side: the applied workaround from an older cluster.

eponerine commented 9 months ago

Interesting. I'll probably be deploying a new cluster soon and will test it with the latest k8s version.

Elektronenvolt commented 9 months ago

It looks like it's fixed. The guard component has a higher version and includes the changes I received from support as a workaround.
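
If anyone wants to check which guard build their cluster is running, the image tag can be read straight off the deployment (a generic kubectl query, not an official verification step):

```powershell
# Print the container image(s), including tag, used by the guard deployment.
kubectl get deployment guard -n azure-arc -o jsonpath="{.spec.template.spec.containers[*].image}"
```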

eponerine commented 9 months ago

Marking as closed.

@Elektronenvolt - what program do you use for diff'ing text files? I love the comparison "flow" UI there. I've been using BeyondCompare for years, but willing to try something new :D

Elektronenvolt commented 9 months ago

Meld: https://meldmerge.org/