LM-Development / aks-sample

Community project providing an undeprecated Microsoft Teams bot sample that runs on Azure Kubernetes Service
https://github.com/LM-Development/aks-sample/tree/main/Samples/PublicSamples/RecordingBot
MIT License

cert bot installation failing #33

Closed: osamabinsaleem closed this issue 3 months ago

osamabinsaleem commented 4 months ago

When I run the install.bat script provided here: https://github.com/LM-Development/aks-sample/tree/main/Samples/PublicSamples/RecordingBot/deploy/cert-manager I get the following error: [screenshot of the error attached]

I've also tried increasing the timeout to give the cluster more time to start the pods, like this: kubectl wait pod -n cert-manager --for condition=ready --timeout=300s --all

But I still get the same error.

1fabi0 commented 4 months ago

That's interesting. Can you check the health of cert-manager on the AKS cluster? It seems like cert-manager fails to launch after installation.

osamabinsaleem commented 4 months ago

@1fabi0 I've run a couple of commands for this, and it looks like the cert-manager pods are still Pending:

kubectl get pods --namespace cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-8db45d64b-k2cfr               0/1     Pending   0          164m
cert-manager-cainjector-5c8d6f6646-pkmgq   0/1     Pending   0          164m
cert-manager-startupapicheck-rmdbm         0/1     Pending   0          164m
cert-manager-webhook-7c7d969c76-8n9wq      0/1     Pending   0          164m

and also

kubectl describe pods --namespace cert-manager
Name:             cert-manager-8db45d64b-k2cfr
Namespace:        cert-manager
Priority:         0
Service Account:  cert-manager
Node:             <none>
Labels:           app=cert-manager
                  app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=cert-manager
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=cert-manager
                  app.kubernetes.io/version=v1.13.3
                  helm.sh/chart=cert-manager-v1.13.3
                  pod-template-hash=8db45d64b
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9402
                  prometheus.io/scrape: true
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cert-manager-8db45d64b
Containers:
  cert-manager-controller:
    Image:       quay.io/jetstack/cert-manager-controller:v1.13.3
    Ports:       9402/TCP, 9403/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --v=2
      --cluster-resource-namespace=$(POD_NAMESPACE)
      --leader-election-namespace=kube-system
      --acme-http01-solver-image=quay.io/jetstack/cert-manager-acmesolver:v1.13.3
      --max-concurrent-challenges=60
    Environment:
      POD_NAMESPACE:  cert-manager (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-smzkj (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-smzkj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   14m (x30 over 159m)    default-scheduler   0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  9m8s (x745 over 165m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector
  Normal   NotTriggerScaleUp  4m5s (x192 over 165m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}

Name:             cert-manager-cainjector-5c8d6f6646-pkmgq
Namespace:        cert-manager
Priority:         0
Service Account:  cert-manager-cainjector
Node:             <none>
Labels:           app=cainjector
                  app.kubernetes.io/component=cainjector
                  app.kubernetes.io/instance=cert-manager
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=cainjector
                  app.kubernetes.io/version=v1.13.3
                  helm.sh/chart=cert-manager-v1.13.3
                  pod-template-hash=5c8d6f6646
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cert-manager-cainjector-5c8d6f6646
Containers:
  cert-manager-cainjector:
    Image:      quay.io/jetstack/cert-manager-cainjector:v1.13.3
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=2
      --leader-election-namespace=kube-system
    Environment:
      POD_NAMESPACE:  cert-manager (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5lkw8 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-5lkw8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   14m (x30 over 159m)    default-scheduler   0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  9m9s (x737 over 165m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector
  Normal   NotTriggerScaleUp  4m6s (x201 over 165m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}

Name:             cert-manager-startupapicheck-rmdbm
Namespace:        cert-manager
Priority:         0
Service Account:  cert-manager-startupapicheck
Node:             <none>
Labels:           app=startupapicheck
                  app.kubernetes.io/component=startupapicheck
                  app.kubernetes.io/instance=cert-manager
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=startupapicheck
                  app.kubernetes.io/version=v1.13.3
                  batch.kubernetes.io/controller-uid=ac0949e4-7d18-4b9c-9b02-2f865a211c13
                  batch.kubernetes.io/job-name=cert-manager-startupapicheck
                  controller-uid=ac0949e4-7d18-4b9c-9b02-2f865a211c13
                  helm.sh/chart=cert-manager-v1.13.3
                  job-name=cert-manager-startupapicheck
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    Job/cert-manager-startupapicheck
Containers:
  cert-manager-startupapicheck:
    Image:      quay.io/jetstack/cert-manager-ctl:v1.13.3
    Port:       <none>
    Host Port:  <none>
    Args:
      check
      api
      --wait=1m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-82fc6 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-82fc6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   14m (x30 over 159m)     default-scheduler   0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  49m (x141 over 165m)    cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}
  Normal   NotTriggerScaleUp  3m56s (x760 over 165m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector

Name:             cert-manager-webhook-7c7d969c76-8n9wq
Namespace:        cert-manager
Priority:         0
Service Account:  cert-manager-webhook
Node:             <none>
Labels:           app=webhook
                  app.kubernetes.io/component=webhook
                  app.kubernetes.io/instance=cert-manager
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=webhook
                  app.kubernetes.io/version=v1.13.3
                  helm.sh/chart=cert-manager-v1.13.3
                  pod-template-hash=7c7d969c76
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    ReplicaSet/cert-manager-webhook-7c7d969c76
Containers:
  cert-manager-webhook:
    Image:       quay.io/jetstack/cert-manager-webhook:v1.13.3
    Ports:       10250/TCP, 6080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --v=2
      --secure-port=10250
      --dynamic-serving-ca-secret-namespace=$(POD_NAMESPACE)
      --dynamic-serving-ca-secret-name=cert-manager-webhook-ca
      --dynamic-serving-dns-names=cert-manager-webhook
      --dynamic-serving-dns-names=cert-manager-webhook.$(POD_NAMESPACE)
      --dynamic-serving-dns-names=cert-manager-webhook.$(POD_NAMESPACE).svc
    Liveness:   http-get http://:6080/livez delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:6080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:  cert-manager (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2k88z (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-2k88z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   14m (x30 over 159m)    default-scheduler   0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  19m (x177 over 165m)   cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}
  Normal   NotTriggerScaleUp  4m7s (x764 over 165m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector

1fabi0 commented 4 months ago

It seems like you are running on a single Linux-based system node. Either scale this up or see this Stack Overflow question.
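
For example, scaling an existing node pool with the az CLI could look roughly like this (the resource group, cluster, and node pool names are placeholders, not values from your setup):

az aks nodepool scale --resource-group <resource-group> --cluster-name <aks-cluster> --name <nodepool-name> --node-count 2

That only adds capacity for the scheduler to place the cert-manager pods on; it does not change any taints on the pool.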

Maybe I should mention here that cert-manager and ingress-nginx run on the Linux nodes of the Kubernetes cluster.

osamabinsaleem commented 4 months ago

I believe I have two nodes.

[Screenshots of the node pools, 2024-03-06, 7:07:57 PM and 7:08:54 PM]

P.S. I changed the node size to a smaller machine while creating the cluster. Can that have an effect?

1fabi0 commented 4 months ago

Ok, that's good. Are you sure all nodes are running? 🤔 You can check how your nodes are tagged etc. with kubectl get nodes and kubectl describe node xxxxx

1fabi0 commented 4 months ago

Also, the taints on your node pool seem to be the problem. Do you know why the nodes have the taint CriticalAddonsOnly=true with effect NoSchedule?
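
If you want to double check, listing the taints per node works with something like this (plain kubectl, only the output columns are picked here):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

A pool tainted with CriticalAddonsOnly=true:NoSchedule only accepts pods that explicitly tolerate that taint, which matches the FailedScheduling events in your describe output.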

osamabinsaleem commented 4 months ago

I believe the nodes are running:

kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-agentpool-25023989-vmss000002   Ready    agent   5h31m   v1.27.9
aks-agentpool-25023989-vmss000003   Ready    agent   5h31m   v1.27.9
aksscale000001                      Ready    agent   5h29m   v1.27.9

osamabinsaleem commented 4 months ago

@1fabi0 I'm not sure how the node pool got those taints. I mostly selected the default values while creating the cluster. Should I create another one with a separate config?

1fabi0 commented 4 months ago

No, you don't need to create a new node pool; I think you can untaint the existing one. I think this az command will do the trick.
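
For reference, removing the taints from a node pool with the az CLI should look roughly like this (resource group, cluster, and node pool names are placeholders):

az aks nodepool update --resource-group <resource-group> --cluster-name <aks-cluster> --name <nodepool-name> --node-taints ""

Passing an empty string to --node-taints clears all taints on the pool.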

osamabinsaleem commented 3 months ago

I untainted the node pool and now I see this (screenshot attached). I believe it's fixed.