osamabinsaleem closed this issue 3 months ago
That's interesting. Can you check the health of cert-manager on the AKS cluster? It seems like cert-manager fails to launch after installation.
@1fabi0 I've executed a couple of commands for this, and it looks like the cert-manager pods are still Pending:
kubectl get pods --namespace cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-8db45d64b-k2cfr 0/1 Pending 0 164m
cert-manager-cainjector-5c8d6f6646-pkmgq 0/1 Pending 0 164m
cert-manager-startupapicheck-rmdbm 0/1 Pending 0 164m
cert-manager-webhook-7c7d969c76-8n9wq 0/1 Pending 0 164m
and also:
kubectl describe pods --namespace cert-manager
Name: cert-manager-8db45d64b-k2cfr
Namespace: cert-manager
Priority: 0
Service Account: cert-manager
Node: <none>
Labels: app=cert-manager
app.kubernetes.io/component=controller
app.kubernetes.io/instance=cert-manager
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=cert-manager
app.kubernetes.io/version=v1.13.3
helm.sh/chart=cert-manager-v1.13.3
pod-template-hash=8db45d64b
Annotations: prometheus.io/path: /metrics
prometheus.io/port: 9402
prometheus.io/scrape: true
Status: Pending
SeccompProfile: RuntimeDefault
IP:
IPs: <none>
Controlled By: ReplicaSet/cert-manager-8db45d64b
Containers:
cert-manager-controller:
Image: quay.io/jetstack/cert-manager-controller:v1.13.3
Ports: 9402/TCP, 9403/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--v=2
--cluster-resource-namespace=$(POD_NAMESPACE)
--leader-election-namespace=kube-system
--acme-http01-solver-image=quay.io/jetstack/cert-manager-acmesolver:v1.13.3
--max-concurrent-challenges=60
Environment:
POD_NAMESPACE: cert-manager (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-smzkj (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-smzkj:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14m (x30 over 159m) default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 9m8s (x745 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector
Normal NotTriggerScaleUp 4m5s (x192 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}
Name: cert-manager-cainjector-5c8d6f6646-pkmgq
Namespace: cert-manager
Priority: 0
Service Account: cert-manager-cainjector
Node: <none>
Labels: app=cainjector
app.kubernetes.io/component=cainjector
app.kubernetes.io/instance=cert-manager
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=cainjector
app.kubernetes.io/version=v1.13.3
helm.sh/chart=cert-manager-v1.13.3
pod-template-hash=5c8d6f6646
Annotations: <none>
Status: Pending
SeccompProfile: RuntimeDefault
IP:
IPs: <none>
Controlled By: ReplicaSet/cert-manager-cainjector-5c8d6f6646
Containers:
cert-manager-cainjector:
Image: quay.io/jetstack/cert-manager-cainjector:v1.13.3
Port: <none>
Host Port: <none>
Args:
--v=2
--leader-election-namespace=kube-system
Environment:
POD_NAMESPACE: cert-manager (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5lkw8 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-5lkw8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14m (x30 over 159m) default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 9m9s (x737 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector
Normal NotTriggerScaleUp 4m6s (x201 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}
Name: cert-manager-startupapicheck-rmdbm
Namespace: cert-manager
Priority: 0
Service Account: cert-manager-startupapicheck
Node: <none>
Labels: app=startupapicheck
app.kubernetes.io/component=startupapicheck
app.kubernetes.io/instance=cert-manager
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=startupapicheck
app.kubernetes.io/version=v1.13.3
batch.kubernetes.io/controller-uid=ac0949e4-7d18-4b9c-9b02-2f865a211c13
batch.kubernetes.io/job-name=cert-manager-startupapicheck
controller-uid=ac0949e4-7d18-4b9c-9b02-2f865a211c13
helm.sh/chart=cert-manager-v1.13.3
job-name=cert-manager-startupapicheck
Annotations: <none>
Status: Pending
SeccompProfile: RuntimeDefault
IP:
IPs: <none>
Controlled By: Job/cert-manager-startupapicheck
Containers:
cert-manager-startupapicheck:
Image: quay.io/jetstack/cert-manager-ctl:v1.13.3
Port: <none>
Host Port: <none>
Args:
check
api
--wait=1m
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-82fc6 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-82fc6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14m (x30 over 159m) default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 49m (x141 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}
Normal NotTriggerScaleUp 3m56s (x760 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector
Name: cert-manager-webhook-7c7d969c76-8n9wq
Namespace: cert-manager
Priority: 0
Service Account: cert-manager-webhook
Node: <none>
Labels: app=webhook
app.kubernetes.io/component=webhook
app.kubernetes.io/instance=cert-manager
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=webhook
app.kubernetes.io/version=v1.13.3
helm.sh/chart=cert-manager-v1.13.3
pod-template-hash=7c7d969c76
Annotations: <none>
Status: Pending
SeccompProfile: RuntimeDefault
IP:
IPs: <none>
Controlled By: ReplicaSet/cert-manager-webhook-7c7d969c76
Containers:
cert-manager-webhook:
Image: quay.io/jetstack/cert-manager-webhook:v1.13.3
Ports: 10250/TCP, 6080/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--v=2
--secure-port=10250
--dynamic-serving-ca-secret-namespace=$(POD_NAMESPACE)
--dynamic-serving-ca-secret-name=cert-manager-webhook-ca
--dynamic-serving-dns-names=cert-manager-webhook
--dynamic-serving-dns-names=cert-manager-webhook.$(POD_NAMESPACE)
--dynamic-serving-dns-names=cert-manager-webhook.$(POD_NAMESPACE).svc
Liveness: http-get http://:6080/livez delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:6080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
Environment:
POD_NAMESPACE: cert-manager (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2k88z (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-2k88z:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14m (x30 over 159m) default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..
Normal NotTriggerScaleUp 19m (x177 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {CriticalAddonsOnly: true}
Normal NotTriggerScaleUp 4m7s (x764 over 165m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had untolerated taint {CriticalAddonsOnly: true}, 1 node(s) didn't match Pod's node affinity/selector
It seems like you are running on a single Linux-based system node. Either scale this up, or see this Stack Overflow question.
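If scaling up isn't an option, the cert-manager Helm chart also exposes toleration values for each component, so you could let the pods tolerate that taint instead. This is only a rough sketch (untested, and the values file name is just an example):

tolerate-system-pool.yaml:
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
    effect: NoSchedule
webhook:
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule
cainjector:
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule
startupapicheck:
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
      effect: NoSchedule

helm upgrade --install cert-manager jetstack/cert-manager --namespace cert-manager -f tolerate-system-pool.yaml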
Maybe I should mention here that cert-manager and ingress-nginx run on the Linux nodes of the Kubernetes cluster.
I believe I have two nodes.
P.S. I changed the node size to a smaller machine while creating the cluster. Can that have an effect?
OK, that's good. Are you sure all nodes are running? 🤔 You can check how your nodes are tagged etc. with kubectl get nodes
and kubectl describe node xxxxx
Also, the taints on your nodepool seem to be the problem. Do you know why the nodes have the taints CriticalAddonsOnly=true
and NoSchedule?
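For example, something like this should print the taints for every node in one go (assuming a reasonably recent kubectl):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints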
I believe the nodes are running:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-agentpool-25023989-vmss000002 Ready agent 5h31m v1.27.9
aks-agentpool-25023989-vmss000003 Ready agent 5h31m v1.27.9
aksscale000001 Ready agent 5h29m v1.27.9
@1fabi0 I'm not sure how the nodepool got those taints. I mostly selected the default values while creating the cluster. Should I create another one with a separate config?
No, you don't need to create a new nodepool; I think you can just untaint the existing one. I think this az command will do the trick.
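If I remember the syntax correctly, it's roughly this, with your own resource group, cluster, and nodepool names filled in (passing an empty string clears all taints on the pool; this needs a reasonably recent Azure CLI):

az aks nodepool update --resource-group <your-rg> --cluster-name <your-cluster> --name <your-nodepool> --node-taints ""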
I untainted the nodepool, and now I see this. I believe it's fixed.
When I run the install.bat script provided here: https://github.com/LM-Development/aks-sample/tree/main/Samples/PublicSamples/RecordingBot/deploy/cert-manager I get the following error:
I've also tried increasing the timeout period to wait longer for the cluster to start the pods, like this:
kubectl wait pod -n cert-manager --for condition=ready --timeout=300s --all
But I still get the same error.
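One thing worth noting: --all in the cert-manager namespace also matches the pod of the cert-manager-startupapicheck Job if it is still around, and a completed pod never reports Ready, so the wait can time out even when the deployments themselves are healthy. As a sketch, waiting on the deployments instead sidesteps that:

kubectl wait deployment -n cert-manager --for condition=Available --timeout=300s --all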