aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

network fails to start with 64vcpu, 96vcpu instances on larger clusters #1936

Closed matti closed 2 years ago

matti commented 2 years ago

What happened:

Launched 12,000 pods, which scaled the cluster to ~600 instances. The first nodes started okay, but after the Kubernetes control plane started upgrading (connectivity issues), only some of the new nodes (at times none) registered with EKS, and even registered nodes failed to start networking. Deleting all not-ready nodes did not help: the replacement nodes also failed to start networking.

Attach logs

eksctl-created IPv6 Kubernetes 1.21 cluster (see YAMLs at the end)

instance types: c5.4xlarge, c5a.8xlarge, ... (so not #1935)

In the kubelet logs we can see that the node did register but failed to initialize networking. The kubelet logs also show a lot of timeouts:

Error updating node status, will retry" err="error getting node \"ip-192-168-91-143.eu-north-1.compute.internal\": Get \"https://5360904D5FF2A149DD9A49F4160B8231.yl4.eu-north-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-91-143.eu-north-1.compute.internal?resourceVersion=0&timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

During this time the Kubernetes control plane was temporarily unavailable (presumably scaling up due to 500+ nodes and 7000+ pods?).

Also, docker ps shows no containers.

full kubelet logs:

journalctl -u kubelet | cut -d: -f4- | cut -d' ' -f3-
--
Kubernetes Kubelet...
Kubernetes Kubelet.
--cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
--network-plugin has been deprecated, will be removed along with dockershim.
21:34:42.534649    4868 flags.go:59] FLAG: --add-dir-header="false"
21:34:42.534674    4868 flags.go:59] FLAG: --address="0.0.0.0"
21:34:42.534678    4868 flags.go:59] FLAG: --allowed-unsafe-sysctls="[]"
21:34:42.534687    4868 flags.go:59] FLAG: --alsologtostderr="false"
21:34:42.534690    4868 flags.go:59] FLAG: --anonymous-auth="true"
21:34:42.534693    4868 flags.go:59] FLAG: --application-metrics-count-limit="100"
21:34:42.534696    4868 flags.go:59] FLAG: --authentication-token-webhook="false"
21:34:42.534699    4868 flags.go:59] FLAG: --authentication-token-webhook-cache-ttl="2m0s"
21:34:42.534703    4868 flags.go:59] FLAG: --authorization-mode="AlwaysAllow"
21:34:42.534706    4868 flags.go:59] FLAG: --authorization-webhook-cache-authorized-ttl="5m0s"
21:34:42.534709    4868 flags.go:59] FLAG: --authorization-webhook-cache-unauthorized-ttl="30s"
21:34:42.534712    4868 flags.go:59] FLAG: --azure-container-registry-config=""
21:34:42.534715    4868 flags.go:59] FLAG: --boot-id-file="/proc/sys/kernel/random/boot_id"
21:34:42.534718    4868 flags.go:59] FLAG: --bootstrap-kubeconfig=""
21:34:42.534720    4868 flags.go:59] FLAG: --cert-dir="/var/lib/kubelet/pki"
21:34:42.534723    4868 flags.go:59] FLAG: --cgroup-driver="cgroupfs"
21:34:42.534728    4868 flags.go:59] FLAG: --cgroup-root=""
21:34:42.534731    4868 flags.go:59] FLAG: --cgroups-per-qos="true"
21:34:42.534734    4868 flags.go:59] FLAG: --chaos-chance="0"
21:34:42.534739    4868 flags.go:59] FLAG: --client-ca-file=""
21:34:42.534741    4868 flags.go:59] FLAG: --cloud-config=""
21:34:42.534743    4868 flags.go:59] FLAG: --cloud-provider="aws"
21:34:42.534746    4868 flags.go:59] FLAG: --cluster-dns="[]"
21:34:42.534751    4868 flags.go:59] FLAG: --cluster-domain=""
21:34:42.534754    4868 flags.go:59] FLAG: --cni-bin-dir="/opt/cni/bin"
21:34:42.534757    4868 flags.go:59] FLAG: --cni-cache-dir="/var/lib/cni/cache"
21:34:42.534760    4868 flags.go:59] FLAG: --cni-conf-dir="/etc/cni/net.d"
21:34:42.534763    4868 flags.go:59] FLAG: --config="/etc/kubernetes/kubelet/kubelet-config.json"
21:34:42.534766    4868 flags.go:59] FLAG: --container-hints="/etc/cadvisor/container_hints.json"
21:34:42.534769    4868 flags.go:59] FLAG: --container-log-max-files="5"
21:34:42.534774    4868 flags.go:59] FLAG: --container-log-max-size="10Mi"
21:34:42.534777    4868 flags.go:59] FLAG: --container-runtime="docker"
21:34:42.534780    4868 flags.go:59] FLAG: --container-runtime-endpoint="unix:///var/run/dockershim.sock"
21:34:42.534783    4868 flags.go:59] FLAG: --containerd="/run/containerd/containerd.sock"
21:34:42.534786    4868 flags.go:59] FLAG: --containerd-namespace="k8s.io"
21:34:42.534791    4868 flags.go:59] FLAG: --contention-profiling="false"
21:34:42.534794    4868 flags.go:59] FLAG: --cpu-cfs-quota="true"
21:34:42.534797    4868 flags.go:59] FLAG: --cpu-cfs-quota-period="100ms"
21:34:42.534800    4868 flags.go:59] FLAG: --cpu-manager-policy="none"
21:34:42.534803    4868 flags.go:59] FLAG: --cpu-manager-reconcile-period="10s"
21:34:42.534806    4868 flags.go:59] FLAG: --docker="unix:///var/run/docker.sock"
21:34:42.534809    4868 flags.go:59] FLAG: --docker-endpoint="unix:///var/run/docker.sock"
21:34:42.534812    4868 flags.go:59] FLAG: --docker-env-metadata-whitelist=""
21:34:42.534814    4868 flags.go:59] FLAG: --docker-only="false"
21:34:42.534817    4868 flags.go:59] FLAG: --docker-root="/var/lib/docker"
21:34:42.534820    4868 flags.go:59] FLAG: --docker-tls="false"
21:34:42.534823    4868 flags.go:59] FLAG: --docker-tls-ca="ca.pem"
21:34:42.534825    4868 flags.go:59] FLAG: --docker-tls-cert="cert.pem"
21:34:42.534828    4868 flags.go:59] FLAG: --docker-tls-key="key.pem"
21:34:42.534831    4868 flags.go:59] FLAG: --dynamic-config-dir=""
21:34:42.534834    4868 flags.go:59] FLAG: --enable-controller-attach-detach="true"
21:34:42.534837    4868 flags.go:59] FLAG: --enable-debugging-handlers="true"
21:34:42.534840    4868 flags.go:59] FLAG: --enable-load-reader="false"
21:34:42.534842    4868 flags.go:59] FLAG: --enable-server="true"
21:34:42.534845    4868 flags.go:59] FLAG: --enforce-node-allocatable="[pods]"
21:34:42.534853    4868 flags.go:59] FLAG: --event-burst="10"
21:34:42.534855    4868 flags.go:59] FLAG: --event-qps="5"
21:34:42.534864    4868 flags.go:59] FLAG: --event-storage-age-limit="default=0"
21:34:42.534866    4868 flags.go:59] FLAG: --event-storage-event-limit="default=0"
21:34:42.534869    4868 flags.go:59] FLAG: --eviction-hard="imagefs.available<15%,memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"
21:34:42.534878    4868 flags.go:59] FLAG: --eviction-max-pod-grace-period="0"
21:34:42.534881    4868 flags.go:59] FLAG: --eviction-minimum-reclaim=""
21:34:42.534885    4868 flags.go:59] FLAG: --eviction-pressure-transition-period="5m0s"
21:34:42.534888    4868 flags.go:59] FLAG: --eviction-soft=""
21:34:42.534891    4868 flags.go:59] FLAG: --eviction-soft-grace-period=""
21:34:42.534894    4868 flags.go:59] FLAG: --exit-on-lock-contention="false"
21:34:42.534896    4868 flags.go:59] FLAG: --experimental-allocatable-ignore-eviction="false"
21:34:42.534899    4868 flags.go:59] FLAG: --experimental-bootstrap-kubeconfig=""
21:34:42.534902    4868 flags.go:59] FLAG: --experimental-check-node-capabilities-before-mount="false"
21:34:42.534905    4868 flags.go:59] FLAG: --experimental-dockershim-root-directory="/var/lib/dockershim"
21:34:42.534908    4868 flags.go:59] FLAG: --experimental-kernel-memcg-notification="false"
21:34:42.534910    4868 flags.go:59] FLAG: --experimental-logging-sanitization="false"
21:34:42.534913    4868 flags.go:59] FLAG: --experimental-mounter-path=""
21:34:42.534918    4868 flags.go:59] FLAG: --fail-swap-on="true"
21:34:42.534921    4868 flags.go:59] FLAG: --feature-gates=""
21:34:42.534925    4868 flags.go:59] FLAG: --file-check-frequency="20s"
21:34:42.534927    4868 flags.go:59] FLAG: --global-housekeeping-interval="1m0s"
21:34:42.534930    4868 flags.go:59] FLAG: --hairpin-mode="promiscuous-bridge"
21:34:42.534933    4868 flags.go:59] FLAG: --healthz-bind-address="127.0.0.1"
21:34:42.534935    4868 flags.go:59] FLAG: --healthz-port="10248"
21:34:42.534938    4868 flags.go:59] FLAG: --help="false"
21:34:42.534941    4868 flags.go:59] FLAG: --hostname-override=""
21:34:42.534943    4868 flags.go:59] FLAG: --housekeeping-interval="10s"
21:34:42.534946    4868 flags.go:59] FLAG: --http-check-frequency="20s"
21:34:42.534949    4868 flags.go:59] FLAG: --image-credential-provider-bin-dir=""
21:34:42.534952    4868 flags.go:59] FLAG: --image-credential-provider-config=""
21:34:42.534954    4868 flags.go:59] FLAG: --image-gc-high-threshold="85"
21:34:42.534957    4868 flags.go:59] FLAG: --image-gc-low-threshold="80"
21:34:42.534959    4868 flags.go:59] FLAG: --image-pull-progress-deadline="1m0s"
21:34:42.534967    4868 flags.go:59] FLAG: --image-service-endpoint=""
21:34:42.534970    4868 flags.go:59] FLAG: --iptables-drop-bit="15"
21:34:42.534972    4868 flags.go:59] FLAG: --iptables-masquerade-bit="14"
21:34:42.534975    4868 flags.go:59] FLAG: --keep-terminated-pod-volumes="false"
21:34:42.534980    4868 flags.go:59] FLAG: --kernel-memcg-notification="false"
21:34:42.534983    4868 flags.go:59] FLAG: --kube-api-burst="10"
21:34:42.534985    4868 flags.go:59] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
21:34:42.534989    4868 flags.go:59] FLAG: --kube-api-qps="5"
21:34:42.534991    4868 flags.go:59] FLAG: --kube-reserved=""
21:34:42.534994    4868 flags.go:59] FLAG: --kube-reserved-cgroup=""
21:34:42.534997    4868 flags.go:59] FLAG: --kubeconfig="/var/lib/kubelet/kubeconfig"
21:34:42.535000    4868 flags.go:59] FLAG: --kubelet-cgroups=""
21:34:42.535002    4868 flags.go:59] FLAG: --lock-file=""
21:34:42.535005    4868 flags.go:59] FLAG: --log-backtrace-at=":0"
21:34:42.535008    4868 flags.go:59] FLAG: --log-cadvisor-usage="false"
21:34:42.535010    4868 flags.go:59] FLAG: --log-dir=""
21:34:42.535013    4868 flags.go:59] FLAG: --log-file=""
21:34:42.535016    4868 flags.go:59] FLAG: --log-file-max-size="1800"
21:34:42.535019    4868 flags.go:59] FLAG: --log-flush-frequency="5s"
21:34:42.535022    4868 flags.go:59] FLAG: --logging-format="text"
21:34:42.535024    4868 flags.go:59] FLAG: --logtostderr="true"
21:34:42.535027    4868 flags.go:59] FLAG: --machine-id-file="/etc/machine-id,/var/lib/dbus/machine-id"
21:34:42.535030    4868 flags.go:59] FLAG: --make-iptables-util-chains="true"
21:34:42.535035    4868 flags.go:59] FLAG: --manifest-url=""
21:34:42.535038    4868 flags.go:59] FLAG: --manifest-url-header=""
21:34:42.535042    4868 flags.go:59] FLAG: --master-service-namespace="default"
21:34:42.535045    4868 flags.go:59] FLAG: --max-open-files="1000000"
21:34:42.535049    4868 flags.go:59] FLAG: --max-pods="110"
21:34:42.535051    4868 flags.go:59] FLAG: --maximum-dead-containers="-1"
21:34:42.535054    4868 flags.go:59] FLAG: --maximum-dead-containers-per-container="1"
21:34:42.535057    4868 flags.go:59] FLAG: --memory-manager-policy="None"
21:34:42.535059    4868 flags.go:59] FLAG: --minimum-container-ttl-duration="0s"
21:34:42.535066    4868 flags.go:59] FLAG: --minimum-image-ttl-duration="2m0s"
21:34:42.535069    4868 flags.go:59] FLAG: --network-plugin="cni"
21:34:42.535072    4868 flags.go:59] FLAG: --network-plugin-mtu="0"
21:34:42.535074    4868 flags.go:59] FLAG: --node-ip="2a05:d016:136:e02:0:0:0:a4c2"
21:34:42.535077    4868 flags.go:59] FLAG: --node-labels=""
21:34:42.535081    4868 flags.go:59] FLAG: --node-status-max-images="50"
21:34:42.535083    4868 flags.go:59] FLAG: --node-status-update-frequency="10s"
21:34:42.535086    4868 flags.go:59] FLAG: --non-masquerade-cidr="10.0.0.0/8"
21:34:42.535089    4868 flags.go:59] FLAG: --one-output="false"
21:34:42.535092    4868 flags.go:59] FLAG: --oom-score-adj="-999"
21:34:42.535095    4868 flags.go:59] FLAG: --pod-cidr=""
21:34:42.535099    4868 flags.go:59] FLAG: --pod-infra-container-image="602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1"
21:34:42.535103    4868 flags.go:59] FLAG: --pod-manifest-path=""
21:34:42.535105    4868 flags.go:59] FLAG: --pod-max-pids="-1"
21:34:42.535108    4868 flags.go:59] FLAG: --pods-per-core="0"
21:34:42.535111    4868 flags.go:59] FLAG: --port="10250"
21:34:42.535114    4868 flags.go:59] FLAG: --protect-kernel-defaults="false"
21:34:42.535117    4868 flags.go:59] FLAG: --provider-id=""
21:34:42.535120    4868 flags.go:59] FLAG: --qos-reserved=""
21:34:42.535123    4868 flags.go:59] FLAG: --read-only-port="10255"
21:34:42.535126    4868 flags.go:59] FLAG: --really-crash-for-testing="false"
21:34:42.535128    4868 flags.go:59] FLAG: --redirect-container-streaming="false"
21:34:42.535131    4868 flags.go:59] FLAG: --register-node="true"
21:34:42.535133    4868 flags.go:59] FLAG: --register-schedulable="true"
21:34:42.535136    4868 flags.go:59] FLAG: --register-with-taints=""
21:34:42.535139    4868 flags.go:59] FLAG: --registry-burst="10"
21:34:42.535142    4868 flags.go:59] FLAG: --registry-qps="5"
21:34:42.535144    4868 flags.go:59] FLAG: --reserved-cpus=""
21:34:42.535147    4868 flags.go:59] FLAG: --reserved-memory=""
21:34:42.535150    4868 flags.go:59] FLAG: --resolv-conf="/etc/resolv.conf"
21:34:42.535155    4868 flags.go:59] FLAG: --root-dir="/var/lib/kubelet"
21:34:42.535158    4868 flags.go:59] FLAG: --rotate-certificates="false"
21:34:42.535166    4868 flags.go:59] FLAG: --rotate-server-certificates="false"
21:34:42.535169    4868 flags.go:59] FLAG: --runonce="false"
21:34:42.535171    4868 flags.go:59] FLAG: --runtime-cgroups=""
21:34:42.535174    4868 flags.go:59] FLAG: --runtime-request-timeout="2m0s"
21:34:42.535177    4868 flags.go:59] FLAG: --seccomp-profile-root="/var/lib/kubelet/seccomp"
21:34:42.535180    4868 flags.go:59] FLAG: --serialize-image-pulls="true"
21:34:42.535182    4868 flags.go:59] FLAG: --skip-headers="false"
21:34:42.535185    4868 flags.go:59] FLAG: --skip-log-headers="false"
21:34:42.535187    4868 flags.go:59] FLAG: --stderrthreshold="2"
21:34:42.535190    4868 flags.go:59] FLAG: --storage-driver-buffer-duration="1m0s"
21:34:42.535193    4868 flags.go:59] FLAG: --storage-driver-db="cadvisor"
21:34:42.535196    4868 flags.go:59] FLAG: --storage-driver-host="localhost:8086"
21:34:42.535199    4868 flags.go:59] FLAG: --storage-driver-password="root"
21:34:42.535201    4868 flags.go:59] FLAG: --storage-driver-secure="false"
21:34:42.535204    4868 flags.go:59] FLAG: --storage-driver-table="stats"
21:34:42.535206    4868 flags.go:59] FLAG: --storage-driver-user="root"
21:34:42.535209    4868 flags.go:59] FLAG: --streaming-connection-idle-timeout="4h0m0s"
21:34:42.535212    4868 flags.go:59] FLAG: --sync-frequency="1m0s"
21:34:42.535217    4868 flags.go:59] FLAG: --system-cgroups=""
21:34:42.535219    4868 flags.go:59] FLAG: --system-reserved=""
21:34:42.535222    4868 flags.go:59] FLAG: --system-reserved-cgroup=""
21:34:42.535224    4868 flags.go:59] FLAG: --tls-cert-file=""
21:34:42.535227    4868 flags.go:59] FLAG: --tls-cipher-suites="[]"
21:34:42.535232    4868 flags.go:59] FLAG: --tls-min-version=""
21:34:42.535235    4868 flags.go:59] FLAG: --tls-private-key-file=""
21:34:42.535237    4868 flags.go:59] FLAG: --topology-manager-policy="none"
21:34:42.535240    4868 flags.go:59] FLAG: --topology-manager-scope="container"
21:34:42.535243    4868 flags.go:59] FLAG: --v="2"
21:34:42.535245    4868 flags.go:59] FLAG: --version="false"
21:34:42.535250    4868 flags.go:59] FLAG: --vmodule=""
21:34:42.535252    4868 flags.go:59] FLAG: --volume-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/volume/exec/"
21:34:42.535256    4868 flags.go:59] FLAG: --volume-stats-agg-period="1m0s"
21:34:42.535306    4868 feature_gate.go:243] feature gates: &{map[]}
--cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
--network-plugin has been deprecated, will be removed along with dockershim.
21:34:42.540668    4868 feature_gate.go:243] feature gates: &{map[RotateKubeletServerCertificate:true]}
21:34:42.540754    4868 feature_gate.go:243] feature gates: &{map[RotateKubeletServerCertificate:true]}
21:34:42.548352    4868 mount_linux.go:221] Detected OS with systemd
21:34:42.549176    4868 server.go:440] "Kubelet version" kubeletVersion="v1.21.5-eks-9017834"
21:34:42.549233    4868 feature_gate.go:243] feature gates: &{map[RotateKubeletServerCertificate:true]}
21:34:42.549290    4868 feature_gate.go:243] feature gates: &{map[RotateKubeletServerCertificate:true]}
21:34:42.549372    4868 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
21:34:42.553835    4868 aws.go:1257] Get AWS region from metadata client
21:34:42.553971    4868 aws.go:1302] Zone not specified in configuration file; querying AWS metadata service
21:34:42.558475    4868 aws.go:1342] Building AWS cloudprovider
21:34:42.802894    4868 tags.go:79] AWS cloud filtering on ClusterID: north-1
21:34:42.802964    4868 server.go:552] "Successfully initialized cloud provider" cloudProvider="aws" cloudConfigFile=""
21:34:42.802995    4868 server.go:989] "Cloud provider determined current node" nodeName="ip-192-168-91-143.eu-north-1.compute.internal"
21:34:42.828083    4868 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca-bundle::/etc/kubernetes/pki/ca.crt"
21:34:42.828327    4868 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
21:34:42.830252    4868 manager.go:165] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct/system.slice/kubelet.service"
21:34:42.880256    4868 fs.go:131] Filesystem UUIDs: map[2d806076-c454-481a-97a9-dddd6dddc584:/dev/nvme0n1p1]
21:34:42.880280    4868 fs.go:132] Filesystem partitions: map[/dev/nvme0n1p1:{mountpoint:/ major:259 minor:1 fsType:xfs blockSize:0} /dev/shm:{mountpoint:/dev/shm major:0 minor:21 fsType:tmpfs blockSize:0} /run:{mountpoint:/run major:0 minor:23 fsType:tmpfs blockSize:0} /sys/fs/cgroup:{mountpoint:/sys/fs/cgroup major:0 minor:24 fsType:tmpfs blockSize:0}]
21:34:42.881962    4868 nvidia.go:61] NVIDIA setup failed: no NVIDIA devices found
21:34:42.906616    4868 manager.go:213] Machine: {Timestamp:2022-03-20 21:34:42.905810904 +0000 UTC m=+0.801872831 NumCores:16 NumPhysicalCores:8 NumSockets:1 CpuFrequency:1200250 MemoryCapacity:32599482368 MemoryByType:map[]NVMInfo:{MemoryModeCapacity:0 AppDirectModeCapacity:0 AvgPowerBudget:0} HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] MachineID:ec26be713de0e885907f97646f6530b8 SystemUUID:ec26be71-3de0-e885-907f-97646f6530b8 BootID:7e07d24c-29d8-44e3-b894-ac8efcb7b8f0 Filesystems:[{Device:/run DeviceMajor:0 DeviceMinor:23 Capacity:16299741184 Type:vfs Inodes:3979429 HasInodes:true} {Device:/sys/fs/cgroup DeviceMajor:0 DeviceMinor:24 Capacity:16299741184 Type:vfs Inodes:3979429 HasInodes:true} {Device:/dev/nvme0n1p1 DeviceMajor:259 DeviceMinor:1 Capacity:8577331200 Type:vfs Inodes:4193264 HasInodes:true} {Device:/dev/shm DeviceMajor:0 DeviceMinor:21 Capacity:16299741184 Type:vfs Inodes:3979429 HasInodes:true}] DiskMap:map[259:0:{Name:nvme0n1 Major:259 Minor:0 Size:8589934592 Scheduler:none}] NetworkDevices:[{Name:eth0 MacAddress:06:d3:9a:ad:f0:6e Speed:0 Mtu:9001}] Topology:[{Id:0 Memory:32599482368 HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] Cores:[{Id:0 Threads:[0 8] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:1 Threads:[1 9] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:2 Threads:[10 2] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:3 Threads:[11 3] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:4 Threads:[12 4] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:5 Threads:[13 5] Caches:[{Size:32768 
Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:6 Threads:[14 6] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:7 Threads:[15 7] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0}] Caches:[{Size:25952256 Type:Unified Level:3}]}] CloudProvider:AWS InstanceType:c5.4xlargeInstanceID:i-08dea8269f30f7537}
21:34:42.906733    4868 manager_no_libpfm.go:28] cAdvisor is build without cgo and/or libpfm support. Perf event counters are not available.
21:34:42.920131    4868 manager.go:229] Version: {KernelVersion:5.4.181-99.354.amzn2.x86_64 ContainerOsVersion:Amazon Linux 2 DockerVersion:20.10.7 DockerAPIVersion:1.41 CadvisorVersion: CadvisorRevision:}
21:34:42.921613    4868 container_manager_linux.go:278] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
21:34:42.921825    4868 container_manager_linux.go:283] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:true NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[cpu:{i:{value:110 scale:-3} d:{Dec:<nil>} s:110m Format:DecimalSI} ephemeral-storage:{i:{value:1073741824 scale:0} d:{Dec:<nil>} s:1Gi Format:BinarySI} memory:{i:{value:2966421504 scale:0} d:{Dec:<nil>} s:2829Mi Format:BinarySI}] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
21:34:42.922948    4868 topology_manager.go:120] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
21:34:42.922960    4868 container_manager_linux.go:314] "Initializing Topology Manager" policy="none" scope="container"
21:34:42.922967    4868 container_manager_linux.go:319] "Creating device plugin manager" devicePluginEnabled=true
21:34:42.922978    4868 manager.go:136] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
21:34:42.923784    4868 kubelet.go:307] "Using dockershim is deprecated, please consider using a full-fledged CRI implementation"
21:34:42.924574    4868 client.go:78] "Connecting to docker on the dockerEndpoint" endpoint="unix:///var/run/docker.sock"
21:34:42.924605    4868 client.go:97] "Start docker client with request timeout" timeout="2m0s"
21:34:42.934017    4868 docker_service.go:242] "Hairpin mode is set" hairpinMode=hairpin-veth
21:34:42.934253    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:34:42.936112    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:34:42.936132    4868 plugins.go:168] "Loaded network plugin" networkPluginName="cni"
21:34:42.936188    4868 docker_service.go:257] "Docker cri networking managed by the network plugin" networkPluginName="cni"
21:34:42.936235    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:34:42.944143    4868 docker_service.go:264] "Docker Info" dockerInfo=&{ID:JH44:4W3Q:UPLR:4SJH:JWHQ:OSQA:56DT:3NQ3:BJKX:KC26:CINR:FE33 Containers:0 ContainersRunning:0 ContainersPaused:0 ContainersStopped:0 Images:0 Driver:overlay2 DriverStatus:[[Backing Filesystem xfs] [Supports d_type true] [Native Overlay Diff true] [userxattr false]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host ipvlan macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:true KernelMemory:true KernelMemoryTCP:true CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:true IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:21 OomKillDisable:true NGoroutines:35 SystemTime:2022-03-20T21:34:42.936622619Z LoggingDriver:json-file CgroupDriver:cgroupfs CgroupVersion:1 NEventsListener:0 KernelVersion:5.4.181-99.354.amzn2.x86_64 OperatingSystem:Amazon Linux 2 OSVersion:2 OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc0007bc150 NCPU:16 MemTotal:32599482368 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:ip-192-168-91-143.eu-north-1.compute.internal Labels:[] ExperimentalBuild:false ServerVersion:20.10.7 ClusterStore: ClusterAdvertise: Runtimes:map[io.containerd.runc.v2:{Path:runc Args:[] Shim:<nil>} io.containerd.runtime.v1.linux:{Path:runc Args:[] Shim:<nil>} runc:{Path:runc Args:[] Shim:<nil>}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil> Warnings:[]} LiveRestoreEnabled:true Isolation: InitBinary:docker-init ContainerdCommit:{ID:d71fcd7d8303cbf684402823e425e9dd2e99285d Expected:d71fcd7d8303cbf684402823e425e9dd2e99285d} RuncCommit:{ID:84113eef6fc27af1b01b3181f31bbaf708715301 Expected:84113eef6fc27af1b01b3181f31bbaf708715301} InitCommit:{ID:de40ad0 
Expected:de40ad0} SecurityOptions:[name=seccomp,profile=default] ProductLicense: DefaultAddressPools:[] Warnings:[]}
21:34:42.944162    4868 docker_service.go:277] "Setting cgroupDriver" cgroupDriver="cgroupfs"
21:34:42.944225    4868 kubelet_dockershim.go:62] "Starting the GRPC server for the docker CRI shim."
21:34:42.944242    4868 docker_server.go:62] "Start dockershim grpc server"
21:34:42.952852    4868 server.go:989] "Cloud provider determined current node" nodeName="ip-192-168-91-143.eu-north-1.compute.internal"
21:34:42.952873    4868 server.go:1131] "Using root directory" path="/var/lib/kubelet"
21:34:42.954960    4868 kubelet.go:404] "Attempting to sync node with API server"
21:34:42.955007    4868 kubelet.go:283] "Adding apiserver pod source"
21:34:42.955036    4868 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
21:34:42.967319    4868 kuberuntime_manager.go:222] "Container runtime initialized" containerRuntime="docker" version="20.10.7" apiVersion="1.41.0"
21:34:42.971005    4868 probe.go:268] Flexvolume plugin directory at /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ does not exist. Recreating.
21:34:42.971237    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/azure-disk"
21:34:42.971249    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/azure-file"
21:34:42.971255    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/vsphere-volume"
21:34:42.971260    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/aws-ebs"
21:34:42.971265    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/gce-pd"
21:34:42.971271    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/cinder"
21:34:42.971979    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/empty-dir"
21:34:42.971989    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/git-repo"
21:34:42.971997    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/host-path"
21:34:42.972003    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/nfs"
21:34:42.972018    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/secret"
21:34:42.972026    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/iscsi"
21:34:42.972031    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/glusterfs"
21:34:42.972040    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/rbd"
21:34:42.972045    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/quobyte"
21:34:42.972053    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/cephfs"
21:34:42.972061    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/downward-api"
21:34:42.972067    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/fc"
21:34:42.972072    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/flocker"
21:34:42.972077    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/configmap"
21:34:42.972083    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/projected"
21:34:42.972095    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/portworx-volume"
21:34:42.972101    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/scaleio"
21:34:42.972106    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/local-volume"
21:34:42.972111    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/storageos"
21:34:42.972388    4868 plugins.go:639] Loaded volume plugin "kubernetes.io/csi"
21:34:42.972459    4868 server.go:1190] "Started kubelet"
21:34:42.972506    4868 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
21:34:42.972528    4868 kubelet.go:1309] "Image garbage collection failed once. Stats initialization may not have completed yet" err="failed to get imageFs info: unable to find data in memory cache"
21:34:42.973807    4868 server.go:409] "Adding debug handlers to kubelet server"
21:34:42.974512    4868 certificate_manager.go:282] Certificate rotation is enabled.
21:34:42.974554    4868 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
21:34:42.974659    4868 volume_manager.go:269] "The desired_state_of_world populator starts"
21:34:42.974671    4868 volume_manager.go:271] "Starting Kubelet Volume Manager"
21:34:42.974697    4868 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
21:34:42.982415    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:34:42.996394    4868 factory.go:372] Registering Docker factory
21:34:42.997647    4868 factory.go:137] Registering containerd factory
21:34:42.997712    4868 factory.go:55] Registering systemd factory
21:34:42.997764    4868 factory.go:101] Registering Raw factory
21:34:42.997803    4868 manager.go:1203] Started watching for new ooms in manager
21:34:42.998193    4868 manager.go:301] Starting recovery of all containers
21:34:43.003181    4868 manager.go:306] Recovery completed
21:34:43.010783    4868 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
21:34:43.032449    4868 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
21:34:43.032469    4868 status_manager.go:157] "Starting to sync pod status with apiserver"
21:34:43.032482    4868 kubelet.go:1849] "Starting kubelet main sync loop"
21:34:43.032516    4868 kubelet.go:1873] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
21:34:43.032639    4868 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
21:34:43.032655    4868 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="c5.4xlarge"
21:34:43.032663    4868 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="c5.4xlarge"
21:34:43.032670    4868 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-north-1a"
21:34:43.032676    4868 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-north-1a"
21:34:43.032684    4868 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu-north-1"
21:34:43.032690    4868 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-north-1"
21:34:43.053659    4868 kubelet_node_status.go:554] "Recording event message for node" node="ip-192-168-91-143.eu-north-1.compute.internal" event="NodeHasSufficientMemory"
21:34:43.053692    4868 kubelet_node_status.go:554] "Recording event message for node" node="ip-192-168-91-143.eu-north-1.compute.internal" event="NodeHasNoDiskPressure"
21:34:43.053702    4868 kubelet_node_status.go:554] "Recording event message for node" node="ip-192-168-91-143.eu-north-1.compute.internal" event="NodeHasSufficientPID"
21:34:43.054364    4868 cpu_manager.go:199] "Starting CPU manager" policy="none"
21:34:43.054377    4868 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
21:34:43.054391    4868 state_mem.go:36] "Initialized new in-memory state store"
21:34:43.059888    4868 policy_none.go:44] "None policy: Start"
21:34:43.075578    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
21:34:43.075588    4868 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
21:34:43.075638    4868 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="c5.4xlarge"
21:34:43.075657    4868 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="c5.4xlarge"
21:34:43.075675    4868 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="eu-north-1a"
21:34:43.075690    4868 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="eu-north-1a"
21:34:43.075710    4868 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="eu-north-1"
21:34:43.075726    4868 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="eu-north-1"
21:34:43.087295    4868 manager.go:242] "Starting Device Plugin manager"
21:34:43.087401    4868 manager.go:600] "Failed to retrieve checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
21:34:43.087472    4868 container_manager_linux.go:510] "Discovered runtime cgroup name" cgroupName="/system.slice/docker.service"
21:34:43.087509    4868 manager.go:284] "Serving device plugin registration server on socket" path="/var/lib/kubelet/device-plugins/kubelet.sock"
21:34:43.087600    4868 plugin_watcher.go:52] "Plugin Watcher Start" path="/var/lib/kubelet/plugins_registry"
21:34:43.087645    4868 plugin_manager.go:112] "The desired_state_of_world populator (plugin watcher) starts"
21:34:43.087650    4868 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
21:34:43.088368    4868 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
21:34:43.095935    4868 kubelet_node_status.go:554] "Recording event message for node" node="ip-192-168-91-143.eu-north-1.compute.internal" event="NodeHasSufficientMemory"
21:34:43.095971    4868 kubelet_node_status.go:554] "Recording event message for node" node="ip-192-168-91-143.eu-north-1.compute.internal" event="NodeHasNoDiskPressure"
21:34:43.095986    4868 kubelet_node_status.go:554] "Recording event message for node" node="ip-192-168-91-143.eu-north-1.compute.internal" event="NodeHasSufficientPID"
21:34:43.096013    4868 kubelet_node_status.go:71] "Attempting to register node" node="ip-192-168-91-143.eu-north-1.compute.internal"
21:34:43.175678    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
... ("Error getting node" repeats every ~100 ms) ...
21:34:47.936560    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:34:48.000005    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
21:34:48.095626    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:34:48.100779    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
... ("Error getting node" repeats every ~100 ms until registration completes at 21:34:50) ...
21:34:50.584338    4868 kubelet_node_status.go:74] "Successfully registered node" node="ip-192-168-91-143.eu-north-1.compute.internal"
21:34:50.609830    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
21:34:50.710098    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
21:34:50.960612    4868 apiserver.go:52] "Watching apiserver"
21:34:52.196304    4868 kubelet.go:1935] "SyncLoop ADD" source="api" pods=[]
21:34:52.315578    4868 reconciler.go:157] "Reconciler: start to sync state"
21:34:52.937454    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:34:52.975572    4868 transport.go:260] Unable to cancel request for *exec.roundTripper
21:34:52.975645    4868 controller.go:144] failed to ensure lease exists, will retry in 200ms, error: Get "https://5360904D5FF2A149DD9A49F4160B8231.yl4.eu-north-1.eks.amazonaws.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-192-168-91-143.eu-north-1.compute.internal?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
21:34:53.104144    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:34:57.571312    4868 log.go:184] http: TLS handshake error from [2a05:d016:136:e01:75f7::2]:35546: no serving certificate available for the kubelet
21:34:57.938150    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:34:58.114474    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:34:58.892423    4868 log.go:184] http: TLS handshake error from [2a05:d016:136:e01:75f7::3]:44108: no serving certificate available for the kubelet
21:35:00.584848    4868 transport.go:260] Unable to cancel request for *exec.roundTripper
21:35:00.585056    4868 kubelet_node_status.go:470] "Error updating node status, will retry" err="error getting node \"ip-192-168-91-143.eu-north-1.compute.internal\": Get \"https://5360904D5FF2A149DD9A49F4160B8231.yl4.eu-north-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-91-143.eu-north-1.compute.internal?resourceVersion=0&timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
21:35:02.939231    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:03.123990    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:03.985187    4868 certificate_manager.go:412] Rotating certificates
21:35:04.659426    4868 transport.go:260] Unable to cancel request for *exec.roundTripper
21:35:04.659612    4868 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="Get \"https://5360904D5FF2A149DD9A49F4160B8231.yl4.eu-north-1.eks.amazonaws.com/api/v1/nodes/ip-192-168-91-143.eu-north-1.compute.internal?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" node="ip-192-168-91-143.eu-north-1.compute.internal"
21:35:07.939710    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:08.133333    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:12.940056    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:13.143591    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:13.212962    4868 log.go:184] http: TLS handshake error from [2a05:d016:136:e01:75f7::3]:44222: no serving certificate available for the kubelet
21:35:13.617893    4868 log.go:184] http: TLS handshake error from [2a05:d016:136:e01:75f7::2]:35670: no serving certificate available for the kubelet
21:35:17.940882    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:18.153234    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:19.511693    4868 csr.go:245] certificate signing request csr-bwrgz is issued
21:35:20.816307    4868 certificate_manager.go:556] Certificate expiration is 2023-03-20 21:30:00 +0000 UTC, rotation deadline is 2023-02-10 21:45:36.587528103 +0000 UTC
21:35:20.816352    4868 certificate_manager.go:288] Waiting 7848h10m15.771182295s for next certificate rotation
21:35:21.817238    4868 certificate_manager.go:556] Certificate expiration is 2023-03-20 21:30:00 +0000 UTC, rotation deadline is 2022-12-23 20:15:14.148917091 +0000 UTC
21:35:21.817282    4868 certificate_manager.go:288] Waiting 6670h39m52.331642061s for next certificate rotation
21:35:22.941847    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:23.163638    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:27.942023    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:28.173835    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:32.943015    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:33.184190    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:37.943993    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:38.193466    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:42.944165    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:43.203010    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
21:35:47.945287    4868 cni.go:239] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
21:35:48.213304    4868 kubelet.go:2214] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
... repeats

After this I rebooted the instance, and docker ps did show containers.

aws-node logs:

{"level":"info","ts":"2022-03-20T21:50:14.450Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-20T21:50:14.451Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-20T21:50:14.464Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-20T21:50:14.465Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-20T21:50:16.472Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:18.480Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:20.488Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:22.495Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:24.503Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:26.511Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:28.518Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-20T21:50:30.526Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
... repeats

and ipamd.log:

$ cat /var/log/aws-routed-eni/ipamd.log
{"level":"info","ts":"2022-03-20T21:41:58.260Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-20T21:41:58.260Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-20T21:43:35.393Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-20T21:43:35.394Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-20T21:45:25.083Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-20T21:45:25.088Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-20T21:46:52.025Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-20T21:46:52.026Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-20T21:46:52.052Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-20T21:46:52.053Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
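
The ipamd log above never gets past "Testing communication with server" before the process restarts, which is consistent with the API server being unreachable. As a hedged diagnostic (the ports below are the VPC CNI defaults, not something shown in these logs), ipamd serves gRPC on 127.0.0.1:50051 and an HTTP introspection endpoint on 127.0.0.1:61679, so you can probe it locally on the node:

```shell
# Run on the affected node. Ports are the VPC CNI defaults (an assumption
# here): gRPC on 50051, HTTP introspection on 61679.
if curl -fsS http://127.0.0.1:61679/v1/enis >/dev/null 2>&1; then
  IPAMD_STATE="responding"
else
  IPAMD_STATE="not responding"
fi
echo "ipamd introspection: $IPAMD_STATE"
```

If the introspection endpoint never starts responding, ipamd is stuck before serving, which matches the restart loop in the log above.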

What you expected to happen:

The node to become healthy.

How to reproduce it (as minimally and precisely as possible):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: 'test-1'
  region: 'eu-north-1'
  version: '1.21'

addons:
  - name: vpc-cni
    version: latest     # NOTE: must be present, otherwise a random version is picked
  - name: coredns
    version: latest     # NOTE: must be present, otherwise a random version is picked
  - name: kube-proxy
    version: latest     # NOTE: must be present, otherwise a random version is picked
iam:
  withOIDC: true

kubernetesNetworkConfig:
  ipFamily: IPv6
---

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: 'north-1'
  region: 'eu-north-1'

managedNodeGroups:
  - name: 'g-1-t-32-64-pre-p6-1-2022-03-20-22-28-09'
    instancePrefix: 'g-1'
    instanceName: 'g-1-t-32-64-pre-p6-1-2022-03-20-22-28-09'
    labels:
      test: 'yes'
    volumeType: 'gp3'
    volumeSize: 8
    volumeIOPS: 3000
    volumeThroughput: 125
    minSize: 0
    maxSize: 450
    spot: true
    iam:
      withAddonPolicies:
        autoScaler: true
        imageBuilder: true
        ebs: true
        albIngress: true
    instanceTypes:
      - c5a.8xlarge

    ami: ami-0f47e499c06cba4fc
    preBootstrapCommands:
      - date > /tmp/eksler.start
      - echo 'changing root password'
      - echo 'root:badpassword' | sudo chpasswd
      - echo 'password changed'
      - sudo yum install -y htop iotop nano screen bind-utils
      - sudo amazon-linux-extras install epel -y
      - sudo yum install -y stress
      - date > /tmp/eksler.done
    overrideBootstrapCommand: |
      #!/bin/bash
      date > /tmp/eksler.bootstrap.start
      /etc/eks/bootstrap.sh north-1 |& tee /tmp/eksler.bootstrap.output
      echo 0 > /tmp/eksler.bootstrap.status
      date > /tmp/eksler.bootstrap.done

Environment:

matti commented 2 years ago

Maybe this is not related to cluster size: 64-core and 96-core instances seem to fail to register even when they are Nitro (#1935) and non-metal (#1929).

matti commented 2 years ago

Confirmed! When I used only 16-vCPU, 32-vCPU, and 48-vCPU instances this didn't happen: I was able to scale to 675 instances and run 15,788 pods with requests.cpu=1 (so ~15,788 cores) until my quota ran out.

jayanthvn commented 2 years ago

@matti thanks for checking. Can you please share the log bundle from the failed node? You can email it to k8s-awscni-triage@amazon.com. Also, just to confirm: does aws-node never recover?

achevuru commented 2 years ago

@matti Based on the logs below that you shared, the VPC CNI isn't able to talk to the API server, and this is preventing it from initializing correctly. Until the CNI can validate API server connectivity, it will not copy the CNI config file to /etc/cni/net.d or the CNI binaries to /opt/cni/bin, resulting in the kubelet error messages you saw about networking not being initialized.

{"level":"info","ts":"2022-03-20T21:41:58.260Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-20T21:41:58.260Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}

We need to check why API server connectivity is broken for these instances. Also, the kubelet isn't able to obtain the node info. The kubelet derives the node info and validates that the nodeIP provided to it via its config file indeed belongs to the current instance. It looks like it is not able to do that; permission issues?

21:34:43.175678    4868 kubelet.go:2294] "Error getting node" err="node \"ip-192-168-91-143.eu-north-1.compute.internal\" not found"
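
To see this gating in practice, here is a quick check to run on an affected node (a sketch; the directory paths are the CNI defaults, and 10-aws.conflist is the config file the plugin normally writes):

```shell
# Until aws-node validates API server connectivity and finishes init,
# these directories stay absent or missing the 10-aws.conflist config.
CNI_REPORT=""
for d in /etc/cni/net.d /opt/cni/bin; do
  if [ -d "$d" ]; then
    CNI_REPORT="$CNI_REPORT $d:present($(ls "$d" | wc -l) entries)"
  else
    CNI_REPORT="$CNI_REPORT $d:missing"
  fi
done
echo "$CNI_REPORT"
```

An empty /etc/cni/net.d is exactly what produces the "cni config uninitialized" kubelet messages in the logs above.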

jayanthvn commented 2 years ago

Also did kube-proxy start up fine on this node?

matti commented 2 years ago

It looks like it is not able to do that; permission issues?

@achevuru I have no idea how it could be a permission issue, since other nodepools work fine. The only thing that differs is that the problematic nodepools use 64-vCPU/128-GB and 96-vCPU/192-GB instances, which ec2-instance-selector returns here: https://github.com/matti/eksler/blob/e4eba9f1b2bbb5d46930ceaa7d0cdeb520777966/bin/eksler#L348-L357

Also did kube-proxy start up fine on this node?

@jayanthvn in my opinion it did; how can I be sure?
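
One way to check (a hedged sketch, assuming kubectl access to the cluster; the node name is the one from this issue, substitute the failing node):

```shell
# A healthy kube-proxy pod on the node is Running with no restarts, and its
# log shows rules being synced rather than API server connection errors.
NODE="ip-192-168-91-143.eu-north-1.compute.internal"
if command -v kubectl >/dev/null 2>&1; then
  kubectl -n kube-system get pods -l k8s-app=kube-proxy \
    --field-selector "spec.nodeName=$NODE" -o wide
  POD=$(kubectl -n kube-system get pods -l k8s-app=kube-proxy \
    --field-selector "spec.nodeName=$NODE" -o name | head -n 1)
  if [ -n "$POD" ]; then
    kubectl -n kube-system logs "$POD" --tail=20
  fi
  KP_CHECK="ran"
else
  KP_CHECK="kubectl unavailable"
fi
echo "kube-proxy check: $KP_CHECK"
```
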

Can you please share the logs bundle on the failed node.

I'll post logs tomorrow, maybe, when I have time to try this again. Meanwhile, I'm fairly sure you can replicate this yourself by starting similar clusters 1-3 times; it should happen. eu-north-1 is the region I'm using.

matti commented 2 years ago

Also, I'm constantly creating/deleting clusters, and when I use only 8-, 16-, 32-, and 48-core instances I have zero problems.

achevuru commented 2 years ago

@matti If you just create a small nodegroup with either 64-vCPU/128-GB or 96-vCPU/192-GB instances, does it fail? Or do you see these failures only when you scale beyond a certain limit?

achevuru commented 2 years ago

@matti I tried out IPv6 node groups with both 64-vCPU (c5a.16xlarge) and 96-vCPU (c5a.24xlarge) instances, and they came up fine for me. Pod scheduling worked as intended as well. We may have to check both the kubelet and aws-node (CNI) logs in your case to see what is contributing to the problem.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] commented 2 years ago

Issue closed due to inactivity.