NVIDIA / cloud-native-stack

Run cloud native workloads on NVIDIA GPUs
Apache License 2.0

Not able to connect to Kubernetes Cluster on System Restart #37

Closed jimittmodi closed 1 year ago

jimittmodi commented 1 year ago

I can no longer connect to the Kubernetes cluster after a system restart.

When I run the playbook again, I get the following error:

TASK [Iniitialize the Kubernetes cluster using kubeadm and containerd for Cloud Native Core 6.2] *******************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--pod-network-cidr=192.168.32.0/22", "--cri-socket=/run/containerd/containerd.sock", "--kubernetes-version=v1.23.7", "--image-repository=k8s.gcr.io"], "delta": "0:00:06.605533", "end": "2023-02-23 07:11:48.306510", "msg": "non-zero return code", "rc": 1, "start": "2023-02-23 07:11:41.700977", "stderr": "error execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.23.7: output: E0223 07:11:43.840167   21550 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\" image=\"k8s.gcr.io/kube-apiserver:v1.23.7\"\ntime=\"2023-02-23T07:11:43Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.23.7: output: E0223 07:11:44.604162   21601 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at 
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\" image=\"k8s.gcr.io/kube-controller-manager:v1.23.7\"\ntime=\"2023-02-23T07:11:44Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.23.7: output: E0223 07:11:45.351086   21666 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\" image=\"k8s.gcr.io/kube-scheduler:v1.23.7\"\ntime=\"2023-02-23T07:11:45Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.23.7: output: E0223 07:11:46.097771   21716 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob 
sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\" image=\"k8s.gcr.io/kube-proxy:v1.23.7\"\ntime=\"2023-02-23T07:11:46Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.6: output: E0223 07:11:46.815906   21774 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": blob sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db: blob not found: not found\" image=\"k8s.gcr.io/pause:3.6\"\ntime=\"2023-02-23T07:11:46Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": blob sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.5.1-0: output: E0223 07:11:47.565614   21831 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob 
sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\" image=\"k8s.gcr.io/etcd:3.5.1-0\"\ntime=\"2023-02-23T07:11:47Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.6: output: E0223 07:11:48.302979   21889 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\" image=\"k8s.gcr.io/coredns/coredns:v1.8.6\"\ntime=\"2023-02-23T07:11:48Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\"\n, error: exit status 1\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["error execution phase preflight: [preflight] Some 
fatal errors occurred:", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.23.7: output: E0223 07:11:43.840167   21550 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\" image=\"k8s.gcr.io/kube-apiserver:v1.23.7\"", "time=\"2023-02-23T07:11:43Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.23.7: output: E0223 07:11:44.604162   21601 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\" image=\"k8s.gcr.io/kube-controller-manager:v1.23.7\"", "time=\"2023-02-23T07:11:44Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at 
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.23.7: output: E0223 07:11:45.351086   21666 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\" image=\"k8s.gcr.io/kube-scheduler:v1.23.7\"", "time=\"2023-02-23T07:11:45Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.23.7: output: E0223 07:11:46.097771   21716 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\" image=\"k8s.gcr.io/kube-proxy:v1.23.7\"", "time=\"2023-02-23T07:11:46Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob 
sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.6: output: E0223 07:11:46.815906   21774 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": blob sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/3d380ca8864549e7e failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\" image=\"k8s.gcr.io/etcd:3.5.1-0\"", "time=\"2023-02-23T07:11:47Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.6: output: E0223 07:11:48.302979   21889 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at 
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\" image=\"k8s.gcr.io/coredns/coredns:v1.8.6\"", "time=\"2023-02-23T07:11:48Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\"", ", error: exit status 1", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.23.7\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "stdout_lines": ["[init] Using Kubernetes version: v1.23.7", "[preflight] Running pre-flight checks", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'"]}

My main concern is understanding why I cannot connect to the cluster and how to resolve it.
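For anyone triaging output like the log above, here is a minimal sketch (not part of the playbook) that lists which images kubeadm reported as failing and whether the "blob not found" wording is present. The function names `failed_images` and `mentions_missing_blob` are hypothetical, and the pattern is based only on the `[ERROR ImagePull]` lines shown in this issue.

```python
import re
from typing import List

# Pattern taken from the "[ERROR ImagePull]: failed to pull image <ref>: output:"
# lines in the log above; other kubeadm versions may word this differently.
IMAGE_PULL_RE = re.compile(
    r"\[ERROR ImagePull\]: failed to pull image (\S+?): output:"
)


def failed_images(stderr: str) -> List[str]:
    """Return the image references kubeadm failed to pull, in first-seen order.

    Ansible repeats the same text in "stderr" and "stderr_lines", so
    duplicates are dropped.
    """
    matches = IMAGE_PULL_RE.findall(stderr)
    return list(dict.fromkeys(matches))


def mentions_missing_blob(stderr: str) -> bool:
    """True if the output contains the 'blob not found' message seen in the log."""
    return "blob not found" in stderr


if __name__ == "__main__":
    import sys

    text = sys.stdin.read()
    for image in failed_images(text):
        print(image)
    print("blob not found reported:", mentions_missing_blob(text))
```

Piping the saved task output into the script prints one image reference per line, which makes it easier to see that every control-plane image failed in the same way.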

angudadevops commented 1 year ago

@jimittmodi this can happen if the system's IP address has changed. Did you get a chance to check that?

If that's the case, we suggest running the uninstall and then installing again. Please let us know.

Thanks Anurag G
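One way to check for an IP change is to compare the API server address recorded in the kubeconfig with the addresses the machine currently has. The sketch below is an illustration under assumptions, not something shipped with Cloud Native Stack: the helper names are hypothetical, it expects the kubeconfig text to be read by the caller (kubeadm writes one to `/etc/kubernetes/admin.conf` by default), and it only handles IPv4 addresses or hostnames in the `server:` field.

```python
import re
from typing import Iterable, Optional

# Matches the host part of a kubeconfig line such as
# "    server: https://10.0.0.4:6443" (IPv4 or hostname only; bracketed
# IPv6 literals are not handled by this sketch).
SERVER_RE = re.compile(r"^\s*server:\s*https?://([^:/\s]+)", re.MULTILINE)


def api_server_host(kubeconfig_text: str) -> Optional[str]:
    """Return the host from the first 'server:' entry, or None if absent."""
    match = SERVER_RE.search(kubeconfig_text)
    return match.group(1) if match else None


def ip_changed(kubeconfig_text: str, current_addresses: Iterable[str]) -> bool:
    """True if the recorded API server host is not among the current addresses."""
    host = api_server_host(kubeconfig_text)
    return host is not None and host not in set(current_addresses)
```

The current addresses can be collected however is convenient, for example from the output of `hostname -I` on Linux. A `True` result suggests the cluster was initialized against an address the machine no longer holds.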

jimittmodi commented 1 year ago

Hi @angudadevops, we are running it on an Azure VM, and neither the internal nor the external IP changed on restart. I also tried the uninstall option and then installed again, but the result is the same.

Thanks Jimit M

angudadevops commented 1 year ago

@jimittmodi got it. It looks like CNS didn't install properly because it failed to connect to k8s.gcr.io to pull the images. CNS is currently not implemented for Azure. I guess you might need to adjust the security groups so that the images can be pulled.

Hope this helps:

https://github.com/Uninett/azure/blob/master/modules/kubernetes.md

Thanks Anurag G

angudadevops commented 1 year ago

As this is not specific to Cloud Native Stack, closing this issue.