charmed-kubernetes / kubernetes-docs

This repository contains the development version of docs for Charmed Kubernetes

Installation of Charmed Kubernetes with GPU on a local (LXD) cloud never completes #830

Open iiot-architect opened 6 months ago

iiot-architect commented 6 months ago

I'm trying to install Charmed Kubernetes with an NVIDIA GPU on an Amazon EC2 instance (g5.xlarge), using the local LXD cloud:

sudo snap install juju --classic
juju add-credential localhost
juju clouds
juju bootstrap
juju add-model k8s
juju deploy charmed-kubernetes
juju config calico ignore-loose-rpf=true
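
For reference, the deployment can be watched while it settles with something like this:

juju status --watch 5s   # poll interval is arbitrary
juju debug-log           # follow agent logs while units converge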

However, the process does not seem to have finished even after more than 3 hours:

ubuntu@ip-10-10-1-38:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.3.1    unsupported  09:31:30Z

App                       Version  Status   Scale  Charm                     Channel      Rev  Exposed  Message
calico                    3.21.4   active       5  calico                    1.27/stable   87  no       Calico is active
containerd                         blocked      5  containerd                1.27/stable   65  no       containerd resource binary containerd-stress failed a version check
easyrsa                   3.0.1    active       1  easyrsa                   1.27/stable   42  no       Certificate Authority connected.
etcd                      3.4.22   active       3  etcd                      1.27/stable  742  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     1.27/stable   79  yes      Loadbalancer ready.
kubernetes-control-plane  1.27.10  waiting      2  kubernetes-control-plane  1.27/stable  274  no       Waiting for 4 kube-system pods to start
kubernetes-worker         1.27.10  waiting      3  kubernetes-worker         1.27/stable  112  yes      Waiting for kubelet to start.

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        10.132.163.17                 Certificate Authority connected.
etcd/0*                      active    idle   1        10.132.163.184  2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.132.163.135  2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.132.163.233  2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   4        10.132.163.33   443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0   waiting   idle   5        10.132.163.119  6443/tcp      Waiting for 4 kube-system pods to start
  calico/3                   active    idle            10.132.163.119                Calico is active
  containerd/3               blocked   idle            10.132.163.119                containerd resource binary containerd-stress failed a version check
kubernetes-control-plane/1*  waiting   idle   6        10.132.163.146  6443/tcp      Waiting for 4 kube-system pods to start
  calico/4                   active    idle            10.132.163.146                Calico is active
  containerd/4               blocked   idle            10.132.163.146                containerd resource binary containerd-stress failed a version check
kubernetes-worker/0*         waiting   idle   7        10.132.163.121  80,443/tcp    Waiting for kubelet to start.
  calico/2                   active    idle            10.132.163.121                Calico is active
  containerd/2               blocked   idle            10.132.163.121                containerd resource binary containerd-stress failed a version check
kubernetes-worker/1          waiting   idle   8        10.132.163.243  80,443/tcp    Waiting for kubelet to start.
  calico/0*                  active    idle            10.132.163.243                Calico is active
  containerd/0*              blocked   idle            10.132.163.243                containerd resource binary containerd-stress failed a version check
kubernetes-worker/2          waiting   idle   9        10.132.163.140  80,443/tcp    Waiting for kubelet to start.
  calico/1                   active    idle            10.132.163.140                Calico is active
  containerd/1               blocked   idle            10.132.163.140                containerd resource binary containerd-stress failed a version check

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.132.163.17   juju-84dc78-0  ubuntu@22.04      Running
1        started  10.132.163.184  juju-84dc78-1  ubuntu@22.04      Running
2        started  10.132.163.135  juju-84dc78-2  ubuntu@22.04      Running
3        started  10.132.163.233  juju-84dc78-3  ubuntu@22.04      Running
4        started  10.132.163.33   juju-84dc78-4  ubuntu@22.04      Running
5        started  10.132.163.119  juju-84dc78-5  ubuntu@22.04      Running
6        started  10.132.163.146  juju-84dc78-6  ubuntu@22.04      Running
7        started  10.132.163.121  juju-84dc78-7  ubuntu@22.04      Running
8        started  10.132.163.243  juju-84dc78-8  ubuntu@22.04      Running
9        started  10.132.163.140  juju-84dc78-9  ubuntu@22.04      Running

kubernetes-control-plane keeps cycling between the messages 'Restarting snap.kubelet.daemon service' and 'Waiting for 4 kube-system pods to start'. Likewise, containerd keeps cycling between 'Unpacking containerd resource' and 'containerd resource binary containerd-stress failed a version check'.
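
For reference, the loop can be inspected from a unit's status history and its agent log, e.g.:

juju show-status-log containerd/0               # unit name as in the status output above
juju debug-log --include containerd/0 --replay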

The following software was installed on the instance before the deployment:

NVIDIA GPU driver: https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
NVIDIA CUDA: https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run

I tried both the 1.28/stable and 1.27/stable channels, but the symptoms were almost the same. How can I solve this problem?

evilnick commented 6 months ago

Hi, sorry you are having an issue. It does look like containerd is getting stuck in a loop, preventing the nodes from coming up, which I guess could be caused by the GPU driver. The kubernetes-worker charm automatically downloads the required drivers, which may be causing the issue if they have been pre-installed.

Perhaps @kwmonroe may have some insights here.

In the meantime, it may be worth setting containerd to ignore the GPU, to confirm that this is the issue:

juju config containerd gpu_driver="none"

or trying again without pre-installing the drivers.
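
To read the current value back afterwards:

juju config containerd gpu_driver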

iiot-architect commented 6 months ago

Well, if I install without the GPU driver and CUDA, the process finishes normally; however, the message 'without gpu support' is shown. Having confirmed that, I'm now retrying the installation with the drivers. Note: as a next step I'll run an LLM on the Kubernetes cluster, so the GPU is essential.
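
For reference, one way to check whether the workers actually expose a GPU to Kubernetes is to look for the nvidia.com/gpu resource on the nodes, e.g.:

kubectl describe nodes | grep -i 'nvidia.com/gpu'   # only appears once the NVIDIA device plugin is running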

ubuntu@ip-10-10-11-82:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.4.0    unsupported  09:54:24Z

App                       Version  Status  Scale  Charm                     Channel      Rev  Exposed  Message
calico                    3.25.1   active      5  calico                    1.28/stable  101  no       Ready
containerd                1.6.8    active      5  containerd                1.28/stable   73  no       Container runtime available
easyrsa                   3.0.1    active      1  easyrsa                   1.28/stable   48  no       Certificate Authority connected.
etcd                      3.4.22   active      3  etcd                      1.28/stable  748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active      1  kubeapi-load-balancer     1.28/stable   84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.7   active      2  kubernetes-control-plane  1.28/stable  321  no       Kubernetes control-plane running.
kubernetes-worker         1.28.7   active      3  kubernetes-worker         1.28/stable  134  yes      Kubernetes worker running (without gpu support).

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0                    active    idle   0        10.32.96.191                  Certificate Authority connected.
etcd/0                       active    idle   1        10.32.96.166    2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.32.96.78     2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.32.96.5      2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0      active    idle   4        10.32.96.141    443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0   active    idle   5        10.32.96.126    6443/tcp      Kubernetes control-plane running.
  calico/4                   active    idle            10.32.96.126                  Ready
  containerd/4               active    idle            10.32.96.126                  Container runtime available
kubernetes-control-plane/1   active    idle   6        10.32.96.24     6443/tcp      Kubernetes control-plane running.
  calico/3                   active    idle            10.32.96.24                   Ready
  containerd/3               active    idle            10.32.96.24                   Container runtime available
kubernetes-worker/0          active    idle   7        10.32.96.187    80,443/tcp    Kubernetes worker running (without gpu support).
  calico/2                   active    idle            10.32.96.187                  Ready
  containerd/2               active    idle            10.32.96.187                  Container runtime available
kubernetes-worker/1          active    idle   8        10.32.96.97     80,443/tcp    Kubernetes worker running (without gpu support).
  calico/0                   active    idle            10.32.96.97                   Ready
  containerd/0*              active    idle            10.32.96.97                   Container runtime available
kubernetes-worker/2          active    idle   9        10.32.96.169    80,443/tcp    Kubernetes worker running (without gpu support).
  calico/1                   active    idle            10.32.96.169                  Ready
  containerd/1               active    idle            10.32.96.169                  Container runtime available

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.96.191  juju-5e37ba-0  ubuntu@22.04      Running
1        started  10.32.96.166  juju-5e37ba-1  ubuntu@22.04      Running
2        started  10.32.96.78   juju-5e37ba-2  ubuntu@22.04      Running
3        started  10.32.96.5    juju-5e37ba-3  ubuntu@22.04      Running
4        started  10.32.96.141  juju-5e37ba-4  ubuntu@22.04      Running
5        started  10.32.96.126  juju-5e37ba-5  ubuntu@22.04      Running
6        started  10.32.96.24   juju-5e37ba-6  ubuntu@22.04      Running
7        started  10.32.96.187  juju-5e37ba-7  ubuntu@22.04      Running
8        started  10.32.96.97   juju-5e37ba-8  ubuntu@22.04      Running
9        started  10.32.96.169  juju-5e37ba-9  ubuntu@22.04      Running
ubuntu@ip-10-10-11-82:~$

kwmonroe commented 6 months ago

@iiot-architect can you provide some details on your instance? I just deployed a g5.xlarge and got:

ubuntu@ip-172-31-20-96:~$ nproc
4

ubuntu@ip-172-31-20-96:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       224Mi        14Gi       0.0Ki       834Mi        14Gi
Swap:             0B          0B          0B

ubuntu@ip-172-31-20-96:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.6G  2.0G  5.7G  26% /

The Charmed Kubernetes bundle is pretty heavyweight, especially when deployed to LXD. I doubt 4 cores and 16 GB of RAM will be enough, but I'm positive an 8 GB root filesystem won't be :)

Is it possible you've run out of disk space on your instance?
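
One quick check is the free space inside the LXD containers themselves, e.g.:

lxc exec juju-84dc78-5 -- df -h /   # instance name taken from the earlier juju status output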

iiot-architect commented 6 months ago

Dear kwmonroe.

No, disk space is not a problem: I allocated a 200 GB gp3 root storage volume to the instance.

iiot-architect commented 6 months ago

According to the official blog, I think the NVIDIA driver and CUDA should be installed on the host in advance:

https://ubuntu.com/blog/nvidia-cuda-inside-a-lxd-container
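
The approach in that post amounts to passing the host GPU through to the container; a minimal LXD sketch (profile and container names are illustrative):

lxc profile create gpu-passthrough                 # illustrative profile name
lxc profile device add gpu-passthrough gpu0 gpu    # expose the host GPU to the profile
lxc profile add <container-name> gpu-passthrough   # attach the profile to a container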

iiot-architect commented 6 months ago

It seems that the containerd configuration isn't taking effect:

ubuntu@ip-10-10-8-132:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.4.0    unsupported  02:13:35Z

App                       Version  Status   Scale  Charm                     Channel      Rev  Exposed  Message
calico                             waiting      5  calico                    1.28/stable  101  no       Configuring Calico
containerd                         blocked      5  containerd                1.28/stable   73  no       containerd resource binary containerd-stress failed a version check
easyrsa                   3.0.1    active       1  easyrsa                   1.28/stable   48  no       Certificate Authority connected.
etcd                      3.4.22   active       3  etcd                      1.28/stable  748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     1.28/stable   84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.6   waiting      2  kubernetes-control-plane  1.28/stable  321  no       Waiting for 4 kube-system pods to start
kubernetes-worker         1.28.6   waiting      3  kubernetes-worker         1.28/stable  134  yes      Waiting for kubelet to start.

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        10.215.33.158                 Certificate Authority connected.
etcd/0*                      active    idle   1        10.215.33.60    2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.215.33.190   2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.215.33.33    2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   4        10.215.33.103   443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0*  waiting   idle   5        10.215.33.109   6443/tcp      Waiting for 4 kube-system pods to start
  calico/4                   waiting   idle            10.215.33.109                 Configuring Calico
  containerd/4               blocked   idle            10.215.33.109                 containerd resource binary containerd-stress failed a version check
kubernetes-control-plane/1   waiting   idle   6        10.215.33.156   6443/tcp      Waiting for 4 kube-system pods to start
  calico/3                   waiting   idle            10.215.33.156                 Configuring Calico
  containerd/3               blocked   idle            10.215.33.156                 containerd resource binary containerd-stress failed a version check
kubernetes-worker/0*         waiting   idle   7        10.215.33.97    80,443/tcp    Waiting for kubelet to start.
  calico/2                   waiting   idle            10.215.33.97                  Configuring Calico
  containerd/2               blocked   idle            10.215.33.97                  containerd resource binary containerd-stress failed a version check
kubernetes-worker/1          waiting   idle   8        10.215.33.20    80,443/tcp    Waiting for kubelet to start.
  calico/0*                  waiting   idle            10.215.33.20                  Configuring Calico
  containerd/0*              blocked   idle            10.215.33.20                  containerd resource binary containerd-stress failed a version check
kubernetes-worker/2          waiting   idle   9        10.215.33.96    80,443/tcp    Waiting for kubelet to start.
  calico/1                   waiting   idle            10.215.33.96                  Configuring Calico
  containerd/1               blocked   idle            10.215.33.96                  containerd resource binary containerd-stress failed a version check

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.215.33.158  juju-7d866d-0  ubuntu@22.04      Running
1        started  10.215.33.60   juju-7d866d-1  ubuntu@22.04      Running
2        started  10.215.33.190  juju-7d866d-2  ubuntu@22.04      Running
3        started  10.215.33.33   juju-7d866d-3  ubuntu@22.04      Running
4        started  10.215.33.103  juju-7d866d-4  ubuntu@22.04      Running
5        started  10.215.33.109  juju-7d866d-5  ubuntu@22.04      Running
6        started  10.215.33.156  juju-7d866d-6  ubuntu@22.04      Running
7        started  10.215.33.97   juju-7d866d-7  ubuntu@22.04      Running
8        started  10.215.33.20   juju-7d866d-8  ubuntu@22.04      Running
9        started  10.215.33.96   juju-7d866d-9  ubuntu@22.04      Running
ubuntu@ip-10-10-8-132:~$ juju config containerd gpu_driver="none"
WARNING the configuration setting "gpu_driver" already has the value "none"

evilnick commented 6 months ago

According to the official blog, I think the NVIDIA driver and CUDA should be installed on the host in advance:

https://ubuntu.com/blog/nvidia-cuda-inside-a-lxd-container

That blog post is six years old, so I'm not sure how much of it is still reliable. If containerd's gpu_driver is already set to "none", try setting it to "nvidia" instead. Though if it is set to "none" and still failing, then maybe the issue isn't the driver after all.
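
That is, using the same config key as before:

juju config containerd gpu_driver="nvidia"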

iiot-architect commented 6 months ago

Dear evilnick

If containerd's gpu_driver is already set to "none", try setting it to "nvidia" instead.

Well, it seems to be irrelevant: I tried it, but the result was almost the same. I also added a GPU device to each LXD container, but the result was the same as when the driver was installed in advance.

lxc config device add [Name of Lxd] gpu gpu

In addition, I changed the instance type from g5.xlarge to g4ad.2xlarge, with the driver installed in advance, but the result was almost unchanged.

iiot-architect commented 6 months ago

Dear kwmonroe.

Thanks for your help. I tried again based on the new document you shared:

sudo apt update
sudo apt -y full-upgrade && sudo reboot -f
wget https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2204-535.154.05/nvidia-driver-local-91B8C5A2-keyring.gpg /usr/share/keyrings/
sudo apt install -y build-essential
wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
sudo sh cuda_12.3.2_545.23.08_linux.run --silent
echo 'export PATH=$PATH:/usr/local/cuda' >> ~/.bashrc
source ~/.bashrc
nvidia-smi

https://deploy-preview-832--cdk-next.netlify.app/kubernetes/docs/install-local

Sure enough, the configuration process completed normally. However, the workers are still without GPU support:

ubuntu@ip-10-10-4-228:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
ck8s   localhost-localhost  localhost/localhost  3.4.0    unsupported  09:23:26Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Exposed  Message
calico                    3.25.1   active      5  calico                    stable   101  no       Ready
containerd                1.7.2    active      5  containerd                stable    73  no       Container runtime available
easyrsa                   3.0.1    active      1  easyrsa                   stable    48  no       Certificate Authority connected.
etcd                      3.4.22   active      3  etcd                      stable   748  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active      1  kubeapi-load-balancer     stable    84  yes      Loadbalancer ready.
kubernetes-control-plane  1.28.7   active      2  kubernetes-control-plane  stable   321  no       Kubernetes control-plane running.
kubernetes-worker         1.28.7   active      3  kubernetes-worker         stable   134  yes      Kubernetes worker running (without gpu support).

And I added a GPU device to each worker's LXD container, but nothing changed:

lxc config device add juju-4e969a-7 gpu gpu
lxc config device add juju-4e969a-8 gpu gpu
lxc config device add juju-4e969a-9 gpu gpu
lxc restart juju-4e969a-7
lxc restart juju-4e969a-8
lxc restart juju-4e969a-9
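
To check whether the GPU is actually visible inside one of the containers (this assumes the NVIDIA user-space tools are present there):

lxc exec juju-4e969a-7 -- nvidia-smi   # container name as in the commands above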