[rodan migration] set up (v)GPU drivers and configure docker environment

homework36 commented 1 month ago

As part of issue #1137, we need to set up the new VM to be able to run vGPUs. This issue is a record of what I have tried (on 2024-05-13).

The wiki has instructions, but following them does not produce a successful nvidia-smi output. Given that they were written on 2022-06-01, some errors and failures are to be expected.

I followed this guide from Compute Canada. Its instructions apply only to VMs with the g1-8gb-c4-22gb and g1-16gb-c8-40gb flavours (we are using the latter for the new Rodan production VM).

On a Debian system, we may also need to install the kernel headers, for example:

sudo apt-get install linux-headers-$(uname -r)
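Before installing, it can help to check whether headers for the running kernel are already present. The following is only a sketch: it assumes the usual Debian/Ubuntu layout /usr/src/linux-headers-<release>, and the function name kernel_headers_present is hypothetical.

```shell
# Sketch: report whether headers for a given kernel release are installed.
# Assumes the Debian/Ubuntu layout /usr/src/linux-headers-<release>.
kernel_headers_present() {
  release="$1"
  src_root="${2:-/usr/src}"   # second argument exists only to ease testing
  [ -d "$src_root/linux-headers-$release" ]
}

if kernel_headers_present "$(uname -r)"; then
  echo "headers for $(uname -r) are installed"
else
  echo "headers for $(uname -r) are missing; install linux-headers-$(uname -r)"
fi
```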

The steps are (they might need sudo):

apt-get update && apt-get -y dist-upgrade && reboot
wget http://repo.arbutus.cloud.computecanada.ca/pulp/deb/ubuntu22/pool/main/arbutus-cloud-repo_0.2_all.deb
dpkg -i arbutus-cloud-repo_0.2_all.deb
apt-get update && apt-get -y install nvidia-vgpu-kmod nvidia-vgpu-tools nvidia-vgpu-gridd

The last step in the Compute Canada guide installs the following vGPU packages, for which I could not find detailed documentation:

nvidia-vgpu-kmod: Nvidia vGPU Kernel modules.
nvidia-vgpu-tools: Nvidia tools to communicate with the driver.
nvidia-vgpu-gridd: Nvidia-gridd to request a license from the license server.
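To confirm that the three packages actually landed, one option is to ask dpkg for their status. This is a minimal sketch assuming a dpkg-based system; the helper name pkg_installed is hypothetical.

```shell
# Succeeds if dpkg reports the named package as fully installed.
pkg_installed() {
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'install ok installed'
}

# Report the status of each vGPU package from the list above.
for pkg in nvidia-vgpu-kmod nvidia-vgpu-tools nvidia-vgpu-gridd; do
  if pkg_installed "$pkg"; then
    echo "$pkg: installed"
  else
    echo "$pkg: not installed"
  fi
done
```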

After these steps, however, nvidia-smi reports that this new VM is running on

GRID V100D-16C
Driver Version: 470.239.06
CUDA Version: 11.4

and the kernel driver can successfully communicate with the Nvidia physical GPU.
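For later comparisons (for example, after a driver upgrade), the same information can be read in machine-readable form with nvidia-smi's --query-gpu option. The sketch below parses one CSV line of that output; the function name parse_gpu_line and the "name|driver" output format are my own choices for illustration.

```shell
# Parse one CSV line in the form produced by:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# and print it as "name|driver".
parse_gpu_line() {
  line="$1"
  name="${line%%,*}"      # everything before the first comma
  driver="${line#*,}"     # everything after the first comma
  driver="${driver# }"    # drop the space nvidia-smi puts after the comma
  printf '%s|%s\n' "$name" "$driver"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader |
    while IFS= read -r line; do parse_gpu_line "$line"; done
else
  echo "nvidia-smi not found; the driver packages may not be installed" >&2
fi
```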

Problems found:

nvidia-container-runtime cannot be installed with the command on the wiki (sudo apt install nvidia-container-runtime); it fails with the error message: E: Unable to locate package nvidia-container-runtime. The website documentation says that nvidia-container-runtime has been superseded by the NVIDIA Container Toolkit, so I followed the official guide here and ran:
```
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```
Docker still has to be configured afterwards, and I'm not sure what problems I will run into then. Update: run sudo apt install nvidia-container-runtime after the previous steps.
The wiki says to add a new deploy key for the new VM at https://github.com/DDMAL/rodan-docker/settings, but the key should be added to the current rodan repository instead of the old, archived rodan-docker (confirmed with @timothydereuse).
The ssh config line IdentityFile ~/. should instead be IdentityFile ~/.ssh/rodan-docker, pointing to the private key.
Where should I clone the rodan repo on this VM? I have also mounted the old rodan snapshot volume. (Solved: srv/webapp/Rodan)
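For the Docker configuration mentioned in the first item, the NVIDIA Container Toolkit documentation describes registering the runtime with nvidia-ctk and then restarting the daemon. A minimal sketch of that step follows, assuming the toolkit was installed as above and Docker runs under systemd. The helper names has_nvidia_runtime and configure_docker_for_gpu are hypothetical, and /etc/docker/daemon.json is the default config location, which may differ on this VM.

```shell
# Succeeds if the given Docker daemon config file mentions an "nvidia" runtime.
has_nvidia_runtime() {
  [ -f "$1" ] && grep -q '"nvidia"' "$1"
}

# Register the NVIDIA runtime with Docker and restart the daemon.
# Not called automatically: it needs root and changes system state.
configure_docker_for_gpu() {
  sudo nvidia-ctk runtime configure --runtime=docker &&
    sudo systemctl restart docker
}

# Example usage on the VM:
#   has_nvidia_runtime /etc/docker/daemon.json || configure_docker_for_gpu
#   sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

The final docker run line is the smoke test from the same documentation; whether Rodan's own containers need further settings is not covered here.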

homework36 commented 1 month ago

Update on 2024-05-14: I just realized that this new (and smaller) GPU flavor is actually a vGPU instead of a GPU! GPU VMs have the flavors

g2-c24-112gb-500
g1-c14-56gb-500
g1-c14-56gb

and these VMs run on Debian 10, which explains why I could not upgrade directly to a newer Ubuntu and why the previous commands for the GPU drivers failed. The smaller g1 flavors (g1-8gb-c4-22gb and g1-16gb-c8-40gb) all use vGPUs. I'm not sure whether this change from a GPU VM to a vGPU VM will lead to issues. Currently, the GPU celery container cannot run at all with the make docker swarm command. I tried to edit the environment variables in gpu-celery/Dockerfile accordingly, but I cannot find the CUDNN version. (Other containers now have problems, too.) I'm not sure whether we were aware of this difference when we decided to switch to this VM flavor, but more work has to be done now.
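One possible way to find the cuDNN version is to read the version macros from the cuDNN header wherever cuDNN is installed (most likely inside the gpu-celery image rather than on the host, which is an assumption on my part). cuDNN 8 and later define CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL in cudnn_version.h; older releases keep them in cudnn.h. The function name and the two candidate paths below are hypothetical.

```shell
# Print "major.minor.patch" from the version macros in a cuDNN header.
# Returns non-zero if the file is missing or has no CUDNN_MAJOR macro.
cudnn_version_from_header() {
  header="$1"
  [ -f "$header" ] || return 1
  major=$(awk '$1 == "#define" && $2 == "CUDNN_MAJOR" {print $3}' "$header")
  minor=$(awk '$1 == "#define" && $2 == "CUDNN_MINOR" {print $3}' "$header")
  patch=$(awk '$1 == "#define" && $2 == "CUDNN_PATCHLEVEL" {print $3}' "$header")
  [ -n "$major" ] || return 1
  printf '%s.%s.%s\n' "$major" "$minor" "$patch"
}

# Candidate locations are guesses; adjust them to the image being inspected.
for h in /usr/include/cudnn_version.h /usr/local/cuda/include/cudnn_version.h; do
  if [ -f "$h" ]; then
    cudnn_version_from_header "$h" || echo "no version macros found in $h"
  fi
done
```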

fujinaga commented 1 month ago

All I know is that we were never able to use vGPUs before, although we've tried in the past. Are there no VMs that use GPU (not vGPU) and Ubuntu that we can use?

homework36 commented 1 month ago

> All I know is that we were never able to use vGPUs before, although we've tried in the past. Are there no VMs that use GPU (not vGPU) and Ubuntu that we can use?

Good morning Ich, I just reviewed the email that Wanyi and I wrote in April, and g1-c14-56gb (GPU) still seems possible with our lowered RAM allocation. I just deleted the VM with Ubuntu 18.04 and vGPU and created a new one with Ubuntu 22.04 and 1 GPU (g1-c14-56gb), which is the same flavor as the staging Rodan. I will try to deploy Rodan on this new VM first. (I think that was actually the original plan, but somehow there was a misunderstanding about the flavors.)

homework36 commented 1 month ago

@timothydereuse mentions that "some of the older GPU instances are being deprecated soon, and the people at Compute Canada asked us to move over to a vGPU-based instance" in the email. We cannot keep both new VMs (with vGPU and GPU) after the end of this month. I can't get either of them to work right now, but maybe we have to decide which one we want to keep.

fujinaga commented 1 month ago

Do you want to try making the vGPU version work? Thinking long term, this may be the better solution.

homework36 commented 1 month ago

> Do you want to try making the vGPU version work? Thinking long term, this may be the better solution.

With the newer GPU driver, etc., both VMs seem to have the same issue (rodan-main cannot launch properly). I'll work more on the vGPU VM for a few days and see.

homework36 commented 1 month ago

New prod rodan server is now working with vGPU.

DDMAL / Rodan

[rodan migration] set up (v)GPU drivers and configure docker environment #1140