homework36 commented 1 month ago

follow up of #1145 I managed to launch rodan, but with docker compose instead of swarm. Here is what's be done:

launch instance (prod_Rodan_slim) with new vGPU flavor and set up drivers etc as in #1140
clone the rodan git repo to ../srv/webapps/
modified scripts/production.env
modified rodan-client/config/configuration.json
used the same production.yml by running docker-compose -f production.yml up
modified IP for nginx conf with ansible (my ansible is still broken so @dchiller did it manually)

I'm able to log in with my old account (and I went through "forget pwd"). I tried one of the workflows and finished it without problem. It seems that we will be able to use rodan2.simssa.ca at least temporarily. However, since it is not swarm, if anything happens, someone (probably I myself) would need to manually restart containers on the VM.

Remaining issues:

figure out why docker swarm cannot work
1145 lists some error messages in swarm. The main problem is some containers get timeout errors waiting for others. For example, nginx container gets timeout for iipsrv. However, in compose up mode, when starting up, rodan-main container exits with timeout error and containers depending on rodan-main get unhealthy checks, but after all other containers are up, I restarted rodan-main which solves the problem. _Currently I'm having another VM (prod_Rodan_u22) with the exact same flavor and settings to test for swarm, but it has to be closed at the end of this month when we no longer have extra RAM._ This instance has been deleted and we can no longer launch a new one with the same flavor at this point. Waiting to hear from Compute Canada.
we still don't know if the new GPU (vGPU, actually) drivers will lead to any issues.
fix ansible (will be a separate issue)

Update (23/05):

It is expected that you do not have your old data and workflows any more...
Tested with workflow and folios in #1124 and it failed as expected.
I personally noticed that jobs take longer time to finish.
I tested with Debian 12 and same docker and docker-compose version (as staging rodan) but docker swarm still fails with the same condition. Still not sure why it does not work.
To be able to use vGPU driver, which is not free open to public, the VM has to run only one of the following:

Ubuntu 20, 22
Debian 10, 11
Almalinux 8
CentOS 7

Debian 10 has no docker support, and Ubuntu 22 has docker network issues that I cannot solve quickly. So we are only left with Ubuntu 20 and Debian 11 to see if it's possible to run Rodan with docker swarm and use GPU. I haven't tried Ubuntu 20. Now I'm testing on Debian 11.

So far, we have tested Ubuntu and Debian for docker. I don't think we'll ever use Almalinux or CentOS.

I also speculate that since Compute Canada recommends migration from GPU VMs to vGPU VMs and this flavor is the larger of this kind, we have to frequently try to launch, and if we are lucky enough (like previously), we will be able to launch a new one.

homework36 commented 1 month ago

Based on service logs, rodan_postgres is not working properly and therefore rodan_main cannot port from it and all other services (containers in docker compose) depending on rodan_main cannot work, including nginx and other celery services. The message is

no suitable node (insufficient resources on 1 node)

I think it is because we now have a smaller server, and the original production.yml requires

Total CPU Reservation: 0.5 + 1 + 6 + 6 + 2 + 1 + 4 + 1 = 21.5 CPUs
Total Memory Reservation: 1G + 4G + 12G + 12G + 45G + 2G + 10G + 2G = 88G

and we used to have 24 CPUs and 112G RAM for the rodan2 instance but now we are going to smaller VMs. However, this new VM with Debian 12 is indeed a small one. Since we cannot open a new VM with the same flavor, we will have to test on a VM with a different flavor. The plan is to adjust CPU and memory reservation and limit and try again.

Update: lowering service resources seems to work. With this "tiny" VM (4 vCPUs and 22G RAM), I am able to get this:

I will test and see the best resource distribution for the existing VM on a larger VM (different from the current running prod server.)

1145 can now be closed. It looks like OS and docker version have nothing to do with the docker swarm failure we had before.

rodan_gpu-celery does require a GPU. So, it is highly possible that swarm is not able to launch this service because the head node cannot find an available GPU. (Debian 12 cannot install the vGPU driver from Arbutus, so I will test on a different server later.)

homework36 commented 1 month ago

Other than automatically terminating and restarting services (containers), another difference I found out between the two docker modes is docker swarm checks resource conditions for each service while docker compose doesn't. This is why I can do docker compose but not docker swarm for the same production.yml. It is not a problem due to missing connections, etc., but that docker swarm does not start the service if the resource condition is not met. (Lesson learned: do not spend days reading container logs and trying to debug TCP ports.)

homework36 commented 1 month ago

Set up nvidia runtime for docker following guide here. Prereq: (1) NVIDIA Container Toolkit; (2) Docker. Steps:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
run docker info and verify docker runtime has nvidia
```
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
```
Warning: steps from here are based on practice as there's no related official guide.
in /etc/docker/daemon.json make sure it has full path like
```
"path": "/usr/bin/nvidia-container-runtime"
```

restart daemon and docker

systemctl daemon-reload
systemctl restart docker

make sure swarm node has the correct label

docker node ls
docker node inspect [node id] --format "{{.Spec.Labels}}"
docker node update --label-add queue=GPU [node id]

Conclusion: GPU and vGPU work the same for docker as long as NVIDIA Container Toolkit is properly installed.

This whole procedure should be able to work for Debian 11. Might test for Ubuntu 20.

homework36 commented 1 month ago

Tested: ~~Ubuntu 22 is incompatible with GPU runtime and our docker compose yml.~~ Successfully set up GPU runtime for Ubuntu 22. For docker compose, need to restart py3-celery and gpu-celery manually. Docker swarm still has redis timeout network error.

Ubuntu 20 is prone to the same type of errors. There's no difference between Ubuntu 20 and 22. Based on this we can keep debugging for the larger vGPU running Ubuntu 22 since we cannot launch a new one at all.

DDMAL / Rodan

[new Rodan prod] rodan2.simssa.ca is back but docker swarm doesn't work #1149

1145 can now be closed. It looks like OS and docker version have nothing to do with the docker swarm failure we had before.