Based on the service logs, rodan_postgres is not working properly, so rodan_main cannot connect to it, and all other services (containers in docker compose) that depend on rodan_main cannot work either, including nginx and the other celery services. The message is:
no suitable node (insufficient resources on 1 node)
I think this is because we now have a smaller server, while the original production.yml requires:
Total CPU Reservation: 0.5 + 1 + 6 + 6 + 2 + 1 + 4 + 1 = 21.5 CPUs
Total Memory Reservation: 1G + 4G + 12G + 12G + 45G + 2G + 10G + 2G = 88G
We used to have 24 CPUs and 112G RAM for the rodan2 instance, but we are now moving to smaller VMs, and this new Debian 12 VM is a particularly small one. Since we cannot open a new VM with the same flavor, we will have to test on a VM with a different flavor. The plan is to adjust the CPU and memory reservations and limits and try again.
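The totals above can be checked against a VM's capacity with a few lines of shell. This is only a minimal sketch: the per-service figures are copied from the two sums listed above, while the helper names `sum_values` and `fits` are made up for illustration and are not part of the Rodan repository.

```shell
#!/bin/sh
# Sum a space-separated list of numbers (awk handles the fractional 0.5).
sum_values() {
    echo "$1" | tr ' ' '\n' | awk '{ s += $1 } END { print s }'
}

# Reservations as listed above for the original production.yml.
cpu_total=$(sum_values "0.5 1 6 6 2 1 4 1")
mem_total=$(sum_values "1 4 12 12 45 2 10 2")

# Report whether a VM with the given CPU count and RAM (in G) can hold
# all reservations at once. Hypothetical helper, for illustration only.
fits() {
    awk -v c="$cpu_total" -v m="$mem_total" -v vc="$1" -v vm="$2" \
        'BEGIN { if (c <= vc && m <= vm) print "fits"; else print "does not fit" }'
}

echo "reserved: ${cpu_total} CPUs, ${mem_total}G"
echo "old rodan2 instance (24 CPUs, 112G): $(fits 24 112)"
echo "tiny VM (4 vCPUs, 22G): $(fits 4 22)"
```

The same check can be rerun with lowered reservation values to see whether a candidate distribution fits a given flavor before deploying it.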
Update: lowering service resources seems to work. With this "tiny" VM (4 vCPUs and 22G RAM), I am able to get this:
Next, I will test on a larger VM (different from the currently running prod server) to find the best resource distribution for the existing VM.
rodan_gpu-celery does require a GPU, so it is quite possible that swarm cannot launch this service because the head node cannot find an available GPU. (Debian 12 cannot install the vGPU driver from Arbutus, so I will test on a different server later.)
Other than automatically terminating and restarting services (containers), another difference I found between the two docker modes is that docker swarm checks the resource conditions for each service while docker compose doesn't. This is why I can run docker compose but not docker swarm with the same production.yml. The problem is not missing connections, etc.; docker swarm simply does not start a service if its resource conditions are not met. (Lesson learned: do not spend days reading container logs and trying to debug TCP ports.)
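Instead of reading container logs, the scheduler's decision can be inspected directly. The following is a rough sketch, assuming it is run on a swarm manager node; the function names are hypothetical, and `rodan_postgres` is simply the service name used earlier in this thread.

```shell
#!/bin/sh
# Print the task history of a swarm service with untruncated error text,
# which is where "no suitable node (insufficient resources on 1 node)" appears.
why_pending() {
    docker service ps --no-trunc "$1"
}

# Print what this node advertises to the scheduler: "<NanoCPUs> <MemoryBytes>".
node_capacity() {
    docker node inspect self \
        --format '{{.Description.Resources.NanoCPUs}} {{.Description.Resources.MemoryBytes}}'
}

# Convert that pair into CPUs and GiB for easier reading.
to_human() {
    awk '{ printf "%.1f CPUs, %.1f GiB\n", $1 / 1e9, $2 / (1024 * 1024 * 1024) }'
}

# Usage on a swarm manager:
#   why_pending rodan_postgres
#   node_capacity | to_human
```

Comparing the node capacity with the sum of the reservations in production.yml shows quickly whether a service is being held back by the scheduler rather than by a network problem.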
Set up nvidia runtime for docker following guide here. Prereq: (1) NVIDIA Container Toolkit; (2) Docker. Steps:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
docker info
Then verify that the docker runtimes include nvidia and that it is the default:
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
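The verification step can also be scripted. This is a small sketch that only parses the text printed by docker info, as shown above; the function name `has_nvidia_default` is made up for illustration.

```shell
#!/bin/sh
# Succeed only if the given `docker info` text reports nvidia as the
# default runtime (matches a line such as "Default Runtime: nvidia").
has_nvidia_default() {
    echo "$1" | grep -Eq '^[[:space:]]*Default Runtime:[[:space:]]*nvidia[[:space:]]*$'
}

# Usage on a host with docker installed:
#   has_nvidia_default "$(docker info 2>/dev/null)" && echo "nvidia is the default runtime"
```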
Warning: steps from here are based on practice as there's no related official guide.
In /etc/docker/daemon.json, make sure the nvidia runtime has a full path, like:
"path": "/usr/bin/nvidia-container-runtime"
systemctl daemon-reload
systemctl restart docker
docker node ls
docker node inspect [node id] --format "{{.Spec.Labels}}"
docker node update --label-add queue=GPU [node id]
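The daemon.json check in the steps above can be done with a one-line test before restarting docker. This is a sketch under the assumption that the file contains a single "path" entry (the one for the nvidia runtime); the function name is hypothetical.

```shell
#!/bin/sh
# Succeed if the "path" value in the given daemon.json is absolute,
# e.g. "path": "/usr/bin/nvidia-container-runtime".
# Assumes the nvidia runtime is the only entry with a "path" key.
runtime_path_is_absolute() {
    grep -Eq '"path"[[:space:]]*:[[:space:]]*"/' "$1"
}

# Usage:
#   runtime_path_is_absolute /etc/docker/daemon.json || echo "set a full path first"
```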
Conclusion: GPU and vGPU work the same for docker as long as NVIDIA Container Toolkit is properly installed.
This whole procedure should also work on Debian 11. I might test it on Ubuntu 20.
Tested: Ubuntu 22 is incompatible with GPU runtime and our docker compose yml. Successfully set up GPU runtime for Ubuntu 22. For docker compose, py3-celery and gpu-celery need to be restarted manually. Docker swarm still has a redis timeout network error.
Ubuntu 20 is prone to the same type of errors. There's no difference between Ubuntu 20 and 22. Based on this we can keep debugging for the larger vGPU running Ubuntu 22 since we cannot launch a new one at all.
Follow-up of #1145: I managed to launch rodan, but with docker compose instead of swarm. Here is what has been done:
../srv/webapps/
scripts/production.env
rodan-client/config/configuration.json
docker-compose -f production.yml up
I'm able to log in with my old account (and I went through "forget pwd"). I tried one of the workflows and it finished without problems. It seems that we will be able to use rodan2.simssa.ca, at least temporarily. However, since it is not swarm, if anything happens, someone (probably me) will need to manually restart containers on the VM.
Remaining issues:
#1145 lists some error messages in swarm. The main problem is that some containers get timeout errors while waiting for others. For example, the nginx container gets a timeout for iipsrv. However, in compose up mode, the rodan-main container exits with a timeout error on startup and the containers depending on rodan-main fail their health checks, but after all other containers were up, I restarted rodan-main, which solved the problem.
_Currently I have another VM (prod_Rodan_u22) with the exact same flavor and settings to test for swarm, but it has to be closed at the end of this month when we no longer have extra RAM._ This instance has been deleted and we can no longer launch a new one with the same flavor at this point. Waiting to hear from Compute Canada.
Update (23/05):
Debian 10 has no docker support, and Ubuntu 22 has docker network issues that I cannot solve quickly. So we are only left with Ubuntu 20 and Debian 11 to see if it's possible to run Rodan with docker swarm and use GPU. I haven't tried Ubuntu 20. Now I'm testing on Debian 11.
So far, we have tested Ubuntu and Debian for docker. I don't think we'll ever use Almalinux or CentOS.