DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

[new Rodan prod] rodan2.simssa.ca is back but docker swarm doesn't work #1149

Closed · homework36 closed this issue 1 month ago

homework36 commented 1 month ago

Follow-up of #1145: I managed to launch Rodan, but with docker compose instead of swarm. Here is what was done:

  1. launched an instance (prod_Rodan_slim) with the new vGPU flavor and set up drivers etc. as in #1140
  2. cloned the rodan git repo to ../srv/webapps/
  3. modified scripts/production.env
  4. modified rodan-client/config/configuration.json
  5. used the same production.yml by running docker-compose -f production.yml up
  6. modified the IP in the nginx conf with Ansible (my Ansible is still broken, so @dchiller did it manually)

I'm able to log in with my old account (after going through the "forgot password" flow). I tried one of the workflows and finished it without problems. It seems we will be able to use rodan2.simssa.ca at least temporarily. However, since it is not swarm, if anything happens, someone (probably me) will need to manually restart the containers on the VM.

Remaining issues:

Update (23/05):

  1. It is expected that you no longer have your old data and workflows...
  2. Tested with the workflow and folios from #1124, and it failed as expected.
  3. I personally noticed that jobs take longer to finish.
  4. I tested with Debian 12 and the same docker and docker-compose versions (as staging Rodan), but docker swarm still fails in the same way. Still not sure why it does not work. (screenshot attached)
  5. To be able to use the vGPU driver, which is not freely available to the public, the VM has to run one of the following: Debian 10, Debian 11, Ubuntu 20, or Ubuntu 22.

Debian 10 has no docker support, and Ubuntu 22 has docker network issues that I cannot solve quickly, so we are left with only Ubuntu 20 and Debian 11 to see whether it's possible to run Rodan with docker swarm and use the GPU. I haven't tried Ubuntu 20 yet; I'm currently testing on Debian 11.

So far we have tested Ubuntu and Debian for docker. I don't think we'll ever use AlmaLinux or CentOS.

  6. I also speculate that, since Compute Canada recommends migrating from GPU VMs to vGPU VMs and this flavor is the larger of its kind, we will have to retry launching frequently, and if we are lucky (as before), we will be able to launch a new one.
homework36 commented 1 month ago

Based on the service logs, rodan_postgres is not working properly, so rodan_main cannot connect to it, and all the other services (containers in docker compose) that depend on rodan_main cannot work either, including nginx and the other celery services. The message is

no suitable node (insufficient resources on 1 node)

I think it is because we now have a smaller server, and the original production.yml requires:

Total CPU Reservation: 0.5 + 1 + 6 + 6 + 2 + 1 + 4 + 1 = 21.5 CPUs
Total Memory Reservation: 1G + 4G + 12G + 12G + 45G + 2G + 10G + 2G = 88G

and we used to have 24 CPUs and 112G of RAM for the rodan2 instance, but now we are moving to smaller VMs, and this new VM with Debian 12 is indeed a small one. Since we cannot launch a new VM with the same flavor, we will have to test on a VM with a different flavor. The plan is to adjust the CPU and memory reservations and limits and try again.
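For context, in a compose v3 file like production.yml these per-service numbers are declared under each service's deploy.resources key, which swarm checks at scheduling time (and which plain docker-compose does not enforce, as noted later in this thread). A minimal sketch, with a real service name but made-up values rather than the actual figures from the file:

```yaml
# Illustrative sketch only -- hypothetical values, not the real production.yml
# numbers. Swarm verifies that the reservation fits on some node before placing
# the task; if no node qualifies, it reports "no suitable node".
services:
  py3-celery:
    deploy:
      resources:
        reservations:   # counted toward the 21.5 CPU / 88G totals above
          cpus: "6"
          memory: 12G
        limits:         # hard caps once the container is running
          cpus: "8"
          memory: 16G
```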

Update: lowering service resources seems to work. With this "tiny" VM (4 vCPUs and 22G RAM), I am able to get this:

(screenshot attached)
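As a rough illustration of the kind of adjustment meant by "lowering service resources" (hypothetical values, chosen only so that the reservations summed over all services fit within 4 vCPUs and 22G of RAM; the real per-service numbers still need tuning):

```yaml
# Illustrative only: scaled-down reservations for the "tiny" VM. Reservations
# could also be dropped entirely, leaving just the limits.
services:
  py3-celery:
    deploy:
      resources:
        reservations:
          cpus: "0.5"
          memory: 2G
        limits:
          cpus: "2"
          memory: 6G
```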

I will experiment on a larger VM (separate from the currently running prod server) to find the best resource distribution for the existing VM.

#1145 can now be closed. It looks like the OS and docker version have nothing to do with the docker swarm failure we had before.

rodan_gpu-celery does require a GPU, so it is quite possible that swarm is unable to launch this service because the head node cannot find an available GPU. (Debian 12 cannot install the vGPU driver from Arbutus, so I will test on a different server later.)

homework36 commented 1 month ago

Other than automatically terminating and restarting services (containers), another difference I found between the two docker modes is that docker swarm checks the resource conditions for each service while docker compose doesn't. This is why I can run the same production.yml with docker compose but not with docker swarm. The problem is not missing connections, etc.; docker swarm simply does not start a service if its resource condition is not met. (Lesson learned: do not spend days reading container logs and trying to debug TCP ports.)

homework36 commented 1 month ago

Set up the nvidia runtime for docker following the guide here. Prerequisites: (1) NVIDIA Container Toolkit; (2) Docker. Steps:

  1. sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
  2. sudo systemctl restart docker
  3. run docker info and verify that the docker runtimes include nvidia:
    Runtimes: io.containerd.runc.v2 nvidia runc
    Default Runtime: nvidia

    Warning: the steps from here on are based on practice, as there is no related official guide.

  4. in /etc/docker/daemon.json, make sure the nvidia runtime entry uses the full path, e.g.
    "path": "/usr/bin/nvidia-container-runtime"
    (see the daemon.json sketch after this list)
  5. restart daemon and docker
    systemctl daemon-reload
    systemctl restart docker
  6. make sure the swarm node has the correct label:
    docker node ls
    docker node inspect [node id] --format "{{.Spec.Labels}}"
    docker node update --label-add queue=GPU [node id]
    (screenshot attached)
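For step 4, a minimal sketch of what /etc/docker/daemon.json typically looks like after nvidia-ctk has configured the runtime, with the runtime path made absolute as described above (the exact contents on the VM may differ):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}
```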

Conclusion: GPU and vGPU work the same for docker as long as the NVIDIA Container Toolkit is properly installed.
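The queue=GPU label from step 6 only has an effect if the stack file routes the GPU service to labelled nodes. A minimal sketch, assuming production.yml (or an equivalent override) pins the GPU service with a swarm placement constraint; the service name and exact constraint are assumptions, not confirmed from the file:

```yaml
# Sketch only -- assumes the GPU service is restricted to nodes labelled
# queue=GPU via a placement constraint; the actual production.yml may differ.
services:
  gpu-celery:
    deploy:
      placement:
        constraints:
          - node.labels.queue == GPU
```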

This whole procedure should also work for Debian 11. I might test it on Ubuntu 20.

homework36 commented 1 month ago

Tested Ubuntu 22: the GPU runtime was set up successfully, but it is still not fully compatible with our docker compose yml. With docker compose, py3-celery and gpu-celery need to be restarted manually, and docker swarm still has the redis timeout network error.

Ubuntu 20 is prone to the same type of errors; there is no difference between Ubuntu 20 and 22. Based on this, we can keep debugging on the larger vGPU VM running Ubuntu 22, since we cannot launch a new one at all.