DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

production server with multiple instances #1184

Closed homework36 closed 2 months ago

homework36 commented 3 months ago

Updated on July 10, 2024

Option 1 needs 8 vCPUs and 40 GB RAM, while option 2 needs 20 vCPUs and 38 GB RAM. Since we are mainly short on RAM after the extension, option 2 gives us more vCPUs to improve the performance of the other non-GPU containers while using slightly less RAM. We can also use the remaining 2 GB of RAM for a separate data-storage instance.

Based on my experiments, we only need to deploy (and later update) the stack on the instance with the manager node; Docker Swarm will handle the rest as long as we correctly join the network and label the worker node, as sketched below.
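For reference, here is a minimal sketch of that setup (the IPs, the worker hostname, the label key, and the stack name "rodan" are placeholders, not necessarily the exact values we use):

# On the manager instance: initialize the swarm and print the worker join token
docker swarm init --advertise-addr <manager private IP>
docker swarm join-token worker

# On the worker (vGPU) instance: join the swarm with the printed token
docker swarm join --token <token> <manager private IP>:2377

# Back on the manager: label the worker so GPU work can be constrained to it
docker node update --label-add queue=GPU <worker hostname>

# Deploy (and later update) the stack from the manager only
docker stack deploy -c production.yml rodan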

The only trouble I have encountered so far is that the worker instance running the GPU container needs to access data stored on the non-GPU instance. I believe this is possible; one common practice is using NFS, which I will try this week and report back.

Update on Jun 27, 2024: We already have one g1-16gb-c8-40gb vGPU instance, but with the wrong OS (after many trials, Ubuntu 22.04 turns out to have broken DNS resolution in Docker Swarm, so we always get Redis timeout errors). Since it will be extremely difficult to launch a new vGPU instance of this flavor, and it was purely luck that we managed to launch this one, I was hoping to make use of this server. However, it cannot work as a worker node, nor can it be rebuilt with the correct OS.

openstack server rebuild --image 484b2b0c-a9ba-4d8e-b966-5735b5a6f8dc test_rebuild
Attempting to rebuild a volume-backed server using --os-compute-api-version 2.92 or earlier, which will only succeed if the image is identical to the one initially used. This will be an error in a future release.
Image 484b2b0c-a9ba-4d8e-b966-5735b5a6f8dc is unacceptable: Unable to rebuild with a different image for a volume-backed server. (HTTP 400) (Request-ID: req-742195ef-5643-4d96-9d9b-6a1c2977fc23)

As a result, we will just delete this instance because it turns out to be almost useless for us.

"c" designates "compute", "p" designates "persistent", and "g" designates "vGPU". "c" flavor is targeted towards CPU intensive tasks, while "p" flavor is geared towards web servers, data base servers and instances that have a lower CPU or bursty CPU usage profile in general.

However, "c" instances are expensive in RAM. If we go with one g1-8gb-c4-22gb for GPU worker instance, we have around 40GB left. We can only afford c8-30gb-288 among "c" flavors. However, we can still get p16 with RAM options for 16, 24, and 32 GB. I will try with p16-32gb first because we do not want to waste extra resources to prevent Compute Canada from down grading us for the next year.

We now have the rodan2 prod server back for all tasks except PACO training with the GPU. The distributed prod setup with one "p" instance and one vGPU instance still needs more testing before NFS can be deployed. There is currently a "broken pipe" error for rodan-main on the "p" instance.

homework36 commented 3 months ago

Update on July 3, 2024: This is with p16-24gb as the manager host on Debian 12. I have tested on a few instances, and it looks like we need Docker <= 26 to set up the swarm network successfully; the latest Docker 27 leads to errors here and there (screenshot: 2024-07-03 at 10:11 AM). To be able to point the URL at the manager instance's IP and run the GPU container on the vGPU instance, we need to use the following production.yml.

version: "3.4"

services:

  nginx:
    image: "ddmal/nginx:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.5"
          memory: 0.5G
        limits:
          cpus: "0.5"
          memory: 0.5G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "/usr/sbin/service", "nginx", "status"]
      interval: "30s"
      timeout: "10s"
      retries: 10
      start_period: "5m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      TLS: 1
    ports:
      - "80:80"
      - "443:443"
      - "5671:5671"
      - "9002:9002"
    volumes:
      - "resources:/rodan/data"

  rodan-main:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 3G
        limits:
          cpus: "1"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "/usr/bin/curl -H 'User-Agent: docker-healthcheck' http://localhost:8000/api/?format=json || exit 1"]
      interval: "30s"
      timeout: "30s"
      retries: 5
      start_period: "2m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: None
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  rodan-client:
    image: "ddmal/rodan-client:nightly"
    deploy:
      placement:
        constraints:
          - node.role == manager
    volumes:
        - "./rodan-client/config/configuration.json:/client/configuration.json"

  iipsrv:
    image: "ddmal/iipsrv:nightly"
    volumes:
      - "resources:/rodan/data"

  celery:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.8"
          memory: 2G
        limits:
          cpus: "0.8"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@celery", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      start_period: "1m"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: celery
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  py3-celery:
    image: "ddmal/rodan-python3-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 3G
        limits:
          cpus: "3"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@Python3", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: Python3
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  gpu-celery:
    image: "ddmal/rodan-gpu-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 18G
        limits:
          cpus: "3"
          memory: 18G
      placement:
        constraints:
          - node.role == worker
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: GPU
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  redis:
    image: "redis:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto

  postgres:
    image: "ddmal/postgres-plpython:v3.0.0"
    deploy:
      replicas: 1
      endpoint_mode: dnsrr
      resources:
        reservations:
          cpus: "2"
          memory: 2G
        limits:
          cpus: "2"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto
    volumes:
      - "pg_data:/var/lib/postgresql/data"
      - "pg_backup:/backups"
    env_file:
      - ./scripts/production.env

  rabbitmq:
    image: "rabbitmq:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 4G
        limits:
          cpus: "1"
          memory: 4G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: "30s"
      timeout: "3s"
      retries: 3
    environment:
      TZ: America/Toronto
    env_file:
      - ./scripts/production.env

volumes:
  resources:
  pg_backup:
  pg_data:

Essentially we have only gpu-celery on the worker instance. We cannot specify a deploy constraint for iipsrv, so it might be deployed on the worker instance as well. All other containers should be on the manager node, which has sufficient vCPUs and RAM. This server is up at rodan.simssa.ca. The login popup window on the homepage seems to be gone as well, given the larger CPU and memory allocations for each container. Do not use Docker version >= 27; only use Docker <= 26.
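To keep a node on Docker 26, one option is to pin the package explicitly (a sketch assuming the standard docker-ce apt repository on Debian 12; the exact version string will differ):

# List the 26.x builds available from the configured apt repository
apt-cache madison docker-ce

# Install a specific 26.x build and hold it so apt does not upgrade to 27
sudo apt-get install docker-ce=<26.x version string> docker-ce-cli=<26.x version string> containerd.io
sudo apt-mark hold docker-ce docker-ce-cli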

Remaining issues:

homework36 commented 2 months ago

The remaining issue is in #1181.

To set up NFS, I followed this guide and did everything except ufw (the firewall), which we might set up later if needed. I directly mounted /var/lib/docker/volumes/ on both instances and restarted the GPU container. This is what's inside /etc/exports on the manager instance:

/var/lib/docker/volumes [worker node public ip](rw,sync,no_subtree_check,no_root_squash)

(Need sudo systemctl restart nfs-kernel-server after editing.)
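For completeness, the surrounding steps were roughly the following (a sketch based on the guide above; the manager's IP is a placeholder):

# On the manager: re-export after editing /etc/exports
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

# On the worker: install the NFS client and mount the manager's Docker volumes directory
sudo apt-get install nfs-common
sudo mount -t nfs <manager IP>:/var/lib/docker/volumes /var/lib/docker/volumes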

homework36 commented 2 months ago

Good practice for detaching and deleting a worker instance (all commands run on the worker instance):

  1. leave the docker swarm network
    docker swarm leave --force
  2. unmount the nfs directory
    sudo umount /var/lib/docker/volumes/
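Not part of the original checklist, but once the worker has left, its stale entry can also be dropped from the swarm on the manager instance:

# On the manager: remove the departed node from the swarm node list
docker node rm <worker hostname>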