DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

[new Rodan prod] rodan-main fails and celery containers won't start #1145

Closed by homework36 1 month ago

homework36 commented 1 month ago

At first I thought this was an Nginx problem, as in #1142. However, starting Nginx manually inside the container gave [emerg] host not found in upstream, and the log showed wait-for-app: timeout occurred after waiting 15 seconds for iipsrv:9003. On checking again, I found that rodan-main fails after several minutes (even when I set the Docker container to idle), and the containers for Celery jobs do not launch at all. This happens on both new VMs (with GPU and vGPU). My guess is that Celery and Nginx both depend on rodan-main, which is not working.

The docker logs for rodan-main show that the container stops at the last line below (*):

wait-for-app: waiting 15 seconds for postgres:5432
wait-for-app: postgres:5432 is available after 0 seconds
wait-for-app: waiting 15 seconds for redis:6379
wait-for-app: timeout occurred after waiting 15 seconds for redis:6379

Maybe it is related to the new OS and GPU, but I'm not sure. Need to investigate further and figure out the problem.
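To narrow this down, it may help to tell apart a name that does not resolve (which would match the Nginx host not found in upstream error) from a host that resolves but does not accept connections. The following is a minimal sketch of such a check, meant to be run from inside the failing container. It is not the actual wait-for-app implementation; the function name probe and its parameters are hypothetical, and the 15-second default only mirrors the timeout seen in the logs.

```python
import socket
import time


def probe(host, port, timeout=15.0, interval=1.0):
    """Try to open a TCP connection to host:port until `timeout` seconds pass.

    Returns (status, waited_seconds), where status is one of:
      "available"   - a connection was established
      "dns_failure" - the last attempt failed because the name did not resolve
      "unreachable" - the last attempt resolved but could not connect
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        try:
            # create_connection resolves the name and connects in one call.
            with socket.create_connection((host, port), timeout=interval):
                return "available", time.monotonic() - start
        except socket.gaierror:
            # Name resolution failed (gaierror is a subclass of OSError,
            # so it must be caught first).
            status = "dns_failure"
        except OSError:
            # Refused, timed out, or otherwise unreachable.
            status = "unreachable"
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return status, time.monotonic() - start
        time.sleep(min(interval, remaining))
```

Calling probe("redis", 6379) or probe("iipsrv", 9003) inside rodan-main would then indicate whether the service names resolve on the container network at all, or whether they resolve and the connection itself fails.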

Update: after updating all environment variables, rodan-python3-celery did launch in rare cases, but then terminated with the following log message:

+ mkdir -p /var/www
+ mkdir -p /code/Rodan/staticfiles
+ chmod -R a+rwx /rodan
+ chmod a+rwx /var
+ chmod a+rwx /code/Rodan/AUTHORS /code/Rodan/LICENSE /code/Rodan/__init__.py /code/Rodan/_clean_database.sh /code/Rodan/helper_scripts /code/Rodan/manage.py /code/Rodan/poetry.lock /code/Rodan/pyproject.toml /code/Rodan/readme.md /code/Rodan/requirements.txt /code/Rodan/rodan /code/Rodan/staticfiles /code/Rodan/websocket.ini
+ trap _term SIGTERM
+ cd /code/Rodan
+ /run/wait-for-app postgres:5432
wait-for-app: waiting 15 seconds for postgres:5432
wait-for-app: timeout occurred after waiting 15 seconds for postgres:5432

and rodan-main shows this error message:

wait-for-app: waiting 15 seconds for postgres:5432
wait-for-app: timeout occurred after waiting 15 seconds for postgres:5432
wait-for-app: waiting 15 seconds for redis:6379
wait-for-app: timeout occurred after waiting 15 seconds for redis:6379
+ mkdir -p /var/www
+ mkdir -p /code/Rodan/staticfiles
+ chmod -R a+rwx /rodan
+ chmod a+rwx /var
+ chmod a+rwx /code/Rodan/AUTHORS /code/Rodan/LICENSE /code/Rodan/__init__.py /code/Rodan/_clean_database.sh /code/Rodan/helper_scripts /code/Rodan/manage.py /code/Rodan/poetry.lock /code/Rodan/pyproject.toml /code/Rodan/readme.md /code/Rodan/requirements.txt /code/Rodan/rodan /code/Rodan/staticfiles /code/Rodan/websocket.ini
+ trap _term SIGTERM
+ cd /code/Rodan
+ /run/wait-for-app postgres:5432
wait-for-app: waiting 15 seconds for postgres:5432
wait-for-app: timeout occurred after waiting 15 seconds for postgres:5432

However, postgres-plpython is healthy and produces the expected output. After a restart, rodan-main gives the same log as above (*).
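Because the same dependency can be reported as available in one run and as timed out in the next, it can be handy to reduce each docker logs dump to the last known state per dependency. The sketch below assumes only the two wait-for-app message formats that appear in the logs above; the function name summarize is hypothetical.

```python
import re

# Matches the two wait-for-app outcomes seen in the logs:
#   "wait-for-app: postgres:5432 is available after 0 seconds"
#   "wait-for-app: timeout occurred after waiting 15 seconds for redis:6379"
OUTCOME = re.compile(
    r"wait-for-app: (?:"
    r"(?P<ok>\S+:\d+) is available after \d+ seconds"
    r"|timeout occurred after waiting \d+ seconds for (?P<failed>\S+:\d+)"
    r")"
)


def summarize(log_text):
    """Map each host:port to its most recent outcome in the log."""
    status = {}
    for line in log_text.splitlines():
        match = OUTCOME.search(line)
        if match is None:
            continue  # "waiting ..." lines and shell trace lines are ignored
        if match.group("ok"):
            status[match.group("ok")] = "available"
        else:
            status[match.group("failed")] = "timeout"
    return status
```

Applied to the first rodan-main log (*), this yields postgres:5432 as available and redis:6379 as timeout; applied to the later logs, postgres:5432 shows up as timeout as well.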

homework36 commented 1 month ago

I'm able to launch with docker-compose -f production.yml up -d, and all containers are now "healthy", but rodan2.simssa.ca is still not accessible. The default command (swarm mode), docker stack deploy --with-registry-auth -c production.yml rodan, still leads to failing containers. I checked but still get 502 Bad Gateway:

See updates in #1149.