aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
411 stars 184 forks source link

Devops: Fix the Docker builds #6445

Closed sphuber closed 3 weeks ago

sphuber commented 3 weeks ago

In a recent commit, the aiida-prepare.sh startup script was updated to switch from the deprecated verdi quicksetup to verdi presto. This left the builds broken since there were a few bugs in the implementation of verdi presto, but also because the behavior of the commands is not identical:

sphuber commented 3 weeks ago

@danielhollas It was actually #6432 that broke the Docker builds.

There is a number of problems we need to tackle:

sphuber commented 3 weeks ago

The real problem seems to be that the aiida-prepare.sh init script is being running before the RMQ service is up. In the aiida-core-with-services case, the RMQ init is called before, but it just issues the startup command but doesn't explicitly wait for the service to be up and it continues straight to running aiida-prepare.sh. This then calls verdi presto but RMQ is still starting up and the test connection fails, causing verdi presto to configure the profile without broker.

For the aiida-core-base case, probably the health check for rabbitmq was incorrect. We switch to rabbitmq-diagnostics -q ping which should hopefully do the trick.

sphuber commented 3 weeks ago

@unkcpz seems to have worked except with the healthcheck for aiida-core-base

unkcpz commented 3 weeks ago

But seems the problem of aiida-core-base is from postgresql (the change with sleep 10 was made in base thus should take effect as well). I start with docker compose and create the profile with quicksetup, it says the postgresql connection not found.

unkcpz commented 3 weeks ago

Instead of using sleep 10, the standard way s6 handle the health check would be https://skarnet.org/software/s6/s6-svstat.html, not sure how to set the command, require some experiments.

sphuber commented 3 weeks ago

But seems the problem of aiida-core-base is from postgresql (the change with sleep 10 was made in base thus should take effect as well). I start with docker compose and create the profile with quicksetup, it says the postgresql connection not found.

Huh 🤔 that's weird, because it used to work before I switched to verdi presto. So why is the PSQL service not running anymore? Did it give a precise error message?

unkcpz commented 3 weeks ago

Did it give a precise error message?

It just says "Unable to autodetact postgres setup". I am sure the container is running. I'll test with old image see if it the problem from the aiida-core-base image.

unkcpz commented 3 weeks ago

I'll test with old image see if it the problem from the aiida-core-base image.

Pretty sure it is related to the new image, it cannot autodetect the postgresql. I run with old base image and all success.

unkcpz commented 3 weeks ago

To reproduce the problem, build the images with docker buildx bake and run docker compose with:

For old working container:

REGISTRY=ghcr.io/ TAG=":2.5.1" docker compose -f docker-compose.aiida-core-base.yml up

For newly baked container:

REGISTRY=ghcr.io/ TAG=":latest" docker compose -f docker-compose.aiida-core-base.yml up
unkcpz commented 3 weeks ago

I comment out the --use-postgres and the profile get setup. So I guess the autodetect behavior was changed from last release?

danielhollas commented 3 weeks ago

Huh 🤔 that's weird, because it used to work before I switched to verdi presto. So why is the PSQL service not running anymore? Did it give a precise error message?

@sphuber I believe you deleted some yaml configuration for psql in that PR, perhaps it is needed after all?

sphuber commented 3 weeks ago

I comment out the --use-postgres and the profile get setup. So I guess the autodetect behavior was changed from last release?

That makes sense, without the --use-postgres flag verdi presto won't even try to connect.

sphuber commented 3 weeks ago

@sphuber I believe you deleted some yaml configuration for psql in that PR, perhaps it is needed after all?

That is probably the issue here. It doesn't need all of it, but since it cannot detect the server, there is probably a problem with the default hostname, username and password that it is using to try to connect.

sphuber commented 3 weeks ago

Alright @unkcpz I finally figured out the various problems. I have cleaned up the changes and updated the OP with a description.