aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
411 stars 184 forks source link

Using s6 service ready notification instead of sleep #6469

Open unkcpz opened 3 weeks ago

unkcpz commented 3 weeks ago

fixes #6448

For the container that includes the services, aiida prepare script is called as soon as the RabbitMQ startup script has been launched, but it can take a while for the RMQ service to come up. If verdi presto is called straight away it is possible it tries to connect to the service before that and it will configure the profile without a broker.

This PR using s6-notifyoncheck to do the health check every 500ms until the RMQ is ready. As suggested by https://github.com/just-containers/s6-overlay/issues/582. From my investigation, RMQ didn't provide the pre-active readiness notification to the file descriptor by the its daemon. Thus I use readiness polling.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 77.74%. Comparing base (ef60b66) to head (152cbba). Report is 47 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #6469 +/- ## ========================================== + Coverage 77.51% 77.74% +0.24% ========================================== Files 560 561 +1 Lines 41444 41752 +308 ========================================== + Hits 32120 32456 +336 + Misses 9324 9296 -28 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

sphuber commented 3 weeks ago

Thanks @unkcpz did you try to build and run locally? I could build just fine, but when I run

TAG=":latest" docker compose -f docker-compose.aiida-core-with-services.yml up

it seems to get stuck:

aiida-1  | selecting default max_connections ... 100
aiida-1  | selecting default shared_buffers ... 128MB
aiida-1  | selecting default time zone ... Europe/Zurich
aiida-1  | creating configuration files ... ok
aiida-1  | running bootstrap script ... ok
aiida-1  | performing post-bootstrap initialization ... 2024-06-07 15:06:49.743165+00:00 [warning] <0.130.0> Using RABBITMQ_ADVANCED_CONFIG_FILE: /opt/rabbitmq_server-3.10.18/etc/rabbitmq/advanced.config
aiida-1  | ok
aiida-1  | Error: unable to perform an operation on node 'rabbit@localhost'. Please see diagnostics information and suggestions below.
aiida-1  | 
aiida-1  | Most common reasons for this are:
aiida-1  | 
aiida-1  |  * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
aiida-1  |  * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
aiida-1  |  * Target node is not running
aiida-1  | 
aiida-1  | In addition to the diagnostics info below:
aiida-1  | 
aiida-1  |  * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
aiida-1  |  * Consult server logs on node rabbit@localhost
aiida-1  |  * If target node is configured to use long node names, don't forget to use --longnames with CLI tools
aiida-1  | 
aiida-1  | DIAGNOSTICS
aiida-1  | ===========
aiida-1  | 
aiida-1  | attempted to contact: [rabbit@localhost]
aiida-1  | 
aiida-1  | rabbit@localhost:
aiida-1  |   * connected to epmd (port 4369) on localhost
aiida-1  |   * epmd reports: node 'rabbit' not running at all
aiida-1  |                   no other nodes on localhost
aiida-1  |   * suggestion: start the node
aiida-1  | 
aiida-1  | Current node details:
aiida-1  |  * node name: 'rabbitmqcli-854-rabbit@localhost'
aiida-1  |  * effective user's home directory: /home/aiida
aiida-1  |  * Erlang cookie hash: bT3Sb8VwvV4uzQNrwGXRcQ==
aiida-1  | 
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | syncing data to disk ... 2024-06-07 15:06:51.599988+00:00 [info] <0.221.0> Feature flags: list of feature flags found:
aiida-1  | 2024-06-07 15:06:51.600028+00:00 [info] <0.221.0> Feature flags:   [ ] classic_mirrored_queue_version
aiida-1  | 2024-06-07 15:06:51.600045+00:00 [info] <0.221.0> Feature flags:   [ ] classic_queue_type_delivery_support
aiida-1  | 2024-06-07 15:06:51.600092+00:00 [info] <0.221.0> Feature flags:   [ ] implicit_default_bindings
aiida-1  | 2024-06-07 15:06:51.600110+00:00 [info] <0.221.0> Feature flags:   [ ] maintenance_mode_status
aiida-1  | 2024-06-07 15:06:51.600127+00:00 [info] <0.221.0> Feature flags:   [ ] quorum_queue
aiida-1  | 2024-06-07 15:06:51.600138+00:00 [info] <0.221.0> Feature flags:   [ ] stream_queue
aiida-1  | 2024-06-07 15:06:51.600198+00:00 [info] <0.221.0> Feature flags:   [ ] user_limits
aiida-1  | 2024-06-07 15:06:51.600209+00:00 [info] <0.221.0> Feature flags:   [ ] virtual_host_metadata
aiida-1  | 2024-06-07 15:06:51.600222+00:00 [info] <0.221.0> Feature flags: feature flag states written to disk: yes
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | 2024-06-07 15:06:51.976208+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped
aiida-1  | 2024-06-07 15:06:51.976270+00:00 [notice] <0.221.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
aiida-1  | 
aiida-1  |   ##  ##      RabbitMQ 3.10.18
aiida-1  |   ##  ##
aiida-1  |   ##########  Copyright (c) 2007-2023 VMware, Inc. or its affiliates.
aiida-1  |   ######  ##
aiida-1  |   ##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com
aiida-1  | 
aiida-1  |   Erlang:      24.2.1 [jit]
aiida-1  |   TLS Library: OpenSSL - OpenSSL 3.0.2 15 Mar 2022
aiida-1  |   Release series support status: supported
aiida-1  | 
aiida-1  |   Doc guides:  https://rabbitmq.com/documentation.html
aiida-1  |   Support:     https://rabbitmq.com/contact.html
aiida-1  |   Tutorials:   https://rabbitmq.com/getstarted.html
aiida-1  |   Monitoring:  https://rabbitmq.com/monitoring.html
aiida-1  | 
aiida-1  |   Logs: /home/aiida/.rabbitmq/log/rabbit@localhost.log
aiida-1  |         /home/aiida/.rabbitmq/log/rabbit@localhost_upgrade.log
aiida-1  |         <stdout>
aiida-1  | 
aiida-1  |   Config file(s): /opt/rabbitmq_server-3.10.18/etc/rabbitmq/advanced.config
aiida-1  | 
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  |   Starting broker...ok
aiida-1  | 
aiida-1  | 
aiida-1  | Success. You can now start the database server using:
aiida-1  | 
aiida-1  |     pg_ctl -D /home/aiida/.postgresql -l logfile start
aiida-1  | 
aiida-1  | initdb: warning: enabling "trust" authentication for local connections
aiida-1  | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
aiida-1  | s6-rc: info: service postgresql-init successfully started
aiida-1  | s6-rc: info: service postgresql: starting
aiida-1  | waiting for server to start.... done
aiida-1  | server started
aiida-1  | s6-rc: info: service postgresql successfully started
aiida-1  | Error:
aiida-1  | RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
aiida-1  | s6-rc: info: service postgresql-prepare successfully started
aiida-1  |  completed with 0 plugins.

but I am not sure if this is because I am doing something wrong

unkcpz commented 3 weeks ago

Yes, I tested. The reason you got stuck is that you run this image with docker compose. Docker compose firing all the containers but didn't know where you want to get into which container. You can run docker exec -i <container-id> bash to get into the container. What I usually run is by for with-services container is by docker run --rm -it aiidateam/aiida-core-with-services:latest bash. The bash is the entry point of the container.

sphuber commented 3 weeks ago

The reason you got stuck is that you run this image with docker compose. Docker compose firing all the containers but didn't know where you want to get into which container.

Huh, but the aiida-core-with-services.yml only defines the aiida container as the services are integrated. So why would it try to launch multiple containers as you say?

For what it is worth, I tried running:

docker run --rm -it aiidateam/aiida-core-with-services:latest bash

but it seems to get stuck with a problem in RabbitMQ at the same point

  Starting broker...Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /home/aiida/.postgresql -l logfile start

s6-rc: info: service postgresql-init successfully started
s6-rc: info: service postgresql: starting
waiting for server to start.... done
server started
s6-rc: info: service postgresql successfully started
s6-rc: info: service postgresql-prepare successfully started
 completed with 0 plugins.
unkcpz commented 3 weeks ago

That is weird, it is all fine in my computer, can you try to docker exec into the stuck container and run rabbitmq-diagnostics check_running from another shell?

sphuber commented 2 weeks ago

That is weird, it is all fine in my computer, can you try to docker exec into the stuck container and run rabbitmq-diagnostics check_running from another shell?

That actually seems to work:

(base) aiida@ecd0bdcd1a87:~$ rabbitmq-diagnostics check_running
Checking if RabbitMQ is running on node rabbit@localhost ...
RabbitMQ on node rabbit@localhost is fully booted and running
(base) aiida@ecd0bdcd1a87:~$ rabbitmq-diagnostics is_running
Asking node rabbit@localhost for its status ...
RabbitMQ on node rabbit@localhost is fully booted and running

But the shell with which I ran docker run keeps hanging and never drops me into the bash prompt. Weird....

If I CTRL+C the hanging docker run I get:

^Cs6-rc: warning: received INT, aborting longrun transitions and exiting asap
s6-rc: warning: unable to start service rabbitmq: command crashed with signal 2

so it does indeed seem to have a problem with the service startup. Note that after the CTRL+C it does drop me into a bash shell, but RMQ does not seem to be running and the profile has not been created.

(base) aiida@580ba47ac564:/home/aiida$ verdi status
 ✔ version:     AiiDA v2.5.1.post0
 ✔ config:      /home/aiida/.aiida
 ⏺ profile:     no profile configured yet
Report: Run `verdi presto` to automatically setup a profile using all defaults or use `verdi profile setup` for more control.
(base) aiida@580ba47ac564:/home/aiida$ verdi presto
Report: Option `--use-postgres` not enabled: configuring the profile to use SQLite.
Report: RabbitMQ server not found: configuring the profile without a broker.
Report: Initialising the storage backend.
Report: Storage initialisation completed.
Success: Created new profile `presto`.
Success: Configured the localhost as a computer.

Final data point: when I leave the docker run blocking and hanging, and then docker exec into it in another shell, in that shell RMQ seems to be running and so running verdi presto works as expected and it can connect to RMQ.

danielhollas commented 2 weeks ago

FWIW On Fedora 39 using Podman instead of docker, the following also worked for me (I got a bash interactive prompt)

podman run --rm -it ghcr.io/aiidateam/aiida-core-with-services:latest bash
sphuber commented 2 weeks ago

FWIW On Fedora 39 using Podman instead of docker, the following also worked for me (I got a bash interactive prompt)

podman run --rm -it ghcr.io/aiidateam/aiida-core-with-services:latest bash

But that latest on ghcr.io is not the image from this PR, is it? I think the current workflow definition no longer pushes to the registry in a PR. Could you perhaps try checking out this branch and build locally and try to run it @danielhollas ?

danielhollas commented 2 weeks ago

I am afk and will not get to this in the near future. But the image will get pushed to ghcr.io if you push this branch to origin. In general PRs modifying the docker image should be opened from branches on this origin repo for this reason.

unkcpz commented 2 weeks ago

@danielhollas I think if I make the PR from repo to trigger the whole docker CI, it is not push to ghcr.io with pr-xx tag anymore, correct? Otherwise we can ask @sphuber to try the image being built from CI. At least from the all passed tests that the image built from CI is working as expected..

danielhollas commented 2 weeks ago

It will. The tag will be different, it's the name of the branch. Also you can always download by digest.

unkcpz commented 2 weeks ago

It is not apparently, or I did something wrong? see https://github.com/aiidateam/aiida-core/pull/6475

danielhollas commented 2 weeks ago

Hmm, that's definitely not working as intended then. Please take a look at docker.yml, maybe I made a mistake there. Is the workflow running on main?

Dne čt 13. 6. 2024 11:28 uživatel Jusong Yu @.***> napsal:

It is not apparently, or I did something wrong? see #6475 https://github.com/aiidateam/aiida-core/pull/6475

— Reply to this email directly, view it on GitHub https://github.com/aiidateam/aiida-core/pull/6469#issuecomment-2165116805, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACIY64IPWBNUJW4PL3COR7TZHFQ3LAVCNFSM6AAAAABI6OJABGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRVGEYTMOBQGU . You are receiving this because you were mentioned.Message ID: @.***>

unkcpz commented 2 weeks ago

Is the workflow running on main?

Yes, it runs in main.

skarnet commented 2 weeks ago

@sphuber It sounds like rabbitmq-server is properly starting, but the rabbitmq-diagnostics check_running program invoked from data/check does not properly detect that, or fails to exit 0, so the supervisor does not detect that rabbitmq-server is ready. Could you redirect, in data/check, the stdout and stderr of rabbitmq-diagnostics to a file, and post the contents so we can see what's going on?

sphuber commented 2 weeks ago

@skarnet thanks for the suggestion. It basically shows the same output I already shared before:

Checking if RabbitMQ is running on node rabbit@localhost ...
Error: unable to perform an operation on node 'rabbit@localhost'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@localhost
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: [rabbit@localhost]

rabbit@localhost:
  * connected to epmd (port 4369) on localhost
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on localhost
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-371-rabbit@localhost'
 * effective user's home directory: /home/aiida
 * Erlang cookie hash: nwhUk5qM8X4W/2Jko5yVSg==

Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)
Checking if RabbitMQ is running on node rabbit@localhost ...
Error:
RabbitMQ on node rabbit@localhost is not running or has not fully booted yet (check with is_booting)

But again when I drop into the container with docker exec in another shell, rabbitmq-diagnostics check_running confirms that RMQ is up and running

skarnet commented 1 week ago

That is strange; there has to be some environmental difference between the failing and successful invocations of rabbitmq-diagnostics check_running.

Can you strace both, so we can take a look at the result? i.e. in data/check, as well as in your docker exec environment, replace the invocation with strace -v -s256 -- rabbitmq-diagnostics check_running and store the stderr somewhere. Caution: it will likely be very verbose.