aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

AiiDA will no longer work with rabbitmq>3.7 by default #5105

Closed chrisjsewell closed 4 weeks ago

chrisjsewell commented 3 years ago

In https://github.com/rabbitmq/rabbitmq-server/pull/2990 a consumer_timeout was introduced and set to 15 minutes, meaning that any process task that takes longer than 15 minutes will be cancelled 😬 (there are people in that PR none too happy that this was introduced in a minor version)

The quick fix for this for users is either (a) use rabbitmq 3.7 or lower, or (b) configure consumer_timeout to false. (see also https://www.rabbitmq.com/consumers.html#acknowledgement-timeout)

As the last comment in that PR asks (at the time of writing), it is unclear to me off-hand whether this can be done using the API (i.e. something aiida-core can handle automatically).

chrisjsewell commented 3 years ago

I feel maybe we can put this in the broker_parameters: https://github.com/aiidateam/aiida-core/blob/4174e5de3adbeec785290a02a0fc78d4597e42e0/aiida/manage/configuration/schema/config-v5.schema.json#L322

Two questions:

  1. Will rabbitmq<3.8 complain if passed a parameter that it does not know?
  2. Can we actually set the default to false? The documentation implies it has to be an integer, but in the PR they specifically mention false: https://github.com/rabbitmq/rabbitmq-server/pull/2990#issuecomment-846033907

thoughts @sphuber?

chrisjsewell commented 3 years ago

trying it out in #5106

sphuber commented 3 years ago

I remember looking into the default timeouts a long time ago and I think it is not a value that can be configured from the client. This has to be configured on the server itself. There was even a maximum defined that could not be surpassed, so even if you put a value above it in the config, it would be capped at the hardcoded value. This may have been for older versions of RabbitMQ (around 3.5) and I am not sure if that is still there. Their reasoning is that the main use case for RabbitMQ is "quick" jobs on the order of seconds.

chrisjsewell commented 3 years ago

yeh cheers; #5106 does not appear to break rabbitmq, but obviously no idea yet if it is actually having any effect

chrisjsewell commented 3 years ago

Hmm, yeh no joy yet; trying to set consumer_timeout to 1 in #5106, but that doesn't seem to fail anything

chrisjsewell commented 3 years ago

Yeh no I guess it is not part of https://www.rabbitmq.com/uri-query-parameters.html#tls 😒

I asked about adding it: https://github.com/rabbitmq/rabbitmq-server/pull/2990#issuecomment-908405800, or maybe I should open an actual issue if they don't respond
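
For anyone following along, here is a minimal sketch of what the experiment amounts to. It assumes the extra broker parameters simply end up in the query string of the AMQP URL (consistent with the `?heartbeat=600` shown in `verdi status` output); `build_broker_url` is a hypothetical helper for illustration, not the aiida-core function.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_broker_url(host, port, username, password, **parameters):
    """Hypothetical helper: append extra parameters as a URL query string."""
    base = f"amqp://{username}:{password}@{host}:{port}"
    query = urlencode(parameters)
    return f"{base}?{query}" if query else base


# heartbeat is a documented URI query parameter; consumer_timeout is not.
url = build_broker_url(
    "127.0.0.1", 5672, "guest", "guest", heartbeat=600, consumer_timeout=1
)
print(url)
# amqp://guest:guest@127.0.0.1:5672?heartbeat=600&consumer_timeout=1

# The parameter is present in the URL, but nothing here makes the server act on it.
print(parse_qs(urlsplit(url).query))
```

The parameter travels in the URL without complaint, which matches what was seen in #5106: nothing fails, but since it is not among the documented URI query parameters, it has no visible effect either.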

chrisjsewell commented 3 years ago

Ok opened: https://github.com/rabbitmq/rabbitmq-server/issues/3344 🤞

chrisjsewell commented 3 years ago

> Ok opened: rabbitmq/rabbitmq-server#3344 🤞

Well that was a dead end (we kinda use rabbitmq in a way it is not designed for)

So why don't we just remove it entirely 😉 https://github.com/chrisjsewell/aiida-process-coordinator/discussions/4

giovannipizzi commented 2 years ago

I just had the same issue - Channel closed error for something running > 30 minutes. I checked and indeed I have rabbitmq 3.8.16. We'll probably need to focus on replacing rmq as soon as 2.0 is out... However, I'm sure many people will have this error in 2.0 as now recent versions are >3.7.

Can we make this requirement more obvious? E.g. check in verdi status and print an error that the version of RMQ is not supported and one has to downgrade, at least for the time being?
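
A rough sketch of what such a check could look like, assuming the cutoff from this issue's title (anything above 3.7.x is treated as unsupported). The names `parse_version`, `is_supported` and `status_message` are hypothetical, not existing aiida-core functions, and the exact cutoff would need to be confirmed.

```python
import re
from typing import Tuple

# Assumption from this thread: releases up to 3.7.x predate the consumer timeout.
FIRST_UNSUPPORTED = (3, 8)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted version number from a string such as 'RabbitMQ v3.8.16'."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_supported(version_text: str) -> bool:
    """Compare only (major, minor) against the assumed cutoff."""
    return parse_version(version_text)[:2] < FIRST_UNSUPPORTED


def status_message(version_text: str) -> str:
    """Build the line a status command could print for the broker."""
    if is_supported(version_text):
        return f"broker: {version_text}"
    return f"Warning: {version_text} is not supported; consider downgrading for now."


print(status_message("RabbitMQ v3.7.28"))
print(status_message("RabbitMQ v3.8.16"))
```

Comparing tuples of integers rather than strings avoids the usual trap where "3.12" sorts before "3.8".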

chrisjsewell commented 2 years ago

Adding link to another project encountering the same issue: https://github.com/celery/celery/issues/6760

tsthakur commented 2 years ago

After accidentally getting my rabbitmq updated to 3.9.x I also faced this same issue. I would like to point out that the simplest way to downgrade rabbitmq is to use conda instead of the Debian package. Otherwise one needs to manually downgrade all dependencies like erlang, which has its own dependencies, and it creates a big mess.

So for anyone stumbling here, running the following is all that's required.

conda install -c conda-forge rabbitmq-server=3.7.28

Maybe @giovannipizzi @chrisjsewell we can add this in the wiki where you discuss this issue?

chrisjsewell commented 2 years ago

yeh, as we have just been discussing, I think it is a nicer solution, in terms of dependency management (as opposed to apt or homebrew), but the downside is no automated setup of a background service, using e.g. launchctl (osx), systemd (linux)

Out of interest, I have just posted here, to ask about such a feature https://groups.google.com/a/anaconda.com/g/anaconda/c/z36jZTlJG8g

Zeleznyj commented 2 years ago

I've just had the issue with the channel closed error while running RabbitMQ v3.9.13. I have increased the consumer_timeout as per the documentation, but the jobs crashed after about 5 hours. I have some even older jobs running now, so I'm not sure if this is related to the timeout.

Going through the RabbitMQ documentation, I have noticed a possible mistake in the AiiDA documentation. It suggests:

# 100 hours in milliseconds (increase if you expect your workflows to run longer)
consumer_timeout = 3600000

however this appears to actually correspond to 1 hour, which is also what the RabbitMQ documentation says.

sphuber commented 2 years ago

Thanks for the report @Zeleznyj. Indeed, our wiki is incorrect and that is one hour, which would explain the error. Could you try to up it to, let's say, 3600000000 (1000 hours, just to be on the safe side) and restart the RabbitMQ service? Make sure to stop the daemon first and restart it when RabbitMQ is back up and running.

I will update the wiki now.
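
For reference, the unit conversion can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper names are made up for illustration.

```python
MS_PER_HOUR = 60 * 60 * 1000  # 3_600_000 milliseconds in one hour


def hours_to_ms(hours: int) -> int:
    """Convert hours to milliseconds, the unit used by consumer_timeout."""
    return hours * MS_PER_HOUR


def ms_to_hours(milliseconds: int) -> float:
    """Convert a consumer_timeout value back to hours."""
    return milliseconds / MS_PER_HOUR


print(ms_to_hours(3_600_000))  # the value from the wiki: 1.0 hour
print(hours_to_ms(100))        # 100 hours: 360000000
print(hours_to_ms(1000))       # 1000 hours: 3600000000
```

So the value in the wiki was off by a factor of 100 relative to its comment, and 3600000000 does correspond to 1000 hours.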

Zeleznyj commented 2 years ago

I have tried increasing it, let's see if that helps, but the error is clearly somewhat random.

I have encountered the error before and thought it was related to running AiiDA on a laptop, but this time the computer was on the whole time the jobs were running.

ahkole commented 1 year ago

Has anyone ever tried using the advanced.config to disable the timeout completely? The documentation (https://www.rabbitmq.com/consumers.html#acknowledgement-timeout) specifies that this should be possible by adding the following to a file named advanced.config:

%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].

rikigigi commented 1 year ago

@ahkole I tried RabbitMQ 3.11.4 with the advanced config:

cat > ~/rabbitmq.notimeout.advanced.config <<EOF 
%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].
EOF
export RABBITMQ_ADVANCED_CONFIG_FILE=~/rabbitmq.notimeout.advanced.config
rabbitmq-server

and everything worked as expected

khsrali commented 4 weeks ago

Ok, right now verdi status returns correct instructions

✔ version: AiiDA v2.6.2.post0
✔ config: /tmp/pytest-of-khosra_a/pytest-10/fc80b65e071f67ef50d89cc715645faa0/.aiida
✔ profile: temp-profilecore.sqlite_dos
✔ storage: SqliteDosStorage[/tmp/pytest-of-khosra_a/pytest-10/test_sqlite_version_core_sqlit0]: open,
Warning: RabbitMQ v3.12.1 is not supported and will cause unexpected problems!
Warning: It can cause long-running workflows to crash and jobs to be submitted multiple times.
Warning: See https://github.com/aiidateam/aiida-core/wiki/RabbitMQ-version-to-use for details.
✔ broker: RabbitMQ v3.12.1 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
⏺ daemon: The daemon is not running.

I don't know in which PR this was solved, but I think we can close in here..

khsrali commented 4 weeks ago

Alright, just found it. Adding here for the record: https://github.com/aiidateam/aiida-core/issues/5317

chrisjsewell commented 4 weeks ago

> I don't know in which PR this was solved, but I think we can close in here..

well, it's up to you, but... I would say that is the solution to the "symptom", not the underlying problem (that rabbitmq is really not intended to be used this way) 😅