Closed chrisjsewell closed 4 weeks ago
I feel maybe we can put this in the broker_parameters: https://github.com/aiidateam/aiida-core/blob/4174e5de3adbeec785290a02a0fc78d4597e42e0/aiida/manage/configuration/schema/config-v5.schema.json#L322
Two questions:
false; in the documentation it implies it has to be an integer, but in the PR they specifically mention false: https://github.com/rabbitmq/rabbitmq-server/pull/2990#issuecomment-846033907
Thoughts @sphuber?
trying it out in #5106
I remember looking into the default timeouts a long time ago and I think it is not a value that can be configured from the client. This has to be configured on the server itself. There even was a maximum defined that could not be surpassed, so even if you put a value above it in the config, it would be capped at the hardcoded value. This may have been for older versions of RabbitMQ (around 3.5) and I am not sure if that is still there. Their reasoning is that the main use case for RabbitMQ is "quick" jobs on the order of seconds.
yeh cheers, #5106 does not appear to fail rabbitmq, but obviously no idea yet if it is actually having any effect
Hmm, yeh no joy yet; trying to set consumer_timeout to 1 in #5106, but that doesn't seem to fail anything
Yeh no I guess it is not part of https://www.rabbitmq.com/uri-query-parameters.html#tls 😒
I asked about adding it: https://github.com/rabbitmq/rabbitmq-server/pull/2990#issuecomment-908405800, or maybe I should open an actual issue if they don't respond
Ok opened: rabbitmq/rabbitmq-server#3344 🤞
Well that was a dead end (we kinda use rabbitmq in a way it is not designed for)
So why don't we just remove it entirely 😉 https://github.com/chrisjsewell/aiida-process-coordinator/discussions/4
I just had the same issue - Channel closed error for something running > 30 minutes. I checked and indeed I have rabbitmq 3.8.16. We'll probably need to focus on replacing rmq as soon as 2.0 is out... However, I'm sure many people will have this error in 2.0 as now recent versions are >3.7.
Can we make this requirement more obvious? E.g. check in verdi status and print an error that the version of RMQ is not supported and one has to downgrade, at least for the time being?
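For illustration, here is a minimal sketch of the kind of version check suggested above. It is not the aiida-core implementation: the function names are hypothetical, and the cut-off `FIRST_AFFECTED` is an assumption derived only from the advice in this thread to "use rabbitmq 3.7 or lower" (the exact patch release that introduced the timeout may be later).

```python
import re

# Assumption: treat everything after the 3.7 series as affected, following the
# "use rabbitmq 3.7 or lower" advice in this thread. Adjust if you know the
# exact release that introduced the consumer timeout.
FIRST_AFFECTED = (3, 8, 0)


def parse_version(text):
    """Return the first 'X.Y.Z' version found in ``text`` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def rabbitmq_version_warning(version_text, first_affected=FIRST_AFFECTED):
    """Return a warning string for affected versions, or None otherwise."""
    version = parse_version(version_text)
    if version >= first_affected:
        return (
            f"RabbitMQ v{'.'.join(map(str, version))} enforces a consumer timeout "
            "by default; long-running tasks may be cancelled unless "
            "consumer_timeout is raised or disabled on the server."
        )
    return None
```

A status command could call `rabbitmq_version_warning` with the version string reported by the broker and print the result when it is not `None`.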
Adding link to another project encountering the same issue: https://github.com/celery/celery/issues/6760
After accidentally getting my rabbitmq updated to 3.9.x I also faced this same issue. And I would like to point out that the simplest way to downgrade rabbitmq would be to use conda instead of debian package. Otherwise one needs to manually downgrade all dependencies like erlang which has its own dependencies and it creates a big mess.
So for anyone stumbling here, running the following is all that's required:
conda install -c conda-forge rabbitmq-server=3.7.28
Maybe @giovannipizzi @chrisjsewell we can add this in the wiki where you discuss this issue?
yeh, as we have just been discussing, I think it is a nicer solution, in terms of dependency management (as opposed to apt or homebrew), but the downside is no automated setup of a background service, using e.g. launchctl (osx), systemd (linux)
Out of interest, I have just posted here, to ask about such a feature https://groups.google.com/a/anaconda.com/g/anaconda/c/z36jZTlJG8g
I've just had the issue with the channel closed error while running RabbitMQ v3.9.13. I have increased the consumer_timeout as per the documentation, but the jobs crashed after about 5 hours. I have some even older jobs running now, so I'm not sure if this is related to the timeout.
Going through the RabbitMQ documentation, I have noticed a possible mistake in the AiiDA documentation. It suggests:
# 100 hours in milliseconds (increase if you expect your workflows to run longer)
consumer_timeout = 3600000
however this appears to actually correspond to 1 hour, which is also what the RabbitMQ documentation says.
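The unit conversion can be checked with a few lines of Python. This is only a sanity check of the arithmetic; the helper names are made up for the example, and the only fact relied on is that consumer_timeout is given in milliseconds, as stated in the comment quoted above.

```python
MS_PER_HOUR = 60 * 60 * 1000  # 3,600,000 milliseconds in one hour


def hours_to_ms(hours):
    """Convert a duration in hours to milliseconds."""
    return hours * MS_PER_HOUR


def ms_to_hours(ms):
    """Convert a duration in milliseconds to hours."""
    return ms / MS_PER_HOUR


print(ms_to_hours(3600000))  # 1.0, so the quoted value is one hour
print(hours_to_ms(100))      # 360000000, the value that would give 100 hours
```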
Thanks for the report @Zeleznyj. Indeed, our wiki is incorrect and that is one hour, which would explain the error. Could you try to up it to, let's say, 3600000000 (1000 hours, just to be on the safe side) and restart the RabbitMQ service? Make sure to stop the daemon first and restart it when RabbitMQ is back up and running.
I will update the wiki now.
I have tried increasing it, let's see if that helps, but the error is clearly somewhat random.
I have encountered the error before and thought it was related to this since I'm running AiiDA on a laptop, but this time the computer was on the whole time the jobs were running.
Has anyone ever tried using the advanced.config to disable the timeout completely? The documentation (https://www.rabbitmq.com/consumers.html#acknowledgement-timeout) specifies that this should be possible by adding the following to a file named advanced.config:
%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].
@ahkole I tried RabbitMQ 3.11.4 with the advanced config:
cat > ~/rabbitmq.notimeout.advanced.config <<EOF
%% advanced.config
[
{rabbit, [
{consumer_timeout, undefined}
]}
].
EOF
export RABBITMQ_ADVANCED_CONFIG_FILE=~/rabbitmq.notimeout.advanced.config
rabbitmq-server
and everything worked as expected
Ok, right now verdi status returns correct instructions:
✔ version: AiiDA v2.6.2.post0
✔ config: /tmp/pytest-of-khosra_a/pytest-10/fc80b65e071f67ef50d89cc715645faa0/.aiida
✔ profile: temp-profilecore.sqlite_dos
✔ storage: SqliteDosStorage[/tmp/pytest-of-khosra_a/pytest-10/test_sqlite_version_core_sqlit0]: open,
Warning: RabbitMQ v3.12.1 is not supported and will cause unexpected problems!
Warning: It can cause long-running workflows to crash and jobs to be submitted multiple times.
Warning: See https://github.com/aiidateam/aiida-core/wiki/RabbitMQ-version-to-use for details.
✔ broker: RabbitMQ v3.12.1 @ amqp://guest:guest@127.0.0.1:5672?heartbeat=600
⏺ daemon: The daemon is not running.
I don't know in which PR this was solved, but I think we can close in here..
Alright, just found it. Adding here for the record: https://github.com/aiidateam/aiida-core/issues/5317
Well, it's up to you, but... I would say that is the solution to the "symptom", not the underlying problem (that rabbitmq is really not intended to be used this way) 😅
In https://github.com/rabbitmq/rabbitmq-server/pull/2990 a consumer_timeout has been introduced and set to 15 minutes, meaning that any process task that takes longer than 15 minutes will be cancelled 😬 (there are people in that PR none too happy that this was introduced in a minor version).
The quick fix for this for users is either (a) use rabbitmq 3.7 or lower, or (b) configure consumer_timeout to false (see also https://www.rabbitmq.com/consumers.html#acknowledgement-timeout).
As is literally the last comment in that PR, at the time of writing, it is unclear to me off-hand if this can be done using the API (i.e. something aiida-core can handle automatically)?