galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0
37 stars 50 forks source link

Restarting MQ breaks Pulsar publishing #316

Closed natefoo closed 1 year ago

natefoo commented 1 year ago

Everything else in the equation (Galaxy consume and publish, Pulsar consume) all seem to recover from an MQ downtime/restart, but Pulsar publishing doesn't (and IIRC still claims to be publishing even if it's not as in #291). @cat-bro and @mira-miracoli mentioned that they have both seen this on their servers as well.

natefoo commented 1 year ago

Interestingly, I had to upgrade/restart RabbitMQ today, and the thing that failed to recover was consuming status updates on the Galaxy side.

hexylena commented 1 year ago

Reported in small scale admins: known issue for others, Martin wrote a cron job to restart pulsar.

mira-miracoli commented 1 year ago

I set it to every hour, should not cause problems to do it frequently, I assume(?)

mvdbeek commented 1 year ago

We don't pin kombu to a minimum version, but I think we should. I logged in to jetstream2 and saw we got kombu==5.1.0, while https://github.com/celery/kombu/pull/1269 adds a timeout to the publish call. I think that would likely fix this issue.

mvdbeek commented 1 year ago

Don't update it yet, I have some work to add the timeout option, but I need to test it still.

mvdbeek commented 1 year ago

That should be fixed by https://github.com/galaxyproject/pulsar/pull/324, I've pushed a prerelease to https://pypi.org/project/pulsar-app/0.15.2.dev0/ and https://pypi.org/project/pulsar-galaxy-lib/0.15.2.dev0/.

Note that this needs a client (https://github.com/galaxyproject/galaxy/compare/release_23.0...mvdbeek:galaxy:pulsar_update?expand=1) and server (=pulsar-app) update.

Let me know if this seems to work correctly and I'll publish another stable release.

mira-miracoli commented 1 year ago
mira-miracoli commented 1 year ago

I updated on all machines on the pulsar-node side (waiting for approval to update the galaxy side). However pulsar does not connect to MQ and gives me this error:

Pulsar Log ```shell Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,527 WARNI [pulsar.client.amqp_exchange][consume-setup-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,527 WARNI [pulsar.client.amqp_exchange][consume-setup-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,530 WARNI [pulsar.client.amqp_exchange][consume-status-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,535 WARNI [pulsar.client.amqp_exchange][consume-setup-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,535 WARNI [pulsar.client.amqp_exchange][consume-status-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,537 WARNI [pulsar.client.amqp_exchange][consume-status-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,538 WARNI [pulsar.client.amqp_exchange][consume-kill-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:51 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:51,539 WARNI [pulsar.client.amqp_exchange][consume-kill-pyamqp://galaxy_mira_pulsar:********@mq.galaxyproject.eu:5671//pulsar/galaxy_mira_pulsar?ssl=1] Got timeout, will retry: The read operation timed out Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,326 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_production__kill] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,326 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_benchmarking__setup] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,326 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_production__setup] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,330 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_test__status] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,334 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_production__status] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,335 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_test__setup] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,336 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_benchmarking__status] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,338 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_benchmarking__kill] AMQP heartbeat thread exiting Apr 25 10:19:52 vgcn-mira-central-manager.pulsar.novalocal pulsar[1441339]: 2023-04-25 10:19:52,338 DEBUG [pulsar.client.amqp_exchange][consume-heartbeat-pulsar_test__kill] AMQP heartbeat thread exiting ```
mvdbeek commented 1 year ago

The first message has a timestamp of 10:19:51,527, the last has a timestamp of 10:19:52,338, that's 800ms, do you see more messages after this, what did you set amqp_publish_timeout to and do jobs runs fine ?

mira-miracoli commented 1 year ago

sorry I forgot to mention that this block repeats all the time I did not find a key with amqp_publish_timeout in my pulsar config files. If you mean the RabbitMQ consumer_timeout, this is 21600000

mira-miracoli commented 1 year ago

I made a py-spy dump here: https://gist.github.com/mira-miracoli/adedd4d802b4bd4fa2bd89ffdc0731b4

mira-miracoli commented 1 year ago

I added the amqp_publish_timeout: 2.0 from new app.yml.sample and restarted, but get still the same error

mira-miracoli commented 1 year ago

Jobs are running normal when I submit them directly with htcondor, since pulsar is not connected to MQ, it does not take jobs from galaxy

mvdbeek commented 1 year ago

Can you bump up the timeout ? Some of these messages could be normal

mira-miracoli commented 1 year ago

I bumped it to 20.0 and to 200.0 and get the same result

mvdbeek commented 1 year ago

Hmm, I also see those messages, but they're harmless and jobs complete as usual. I'll push a fix to silence those, but this should've still worked.

mira-miracoli commented 1 year ago

My pulsar node is not connecting to rabbitMQ, so for me it is a problem

mira-miracoli commented 1 year ago

I don't see any pulsar node currently connecting to RabbitMQ

mvdbeek commented 1 year ago

It does for me though, are you sure that's not an unrelated problem ?

mira-miracoli commented 1 year ago

I double-checked the connection string but it still does not work

mvdbeek commented 1 year ago

Can you try https://pypi.org/project/pulsar-app/0.15.2.dev1/ ? It doesn't really make a difference for me, apart from the log messages, but worth a try

mira-miracoli commented 1 year ago

It is connecting and jobs are running now, thank you!

mvdbeek commented 1 year ago

Awesome, thanks for the feedback!