astalosj closed 1 year ago
We are seeing this too on Galaxy Australia, only since upgrading from 0.14.13 to 0.15.2 two weeks ago.
Based on a suggestion from @mira-miracoli, I've downgraded pulsar to version 0.15.2.dev1. There were some consumer errors right after rebooting the pulsar central manager (Jul 14), but since then there have been no errors in the logs.
There are no code changes between 0.15.2.dev1 and 0.15.2; the only difference is the changelog update. You can see this in https://github.com/galaxyproject/pulsar/compare/0.15.1...0.15.2
@astalosj thanks for the report and the logs. This should be fixed in 0.15.3 released yesterday.
Thanks, I've updated pulsar to 0.15.3.
After the update to 0.15.3, pulsar stopped publishing job results (after some time). According to the logs it sent job status updates to AMQ, but Galaxy did not fetch them (there are 12 messages in the pulsar_production__status_update queue for my vhost). Let me know if it might be related, or if I should file it as a separate issue.
Yes, that looks like a separate issue. If you have logs from the Galaxy side that would be helpful.
@astalosj did you manage to get any logs from Galaxy when this happens?
@mvdbeek I think we also need to catch TimeoutError?
2023-08-23 15:04:04,686 ERROR [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@amqp.galaxyproject.org:5671//main_pulsar?ssl=1] Problem consuming queue, consumer quitting in problematic fashion!
Traceback (most recent call last):
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/client/amqp_exchange.py", line 141, in consume
    connection.drain_events(timeout=self.__timeout)
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/kombu/connection.py", line 318, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/kombu/transport/pyamqp.py", line 135, in drain_events
    return connection.drain_events(**kwargs)
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/amqp/connection.py", line 523, in drain_events
    while not self.blocking_read(timeout):
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/amqp/connection.py", line 528, in blocking_read
    frame = self.transport.read_frame()
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/amqp/transport.py", line 299, in read_frame
    frame_header = read(7, True)
  File "/expanse/projects/qstore/pen160/xgalaxy/main/pulsar/venv/lib/python3.9/site-packages/amqp/transport.py", line 573, in _read
    s = recv(n - len(rbuf))  # see note above
  File "/expanse/projects/qstore/pen160/xgalaxy/conda/envs/__python@3.9/lib/python3.9/ssl.py", line 1101, in read
    return self._sslobj.read(len)
TimeoutError: [Errno 110] Connection timed out
We do, that is very weird. Is kombu maybe outdated?
To answer my own question: that only works in Python 3.10+. It will be fixed in https://github.com/galaxyproject/pulsar/pull/337
My pulsar endpoint (v0.15.2), connected to usegalaxy.eu through the mq.galaxyproject.eu RabbitMQ broker, sometimes stops accepting jobs. The jobs are stuck on the Galaxy server and do not appear in the pulsar logs. After restarting pulsar, the jobs start to run and finish without problems. The pulsar endpoint was installed from the vggp-v60-j224-e0d36d08062d-dev image (Rocky Linux 9).
There are errors in the pulsar logs before it stops accepting jobs:
The errors appeared at (times are in CEST):
The time 07:04 is most critical but there's nothing unusual in the logs. The pulsar service is restarted by cron daily at 6:11. @mira-miracoli didn't find anything relevant in the mq.galaxyproject.eu logs.
Python package versions: