galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

MQ unacknowledged retries are not very multiprocess-intelligent #344

Open natefoo opened 1 year ago

natefoo commented 1 year ago

When amqp_acknowledge is enabled, the client requests acknowledgement of each message it sends by asking for a return message on the corresponding _ack exchange. If it does not receive that ack within amqp_ack_republish_time, it resends the message.
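To make the resend rule concrete, here is a minimal sketch of the bookkeeping it implies. The class and method names are hypothetical and are not Pulsar's actual implementation; only the idea of comparing a message's age against amqp_ack_republish_time comes from the description above.

```python
import time
from typing import Dict, List, Optional, Tuple


class UnackedTracker:
    """Hypothetical in-memory tracker of sent-but-unacknowledged messages."""

    def __init__(self, republish_time: float) -> None:
        # Plays the role of amqp_ack_republish_time, in seconds.
        self.republish_time = republish_time
        # message_id -> (time sent, payload)
        self._pending: Dict[str, Tuple[float, dict]] = {}

    def sent(self, message_id: str, payload: dict, now: Optional[float] = None) -> None:
        """Record that a message was published and is awaiting an ack."""
        sent_at = time.time() if now is None else now
        self._pending[message_id] = (sent_at, payload)

    def acked(self, message_id: str) -> None:
        """Forget a message once its ack arrives; unknown ids are ignored."""
        self._pending.pop(message_id, None)

    def due_for_republish(self, now: Optional[float] = None) -> List[str]:
        """Return ids whose ack has not arrived within republish_time."""
        now = time.time() if now is None else now
        return [
            message_id
            for message_id, (sent_at, _payload) in self._pending.items()
            if now - sent_at >= self.republish_time
        ]
```

The optional `now` argument exists only so the timing can be exercised without sleeping.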

On the other end, a receipt is kept of messages that have been received and acted upon, so that if acknowledgements were sent but not received, retransmission does not result in duplication of work.
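The receipt check could look roughly like the following. This is a sketch under assumptions: receipts are stored as one empty file per message id in a directory, and `handle_once` is a hypothetical helper, not a Pulsar function.

```python
import os
from typing import Callable


def handle_once(receipt_dir: str, message_id: str, action: Callable[[], None]) -> bool:
    """Run `action` unless a receipt for `message_id` already exists.

    Returns True if the action ran, False if the message was a duplicate.
    """
    receipt = os.path.join(receipt_dir, message_id)
    if os.path.exists(receipt):
        # Already acted upon; a retransmission must not repeat the work.
        return False
    action()
    # Write the receipt only after the work succeeded.
    with open(receipt, "w"):
        pass
    return True
```

A retransmitted message then hits the existing receipt and is skipped, though the receiver can still send its ack again.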

If you have multiple handlers serving the same Pulsar endpoints, each of them starts its own ack manager, which reviews the unacknowledged messages, retransmits any that are older than amqp_ack_republish_time, and then sleeps for 15 seconds before repeating. Locking exists only within a process, not between processes, so multiple acknowledgement managers can retransmit the same message. This is probably harmless, because the receiving side deduplicates using its receipts, but it is not ideal.

Additionally, the acknowledgement persistence_directory needs to be on a shared filesystem if handlers are located on multiple hosts.

Ideally, for each unique runner + manager combination, the handlers would elect one of themselves as the acknowledgement manager.
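One possible way to do the election, sketched here as an assumption rather than a proposal for the actual implementation, is an advisory lock on a file in the acknowledgement persistence_directory: whichever handler gets the lock acts as ack manager, and the others skip retransmission. The function name and lock file naming are hypothetical. `fcntl.flock` is POSIX-only, and its behaviour on network filesystems depends on the filesystem and mount options, so this may not be reliable when handlers run on multiple hosts.

```python
import fcntl
import os
from typing import Optional


def try_become_ack_manager(persistence_directory: str, key: str) -> Optional[int]:
    """Try to claim the ack manager role for one runner + manager `key`.

    Returns an open file descriptor holding the lock on success, or None
    if another handler already holds it. Closing the descriptor (or the
    process exiting) releases the role.
    """
    # Hypothetical naming scheme for the lock file.
    path = os.path.join(persistence_directory, f"{key}.ack_manager.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Non-blocking exclusive lock: fails immediately if already held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd
```

Each handler would call this before its periodic review and retransmit only while it holds the descriptor. Because the kernel drops the lock when the holder exits, another handler can take over on its next attempt.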