Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Throughput issue with SQS and long polling #3541

Open bjpirt opened 1 year ago

bjpirt commented 1 year ago

Describe the bug

When Conductor is configured to consume messages from SQS and its queues use long polling, an empty queue slows down consumption from all the other queues. This is a problem when throughput is not uniformly high across your queues.

Example

Conductor creates two queues for internal asynchronous processing, which we don't use. We also create a third queue to trigger workflows via SQS, which is used heavily. All three queues were set to a 20-second long-polling wait. The current implementation seems to check the queues serially: each empty queue blocks for its full 20 seconds before the next one is checked, so we would only pull one message off the active queue every 40 seconds. When we changed the long-polling receive timeout to one second on the two inactive queues, throughput went up accordingly.
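The arithmetic behind the 40-second figure can be sketched as below. This is a simplified model of the serial behaviour described above, not Conductor's actual code; the class and method names are made up for illustration, and it assumes that a receive on a non-empty queue returns immediately while a receive on an empty queue blocks for the full wait time.

```java
import java.util.List;

public class SerialPollingEstimate {
    // Worst-case seconds spent waiting in one serial pass over the queues:
    // every empty queue blocks for its full long-polling wait time.
    static int cycleSeconds(List<Integer> emptyQueueWaitSeconds) {
        int total = 0;
        for (int wait : emptyQueueWaitSeconds) {
            total += wait;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two idle internal queues at 20 s each.
        System.out.println(cycleSeconds(List.of(20, 20))); // prints 40
        // The workaround: idle queues at 1 s each.
        System.out.println(cycleSeconds(List.of(1, 1)));   // prints 2
    }
}
```

Under this model the active queue is visited once per pass, so the delay between visits is the sum of the wait times of the idle queues.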

Details

Conductor version: 3.13.3
Persistence implementation: all
Queue implementation: SQS

To Reproduce

Steps to reproduce the behavior:

  1. Configure Conductor to use SQS as the queue implementation
  2. Add an event handler listening to a third queue
  3. Configure all queues to have a 20-second long-polling receive timeout
  4. Push messages into the event handler queue and observe how slowly they are consumed

Expected behavior

Queue listeners should run independently of each other, so that a quiet queue with long polling configured does not reduce throughput on the other queues.
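One way to get that independence is to give each queue its own polling thread. The sketch below simulates this with in-memory `BlockingQueue`s standing in for SQS queues; it is not Conductor's implementation, the names `IndependentPollers`, `queueOf`, `receive` and `consume` are hypothetical, and the wait times are in milliseconds only to keep the demo short.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class IndependentPollers {
    // Hypothetical helper: builds an in-memory stand-in for an SQS queue.
    static BlockingQueue<String> queueOf(String... messages) {
        return new LinkedBlockingQueue<>(List.of(messages));
    }

    // Simulated long poll: blocks for up to waitMillis if the queue is empty.
    static String receive(BlockingQueue<String> queue, long waitMillis)
            throws InterruptedException {
        return queue.poll(waitMillis, TimeUnit.MILLISECONDS);
    }

    // Runs one polling loop per queue, each on its own thread, for about
    // runMillis, and returns how many messages were consumed from each queue.
    static int[] consume(List<BlockingQueue<String>> queues, long waitMillis, long runMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(queues.size());
        AtomicInteger[] counts = new AtomicInteger[queues.size()];
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(runMillis);
        for (int i = 0; i < queues.size(); i++) {
            counts[i] = new AtomicInteger();
            final int idx = i;
            pool.submit(() -> {
                try {
                    // An idle queue only ever blocks its own thread.
                    while (System.nanoTime() < deadline) {
                        if (receive(queues.get(idx), waitMillis) != null) {
                            counts[idx].incrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(runMillis + 2 * waitMillis + 1000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int[] result = new int[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = counts[i].get();
        }
        return result;
    }

    public static void main(String[] args) {
        // One busy queue and two idle ones, all with a 200 ms "long poll".
        int[] consumed = consume(
                List.of(queueOf("a", "b", "c"), queueOf(), queueOf()), 200, 300);
        System.out.println(consumed[0] + " messages consumed from the busy queue");
    }
}
```

With a thread per queue, the busy queue is drained at its own pace regardless of how long the idle queues block. The trade-off is one thread (and one open long-poll request) per queue, which is usually acceptable for a small, fixed number of queues.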

bjpirt commented 1 year ago

Guessing nobody else is experiencing this - it's quite a big problem for us :-(