Netflix / conductor-community

Apache License 2.0
61 stars 72 forks source link

pollMessages() in PostgresQueueDAO does not break when popMessages() returns empty message list. #236

Open Veera93 opened 1 year ago

Veera93 commented 1 year ago

Describe the bug In pollMessages implementation we need to address two issues when messagesSlices object is empty.

  1. The loop continues when there are no eligible messages in the queue. This leads to unnecessary querying of the database until the timeout is reached, and the sleep time of 100ms escalates the problem. The mix of these two together causes high queries per second at scale.
  2. When the number of messages is less than the count specified, the thread blocks and holds onto the previous messages in the memory until timeout happens or the count is reached.

Please share your thoughts on this. If we agree, I can raise a Merge Request for the same.

Details Conductor version: any Persistence implementation: Postgres Queue implementation: Postgres

To Reproduce Steps to reproduce the behavior:

  1. Run conductor with Postgres as the persistence layer for Queue implementation.
  2. Monitor QPS in Postgres using pg_stat_statements Note: Here we are only seeing the effect of pollMessages called from the WorkflowReconciler where the timeout is 2000ms and count depends on the CPU core.

Expected behavior

Return the messages that are present instead of waiting and retrying till the timeout or count.

Screenshots Just started the service and execute pg_stat_statements. Below, is the number of queries executed per minute from one instance just for WorkflowReconciler. The QSP increases exponentially if there are multiple instances and if the workers poll for activity.

Screenshot 2023-06-20 at 3 55 16 PM

Note: Here we are only seeing the effect of pollMessages called from the WorkflowReconciler. There are other places where pollMessage is being invoked.

Additional context The MySQL implementation of the poll does not have this while(true) loop.

Veera93 commented 1 year ago

In the conductor version 3.13.7, I see that the default timeout is 2000 ms, but we can override it to 0 which will execute timeout < 1 the if block for the desired results. But ideally, this timeout should be the query timeout. If this is changed in the future it would cause an issue.

Just curious to know why we are not using this timeout value as a query timeout for the database call.