conductor-oss / conductor

Conductor is an event driven orchestration platform
https://conductor-oss.org
Apache License 2.0

Task completion event lost #176

Open ravig-kant opened 4 months ago

ravig-kant commented 4 months ago

Describe the bug
We are facing an issue where a Conductor task remains IN_PROGRESS. The task executes in a do-while loop along with other tasks. The sequence of tasks in the do-while is:

UploadPrepare -> Upload_collectItem_Output -> Upload_item_start -> Upload -> Upload_item_end

In the attached screenshot, for iteration 135, Upload_item_start__135 is IN_PROGRESS. We had already marked Upload_item_start__135 as COMPLETED, and that triggered the next task of the same iteration, Upload__135, which is also COMPLETED. This looks like a case of lost updates. Moreover, the workflow never completes.

Details
Conductor version: 3.18
Persistence implementation: Postgres
Queue implementation: Dynoqueues
Lock: Redis
Workflow definition:

Task definition:
Event handler definition:


Expected behavior
The task and the workflow should have completed.

Screenshots

[Screenshot: 2024-06-04 at 2:07:39 PM]


v1r3n commented 3 months ago

Hi @ravig-kant what database backend are you using?

ravig-kant commented 3 months ago

We are using Postgres as the backend @v1r3n

aradu-atlassian commented 2 months ago

This is not a race condition within the persistence engine being used, but rather one in the general design. In this example the task emits a Kafka message, and the response that marks the task COMPLETED arrives before the task has been persisted as IN_PROGRESS. The remaining code on the original thread then runs its IN_PROGRESS update and moves the task from COMPLETED back to IN_PROGRESS.
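A minimal sketch of that interleaving (hypothetical code, not Conductor's actual persistence layer): each status write blindly overwrites the previous one, so applying the writes in the order observed above leaves the task IN_PROGRESS even though it completed. The task name mirrors the one from the screenshot.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the lost-update interleaving described above:
// every write is an unconditional overwrite, so "last write wins".
public class LostUpdateDemo {
    static ConcurrentMap<String, String> taskStatus = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        String task = "Upload_item_start__135";
        taskStatus.put(task, "SCHEDULED");

        // 1. The worker emits its Kafka message; the completion event is
        //    handled first and the task is marked COMPLETED:
        taskStatus.put(task, "COMPLETED");

        // 2. The original thread then finishes its bookkeeping and
        //    unconditionally marks the task IN_PROGRESS:
        taskStatus.put(task, "IN_PROGRESS");

        // Last write wins: the completion has been lost.
        System.out.println(taskStatus.get(task)); // prints IN_PROGRESS
    }
}
```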

This behaviour would be the same with any persistence engine, and could only be fixed if the update logic itself had a bit more logic to handle this case (potentially through conditional updates, e.g. refusing to overwrite a terminal status).