Open salaboy opened 2 months ago
If you look at the Dapr sidecar logs, you will see entries like:
WARN[0513] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded' app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0573] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded' app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0613] Activity actor '108adc75-08df-494b-99ec-65735f690802::1::1': 'run-activity' is still running - will keep waiting until '2024-04-25 11:33:18.632479 +0200 CEST m=+3613.908154293' app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
What is happening with this test is that it starts the worker, which connects via gRPC to the Dapr sidecar, and then schedules the workflow so it starts running. As soon as the workflow starts running, your test exits, which also shuts down the worker and closes the gRPC connection to the sidecar.
To my understanding, the workflow engine cannot move forward with the event log for this workflow, because it cannot send commands to the application. If it cannot move the event log forward, it cannot process the workflow terminate command, and the workflow gets stuck retrying any previous command (which in this case was most likely an activity execution).
I don't have sufficient knowledge of the workflow engine to propose a solution, but to me it looks like there is a bit of a disconnect between the workflow actor and the engine: the actor tries to send work to the engine, so it sends it to the app, but if the engine is not connected to the app nothing works. Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if it's safe to terminate the workflow from the backend POV if the connection to the application is absent.
cc @cgillum
Yes, I believe @famarting is correct here. If the worker has disconnected from the sidecar, then it will be unable to receive and process the terminate message, leaving the workflow stuck in the RUNNING state.
Termination works by sending a message to a workflow. When the workflow receives the terminate message, it transitions itself into a completed state with the TERMINATED runtime status. The terminate logic is not implemented at the sidecar/engine/actor layer. If you reconnect your worker app to the Dapr sidecar, then the terminate message should get handled and the workflow will terminate.
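To make the message-based model above concrete, here is a toy simulation in plain Java. It is only a sketch of the behavior described in this thread, not the Dapr implementation: `WorkflowInstanceModel`, its methods, and the `"terminate"` message string are all hypothetical names made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: these types are hypothetical and are not part of the Dapr SDK.
class WorkflowInstanceModel {
    enum Status { RUNNING, TERMINATED }

    private final Deque<String> inbox = new ArrayDeque<>();
    private Status status = Status.RUNNING;
    private boolean workerConnected = true;

    // The backend only enqueues the message; it never changes the status itself.
    void terminate() {
        inbox.add("terminate");
        deliverPending();
    }

    // Simulates the worker app connecting to or disconnecting from the sidecar.
    void setWorkerConnected(boolean connected) {
        workerConnected = connected;
        deliverPending();
    }

    // Messages are handled only while a worker is connected to process them.
    private void deliverPending() {
        while (workerConnected && !inbox.isEmpty()) {
            String message = inbox.poll();
            if (message.equals("terminate")) {
                status = Status.TERMINATED;
            }
        }
    }

    Status status() { return status; }

    int pendingMessages() { return inbox.size(); }
}
```

In this model, calling `terminate()` while the worker is disconnected leaves the status at `RUNNING` with one pending message, and the status only becomes `TERMINATED` after `setWorkerConnected(true)`. That mirrors the stuck state reported in this issue.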
Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if it's safe to terminate the workflow from the backend POV if the connection to the application is absent.
I think this can be considered as an optimization, but it would need to be implemented carefully to ensure that the workflow state is correctly updated in the same way as when a workflow transitions itself into a terminated state, and that the OTel spans are properly emitted.
Maybe instead of the proposal I made to allow the terminate command to succeed despite a client being disconnected, what could make more sense is to implement e2e workflow timeouts.
It could be an optional feature where the user can configure a max timeout for the full lifespan of a workflow. If the workflow has not reached a final status by that time, then the backend transitions it to failed automatically.
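A minimal sketch of what such a check could look like on the backend side, written in plain Java. This is purely illustrative: no such feature exists in the source discussion beyond the proposal, and `WorkflowTimeoutSketch`, `applyMaxLifespan`, and the `Status` values are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the proposed end-to-end timeout; not an existing Dapr feature.
class WorkflowTimeoutSketch {
    enum Status { RUNNING, COMPLETED, FAILED, TERMINATED }

    // In this sketch, every status other than RUNNING counts as final.
    static boolean isFinal(Status status) {
        return status != Status.RUNNING;
    }

    // Returns the status the backend would record at time 'now':
    // a non-final workflow past its deadline is moved to FAILED.
    static Status applyMaxLifespan(Status current, Instant createdAt,
                                   Duration maxLifespan, Instant now) {
        if (isFinal(current)) {
            return current; // never override a final status
        }
        Instant deadline = createdAt.plus(maxLifespan);
        return now.isAfter(deadline) ? Status.FAILED : current;
    }
}
```

The important design point is that the decision uses only data the backend already holds (creation time, configured limit, current status), so it would not depend on the worker being connected.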
In what area(s)?
/area runtime
What version of Dapr?
1.13.2
Expected Behavior
Workflows, no matter how many activities they are running, should be able to be terminated by calling the terminateWorkflow API.
One approach that could be implemented here is to pause the workflow if it executes activities in a loop, to avoid unwanted recursion.
Actual Behavior
If a workflow is started that creates tons of activities, it keeps running forever and cannot be terminated, leaving the workflow in a permanent Running state.
Steps to Reproduce the Problem
0) Install Dapr in a Kubernetes cluster (I am using version 1.13.2 and helm charts)
1) I used Dapr shared but with Dapr Sidecar is the same:
2) Clone https://github.com/salaboy/workflows-bugbash-java
3) Run tests using Maven -> Specifically this test: https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L71
4) Try to terminate the instance created by using the Terminate API -> https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L104
5) Check the status of the workflow after running terminate. It should show as Running
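For anyone who wants to repeat steps 4 and 5 outside the test suite, the following is a rough sketch that calls the sidecar's HTTP workflow API directly with the JDK's `HttpClient`. It rests on assumptions that are not confirmed in this issue: the sidecar HTTP port `3500`, the workflow component name `dapr`, and the `v1.0-beta1` endpoint paths as I recall them for Dapr 1.13. Please verify these against the Dapr documentation for your version. `WorkflowHttpSketch` is a hypothetical helper name.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; the base URL below is an assumption, adjust it for your setup.
class WorkflowHttpSketch {
    static final String BASE = "http://localhost:3500/v1.0-beta1/workflows/dapr/";

    // Builds the request that asks the sidecar to terminate a workflow instance.
    static HttpRequest terminateRequest(String instanceId) {
        return HttpRequest.newBuilder(URI.create(BASE + instanceId + "/terminate"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Builds the request that fetches the current state of a workflow instance.
    static HttpRequest statusRequest(String instanceId) {
        return HttpRequest.newBuilder(URI.create(BASE + instanceId)).GET().build();
    }

    // Sends terminate, then returns the raw response body of the status query.
    static String terminateThenGetStatus(String instanceId) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        client.send(terminateRequest(instanceId), HttpResponse.BodyHandlers.discarding());
        return client.send(statusRequest(instanceId), HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

With a running sidecar, `WorkflowHttpSketch.terminateThenGetStatus("<instance id>")` returns the raw status response, which can be inspected to see whether the instance is still reported as running after the terminate call.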
Release Note
RELEASE NOTE: