dapr / dapr

Dapr is a portable, event-driven, runtime for building distributed applications across cloud and edge.
https://dapr.io
Apache License 2.0

Dapr Workflows cannot be terminated if they are running lots of activities #7706

Open salaboy opened 2 months ago

salaboy commented 2 months ago

In what area(s)?

/area runtime

/area operator

/area placement

/area docs

/area test-and-release

What version of Dapr?

1.13.2


Expected Behavior

Workflows, no matter how many activities they are running, should be able to be terminated by calling the terminateWorkflow API.

Actual Behavior

If a workflow that creates tons of activities is started, it keeps running forever and cannot be terminated, leaving the workflow in a permanent Running state.

One approach that could be implemented here is to pause the workflow if it executes activities in a loop, to avoid unwanted recursion.

Steps to Reproduce the Problem

0) Install Dapr in a Kubernetes cluster (I am using version 1.13.2 and the Helm charts).

1) I used Dapr Shared, but the behavior is the same with the Dapr sidecar:

helm install my-workflows-app oci://registry-1.docker.io/daprio/dapr-shared-chart --set shared.appId=my-workflow-app --set shared.daprd.image.tag=1.13.2 --set shared.strategy=deployment

kubectl port-forward svc/my-workflow-app-dapr 50001:50001

2) Clone https://github.com/salaboy/workflows-bugbash-java

3) Run the tests using Maven, specifically this test: https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L71

4) Try to terminate the created instance using the Terminate API: https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L104

5) Check the status of the workflow after running terminate. It will still show as Running.
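For step 4, the terminate call can also be issued directly against the sidecar's HTTP endpoint instead of the Java SDK. A minimal Go sketch that builds the request URL, assuming the beta workflow HTTP route (`v1.0-beta1`) exposed around Dapr 1.13 and the default HTTP port 3500 (both are assumptions; check the docs for your exact version):

```go
package main

import "fmt"

// terminateURL builds the Dapr workflow terminate endpoint.
// The route assumes the beta workflow HTTP API available in the 1.13
// timeframe; newer releases may use a stable prefix instead.
func terminateURL(daprHTTPPort int, instanceID string) string {
	return fmt.Sprintf("http://localhost:%d/v1.0-beta1/workflows/dapr/%s/terminate",
		daprHTTPPort, instanceID)
}

func main() {
	// Print the URL you would POST to (with an empty body) to request termination.
	fmt.Println(terminateURL(3500, "108adc75-08df-494b-99ec-65735f690802"))
}
```

In this bug report the POST succeeds, but the workflow status never leaves Running.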

Release Note

RELEASE NOTE:

famarting commented 2 months ago

if you look at the dapr sidecar logs you will see entries like

WARN[0513] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0573] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0613] Activity actor '108adc75-08df-494b-99ec-65735f690802::1::1': 'run-activity' is still running - will keep waiting until '2024-04-25 11:33:18.632479 +0200 CEST m=+3613.908154293'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2

What is happening with this test is that it starts the worker, which connects via gRPC to the Dapr sidecar, and then it schedules the workflow so it starts running. As soon as the workflow starts running, your test exits, which also shuts down the worker, and the gRPC connection to the sidecar closes.

To my understanding, the workflow engine cannot move the event log forward for this workflow because it cannot send commands to the application. If it cannot advance the event log, it cannot process the workflow terminate command, and the workflow gets stuck retrying the previous command (which in this case was most likely an activity execution).

I don't have sufficient knowledge of the workflow engine to propose a solution, but to me it looks like there is a bit of a disconnect between the workflow actor and the engine: the actor sends work to the engine, which sends it to the app, but if the engine is not connected to the app, nothing works. Maybe there should be some optimization or logic that breaks this kind of retry loop when a terminate command is detected. I don't know if the client side MUST receive the terminate workflow command to safely terminate the workflow, or if it's safe to terminate the workflow from the backend's point of view when the connection to the application is absent.

olitomlinson commented 2 months ago

cc @cgillum

cgillum commented 2 months ago

Yes, I believe @famarting is correct here. If the worker has disconnected from the sidecar, then it will be unable to receive and process the terminate message, leaving the workflow stuck in the RUNNING state.

Termination works by sending a message to a workflow. When the workflow receives the terminate message, it transitions itself into a completed state with the TERMINATED runtime status. The terminate logic is not implemented at the sidecar/engine/actor layer. If you reconnect your worker app to the Dapr sidecar, then the terminate message should get handled and the workflow will terminate.
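As a rough illustration of the transition described here (hypothetical Go with invented names, not the engine's actual code): the runtime status only leaves RUNNING when a connected worker actually processes the terminate message.

```go
package main

import "fmt"

// workflow models just enough state for the sketch: a runtime status and
// an inbox of pending messages.
type workflow struct {
	status string   // e.g. RUNNING, TERMINATED
	inbox  []string // pending messages, such as "terminate"
}

// processInbox delivers queued messages only while a worker is connected;
// the workflow transitions itself to TERMINATED on a terminate message.
func (w *workflow) processInbox(workerConnected bool) {
	if !workerConnected {
		return // no worker: messages, including "terminate", stay queued
	}
	for _, msg := range w.inbox {
		if msg == "terminate" {
			w.status = "TERMINATED"
		}
	}
	w.inbox = nil
}

func main() {
	wf := &workflow{status: "RUNNING", inbox: []string{"terminate"}}

	wf.processInbox(false) // worker disconnected: status is unchanged
	fmt.Println("disconnected:", wf.status)

	wf.processInbox(true) // worker reconnects: terminate is finally handled
	fmt.Println("reconnected:", wf.status)
}
```

This matches the suggested workaround: reconnect the worker app and the pending terminate message gets handled.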

Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if it's safe to terminate the workflow from the backend POV if the connection to the application is absent.

I think this can be considered an optimization, but it would need to be implemented carefully to ensure that the workflow state is updated in the same way as when a workflow transitions itself into a terminated state, and that the OTel spans are properly emitted.

famarting commented 1 week ago

Maybe instead of my earlier proposal to allow the terminate command to succeed despite a client being disconnected, what could make more sense is to implement end-to-end (e2e) workflow timeouts.

It could be an optional feature where the user can configure a max timeout for the full lifespan of a workflow: if the workflow has not reached a final status by that time, the backend automatically transitions it to failed.