dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Reject fast signals when workflow task is failing at continueAsNew #404

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe.

When a workflow is closing (continueAsNew/fail/complete), the workflow task fails if any signals have been received but not yet processed. This is by design, to prevent signals from being lost (for example, a continueAsNew with unhandled signals).

This is especially problematic for continueAsNew. In a valid design, a workflow should continueAsNew after processing many signals, to avoid blowing up the history. However, when signals arrive too fast, the workflow task cannot process them all before it tries to close, so it keeps failing in a loop.
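The loop can be illustrated with a toy model. This is only a sketch and does not use the Temporal SDK or server: `run_workflow_task` and `attempts_until_close` are hypothetical names, and the rule "the close fails if any signal arrived during the task" is a simplification of the behaviour described above.

```python
from collections import deque


def run_workflow_task(buffered, arrivals_during_task):
    """Simulate one workflow task attempt that ends with continueAsNew.

    buffered: signals visible at the start of the task (the workflow drains them).
    arrivals_during_task: signals that arrive while the task is in flight.
    Returns (succeeded, unprocessed_signals).
    """
    buffered.clear()  # the workflow handles everything it can see
    unprocessed = deque(arrivals_during_task)
    # Simplified rule: the close is rejected if any signal arrived unseen.
    return len(unprocessed) == 0, unprocessed


def attempts_until_close(arrival_pattern):
    """Return the attempt number on which the close succeeds, or None if it never does."""
    buffered = deque()
    for attempt, arrivals in enumerate(arrival_pattern, start=1):
        ok, buffered = run_workflow_task(buffered, arrivals)
        if ok:
            return attempt
    return None
```

With a quiet gap in the signal stream the close eventually succeeds; with a signal arriving during every attempt it never does, which is the failure loop reported here.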

Describe the solution you'd like

Option 1: The DescribeWF response returns why the last workflow task is failing, i.e. whether it is trying to do a continueAsNew, so that the application can use it to decide whether to reject signals.

Option 2: The Temporal server rejects signals based on the last continueAsNew failure.

Option 3: Instead of failing, apply the buffered signals to the next run if the workflow is doing continueAsNew, as described in https://github.com/temporalio/temporal/issues/1289

Describe alternatives you've considered

A workaround today is for the application to describe the workflow, check the attempt count of the pending workflow task, and reject signals when attempt > 1. See https://github.com/indeedeng/iwf/issues/236
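A minimal sketch of that workaround, written against hypothetical callables rather than a real SDK client: `describe` is assumed to return the pending workflow task (or `None`), and `send` is assumed to deliver the signal. `PendingTask`, `SignalRejected`, and `send_signal_guarded` are illustrative names, not Temporal APIs. In a real application `describe` would wrap the DescribeWorkflowExecution call and read the attempt of the pending workflow task from its response.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class PendingTask:
    """Hypothetical stand-in for the pending workflow task info."""
    attempt: int


class SignalRejected(Exception):
    """Raised when the signal is refused because the workflow task is retrying."""


def send_signal_guarded(
    describe: Callable[[str], Optional[PendingTask]],
    send: Callable[[str, str, Any], None],
    workflow_id: str,
    signal_name: str,
    payload: Any,
    max_attempt: int = 1,
) -> None:
    """Send a signal unless the pending workflow task is already being retried."""
    pending = describe(workflow_id)
    # attempt > 1 means the last workflow task failed and is being retried,
    # possibly because a continueAsNew was blocked by unhandled signals.
    if pending is not None and pending.attempt > max_attempt:
        raise SignalRejected(
            f"workflow {workflow_id}: task attempt {pending.attempt} > {max_attempt}"
        )
    send(workflow_id, signal_name, payload)
```

The check is racy (the attempt can change between the describe and the signal call) and it cannot tell a continueAsNew failure from any other task failure, which is what Option 1 would address.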

Additional context

https://temporalio.slack.com/archives/CTQU95E84/p1680537731857359

dhiaayachi commented 2 months ago

Thank you for reporting this issue!

This is a very interesting problem that arises from the interplay of signals, continueAsNew, and the history length limit. It's a tricky situation to manage, and your proposed solutions are all good ideas.

At this time, the Temporal Server doesn't have a built-in way to handle the scenario you describe. However, here's a breakdown of your options and some ways to work around the issue for now:

Option 1: You are correct that adding information about why a workflow task failed to DescribeWF would be helpful. This could allow applications to make more informed decisions about how to handle signals.

Option 2: Having the Temporal Server reject signals based on the last continueAsNew failure could also be beneficial. This would prevent signal accumulation and potential history size issues.

Option 3: As mentioned in your feature request, the approach described in github.com/temporalio/temporal/issues/1289 could be a good way to address the problem. This would involve buffering signals and applying them to the next workflow run when continueAsNew is used.
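As a rough illustration of the carry-over idea in Option 3, the toy model below seeds the next run with the signals that were not processed instead of failing the close. It is not how the Temporal server is implemented; `Run` and `continue_as_new_with_carry_over` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """Toy representation of a single workflow run."""
    run_number: int
    inbox: List[str] = field(default_factory=list)


def continue_as_new_with_carry_over(current: Run, late_signals: List[str]) -> Run:
    """Close `current` and start the next run with the unprocessed signals pre-delivered."""
    return Run(run_number=current.run_number + 1, inbox=list(late_signals))
```

Under this model the close always succeeds, and signals that arrived late are handled first by the new run, so no signal is lost and the task does not loop.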

We understand this is a pain point, and we appreciate your detailed report and the suggestions for improvement. We are actively working on features to make the management of signals and history size more robust.

In the meantime, the application-level workaround described in the issue (rejecting signals when the pending workflow task attempt is greater than 1) can help to mitigate the problem. We will keep you updated on the progress of any new features related to signal handling and history limits.