Reject fast signals when workflow task is failing at continueAsNew

Is your feature request related to a problem? Please describe.

When a workflow is closing (continueAsNew/fail/complete), the workflow task fails if any signals have been received but not yet processed. This is by design, to prevent signals from being lost (for example, a continueAsNew with unhandled signals).

This is especially problematic for continueAsNew. In a valid design, a workflow should continueAsNew after processing many signals, to avoid blowing up the history. However, when signals arrive too fast, the workflow task cannot complete and fails in a loop:

WFT scheduled/started
WFT fails when attempting continueAsNew, because a signal is received while the WFT is completing
WFT scheduled/started
...

As a result, the workflow is either terminated for exceeding the history length/size limit (50K events or 50MB) because it can't process the signals, or it accumulates more signals than it can carry over to the next run (the workflow input size limit).
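The loop above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: every failed attempt is assumed to leave a fixed number of new signal events in history, workflow task events themselves are ignored, and `simulate_continue_as_new`, `signals_per_attempt`, and the starting event count are hypothetical names and values, not Temporal APIs. The 50,000 figure is the event limit quoted above.

```python
def simulate_continue_as_new(signals_per_attempt, start_events=1_000, history_limit=50_000):
    """Toy model of the retry loop. Returns (outcome, attempts, events)."""
    events = start_events
    attempts = 0
    while events < history_limit:
        attempts += 1
        if signals_per_attempt == 0:
            # No signal arrived while the task was completing: continueAsNew succeeds.
            return ("continued_as_new", attempts, events)
        # Signals arrived during the task: the task fails and the signals
        # are appended to history, so the next attempt starts with more events.
        events += signals_per_attempt
    # History reached the limit before any attempt could succeed.
    return ("terminated_by_history_limit", attempts, events)
```

In this model the first attempt succeeds when no signals are in flight, while a steady stream of signals on every attempt never lets continueAsNew through and the history grows until it hits the limit.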

Describe the solution you'd like

Option 1: The DescribeWF response returns why the last workflow task is failing (for example, that it is trying to continueAsNew), so that the application can use it to decide whether to reject signals.
Option 2: Temporal server rejects signals based on the last continueAsNew failure.
Option 3: Instead of failing, apply the buffered signals to the next run if the workflow is doing continueAsNew, as described in https://github.com/temporalio/temporal/issues/1289

Describe alternatives you've considered

A workaround today is for the application to describe the workflow, check the pending workflow task attempt, and reject signals when attempt > 1. See https://github.com/indeedeng/iwf/issues/236
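A minimal sketch of that workaround, written against plain callables so it stays independent of any SDK. Everything named here is hypothetical: `describe_pending_task` stands in for whatever wraps the describe-workflow call and returns the pending workflow task (or `None`), `send_signal` stands in for the client's signal call, and `PendingTask`, `SignalRejected`, and `max_attempt` are illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


class SignalRejected(Exception):
    """Raised when a signal is refused because the workflow task is retrying."""


@dataclass
class PendingTask:
    attempt: int  # 1 for a first attempt, >1 once the task has failed and is retrying


def guarded_signal(
    describe_pending_task: Callable[[str], Optional[PendingTask]],
    send_signal: Callable[[str, str, Any], None],
    workflow_id: str,
    signal_name: str,
    payload: Any,
    max_attempt: int = 1,
) -> None:
    """Send the signal only if the pending workflow task is not in a retry loop."""
    pending = describe_pending_task(workflow_id)
    if pending is not None and pending.attempt > max_attempt:
        # The task has already failed at least once; hold off so it can complete.
        raise SignalRejected(
            f"workflow {workflow_id} has a workflow task at attempt {pending.attempt}"
        )
    send_signal(workflow_id, signal_name, payload)
```

The caller decides what a rejection means (retry later, return an error upstream). Note that the describe call and the signal call are not atomic, so this narrows the window rather than closing it.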

Additional context

https://temporalio.slack.com/archives/CTQU95E84/p1680537731857359

Thank you for reporting this issue!

This is a very interesting problem that arises from the interplay of signals, continueAsNew, and the history length limit. It's a tricky situation to manage, and your proposed solutions are all good ideas.

At this time, the Temporal Server doesn't have a built-in way to handle the scenario you describe. However, here's a breakdown of your options and some ways to work around the issue for now:

Option 1: You are correct that exposing in DescribeWF why a workflow task failed would be helpful. It would allow applications to make more informed decisions about how to handle signals. In the meantime:

Application Workaround: Your application could track the number of workflow task attempts, and if it sees multiple attempts in a short period, it could reject signals until the workflow task succeeds, which gives the pending continueAsNew a chance to go through. This might prevent a backlog of signals from building up.
Temporal Feature: This is an excellent feature request that we can consider adding to Temporal.
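One way to sketch "multiple attempts in a short period" is a sliding time window over observed retries. This is an illustration only: `AttemptWindow`, `threshold`, and `window_seconds` are hypothetical names, and how the application learns about a retried workflow task (for example, by describing the workflow) is left to the caller.

```python
from collections import deque


class AttemptWindow:
    """Tracks recently observed workflow task retries for one workflow."""

    def __init__(self, threshold: int = 3, window_seconds: float = 10.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._seen = deque()  # timestamps of observed retries, oldest first

    def record_retry(self, now: float) -> None:
        """Call whenever a workflow task is observed at attempt > 1."""
        self._seen.append(now)

    def should_reject(self, now: float) -> bool:
        """True if at least `threshold` retries fall inside the window ending at `now`."""
        cutoff = now - self.window_seconds
        # Drop observations that have aged out of the window.
        while self._seen and self._seen[0] < cutoff:
            self._seen.popleft()
        return len(self._seen) >= self.threshold
```

Timestamps are passed in explicitly so the caller can use a monotonic clock and the class stays easy to test. Once retries stop being observed, the window drains and signals are accepted again.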

Option 2: Having the Temporal Server reject signals based on the last continueAsNew failure could also be beneficial. This would prevent signal accumulation and potential history size issues.

Application Workaround: A similar approach to option 1, where the application tracks the number of attempts and rejects signals, could be implemented.
Temporal Feature: We can explore this as a possible enhancement in the future.

Option 3: As mentioned in your feature request, the approach described in github.com/temporalio/temporal/issues/1289 could be a good way to address the problem. This would involve buffering signals and applying them to the next workflow run when continueAsNew is used.

Application Workaround: The application could buffer signals internally and send them in batches to the workflow after processing is complete.
Temporal Feature: This is a promising approach we are actively investigating.
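The application-side buffering could look roughly like the sketch below, which collects payloads and hands them over as one batch so the workflow sees one signal instead of many. `SignalBatcher`, `send_batch`, and `max_batch` are hypothetical; `send_batch` stands in for whatever call delivers a single signal carrying a list of payloads, and the workflow would need a handler that accepts such a list.

```python
from typing import Any, Callable, List


class SignalBatcher:
    """Buffers payloads and delivers them in batches via `send_batch`."""

    def __init__(self, send_batch: Callable[[List[Any]], None], max_batch: int = 50):
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._pending: List[Any] = []

    def add(self, payload: Any) -> None:
        """Queue a payload; flush automatically once the batch is full."""
        self._pending.append(payload)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send everything queued so far as one batch; no-op when empty."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._send_batch(batch)
```

A real deployment would also flush on a timer and keep the buffer somewhere durable, since an in-memory list loses its contents if the process stops before a flush.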

We understand this is a pain point, and we appreciate your detailed report and the suggestions for improvement. We are actively working on features to make the management of signals and history size more robust.

In the meantime, the application-level workarounds mentioned above can help to mitigate the issue. We will keep you updated on the progress of any new features related to signal handling and history limits.

dhiaayachi / temporal

Reject fast signals when workflow task is failing at continueAsNew #404