Summary
This change filters out activity and timer cancellation commands before the result of a workflow task is sent back to the Temporal server. A cancellation command is dropped when the activity or timer it targets has already finished (for example, an activity that has already completed or failed) within that workflow task.
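The filtering step can be pictured as a single pass over the outgoing command list. The following is a minimal sketch under assumed names. `Command`, `CANCEL_TYPES`, `filter_cancellations` and the symbol command types are hypothetical and are not the SDK's actual classes or API.

```ruby
require 'set'

# Hypothetical command shape for illustration only (not the SDK's command classes).
# `type` is a symbol; `target_id` identifies the activity or timer it refers to.
Command = Struct.new(:type, :target_id)

# Hypothetical command types that cancel something scheduled earlier.
CANCEL_TYPES = %i[request_cancel_activity cancel_timer].freeze

# Drops cancellation commands whose target has already finished in this
# workflow task; all other commands pass through unchanged, in order.
def filter_cancellations(commands, closed_target_ids)
  commands.reject do |command|
    CANCEL_TYPES.include?(command.type) && closed_target_ids.include?(command.target_id)
  end
end

# Usage: activity 'a1' finished in this task, timer 't1' is still pending.
commands = [
  Command.new(:schedule_activity, 'a2'),
  Command.new(:request_cancel_activity, 'a1'),
  Command.new(:cancel_timer, 't1')
]
filtered = filter_cancellations(commands, Set['a1'])
filtered.map(&:type) # => [:schedule_activity, :cancel_timer]
```

Only cancellations are candidates for removal. Scheduling commands and cancellations of still-open targets are left alone.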
Motivation
Existing code in future.rb already makes cancellation a no-op for activity and timer futures that have already completed at that point in workflow execution. However, this does not cover the case where an activity or timer is canceled in the same workflow task in which it completes or fails. Consider the following scenario:
- In an earlier workflow task:
  - An activity is scheduled when workflow code starts it.
- In a later workflow task:
  - A signal is received, and its handler cancels the activity. This produces a RequestCancelActivityTaskCommand, which is added to the list of commands to be sent back to the Temporal server.
  - The activity completes or fails.
Because the activity's future is not yet complete when the signal handler runs, the cancellation goes ahead and the command is produced. Later in the same history window, however, the workflow reaches an event showing that the activity has already finished, so the cancellation is no longer valid. When the RequestCancelActivityTaskCommand is sent, the Temporal server rejects it as invalid and the workflow task is retried. Every retry fails the same way, leaving the workflow "stuck".
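The ordering problem can be modeled with a toy workflow task. This is a simplified sketch, not the SDK's implementation. `TaskSketch`, its methods and the array-shaped commands are hypothetical. The sketch shows that the cancel command is queued before the completion event is seen, and that filtering at the end of the task removes it.

```ruby
require 'set'

# Hypothetical, heavily simplified model of one workflow task: events are applied
# in history order and commands accumulate until the task's result is sent back.
class TaskSketch
  def initialize
    @commands = []
    @closed = Set.new
  end

  # Workflow code (here, a signal handler) cancels an activity. Like the existing
  # no-op in future.rb, nothing is queued if the activity is already known to be done.
  def cancel_activity(activity_id)
    return if @closed.include?(activity_id)

    @commands << [:request_cancel_activity, activity_id]
  end

  # A later event in the same history window reports that the activity completed or failed.
  def activity_closed(activity_id)
    @closed << activity_id
  end

  # Commands to return to the server; with filtering, stale cancellations are dropped.
  def final_commands(filter: true)
    return @commands.dup unless filter

    @commands.reject { |type, id| type == :request_cancel_activity && @closed.include?(id) }
  end
end

task = TaskSketch.new
task.cancel_activity('a1')          # signal handled first: future not yet complete
task.activity_closed('a1')          # completion/failure event processed afterwards
task.final_commands(filter: false)  # => [[:request_cancel_activity, "a1"]]
task.final_commands                 # => []
```

Filtering at the end of the task, rather than at the moment of cancellation, is what catches this case. When the signal handler runs, nothing yet indicates that the activity has finished.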
Testing
- Adds a new integration spec that reproduces this race condition.
- Adds new unit specs for the state manager that test this filtering behavior precisely.