Address force completion when make a request through CompleteByID with a failure.

dhiaayachi commented 3 weeks ago

Is your feature request related to a problem? Please describe. This is a follow-up feature of this issue.

For an async call, the activity may fail due to a bug or design drawback after a call to an external server. However, temporal server may receive a request from CompleteByID API by other servers while the new attempt for a retryable error of the activity is not started yet. In this case, if the request is to complete the activity, we would complete the activity even the new attempt has not started yet so that we can unblock the workflow (refer this PR). However, if the request is to fail the activity, we may think an appropriate way to handle such cases as we are not sure the failure is to fail the activity or it is a transit error and we want to attempt the activity again.

Describe the solution you'd like For the request to fail an activity: 1) If the request is to force fail an activity, we should fail the activity if the attempt for a retryable error has not started yet. 2) If the request is to fail an activity due to a non-retryable error, we should fail the activity.

Describe alternatives you've considered We may introduce a separate API or a flag for a client to tell the server that it would like a request to force fail the activity.

Additional context Add any other context or screenshots about the feature request here.

dhiaayachi commented 1 week ago

Feature Request: Handle `CompleteByID` requests to fail Activities when a retry is pending

Is your feature request related to a problem? Please describe.

This is a follow-up to this issue regarding handling CompleteByID requests to fail activities when a retry is pending.

Currently, if an async activity fails with a retryable error and the Temporal server receives a CompleteByID request to fail the activity before the retry attempt starts, the server may incorrectly handle this request.

This situation can occur when other servers attempt to fail the activity while the retry is pending. The server may not be able to differentiate between a legitimate failure and a transit error, potentially leading to an unexpected activity failure and blocking the workflow.

Describe the solution you'd like

For requests to fail an activity using CompleteByID:

Force Fail: If the request explicitly indicates a force fail, the server should fail the activity even if the retry attempt for a retryable error has not yet started. This ensures the activity is immediately failed and unblocks the workflow.
Non-Retryable Error: If the request indicates a failure due to a non-retryable error, the server should fail the activity. This handles cases where the activity failed due to a condition that should not be retried.

Describe alternatives you've considered

Introducing a separate API or a flag for clients to indicate a force fail request for an activity is a potential alternative. However, this could lead to API complexity and potential misuse.

Additional context

This enhancement aims to provide more precise control over activity failures when a retry is pending, ensuring that the Temporal server correctly interprets and handles these scenarios. This will improve the robustness and predictability of workflow execution when dealing with asynchronous activities and potential retry situations.

Relevant references:

dhiaayachi commented 1 week ago

Thank you for reporting this issue! This feature request is related to a previous issue https://github.com/temporalio/temporal/issues/987.

As a workaround for this issue, you may create a new API to distinguish force fail requests from requests to fail the activity due to non-retryable errors.

dhiaayachi commented 1 week ago

Thanks for reporting this issue!

This is a known issue and you can find more information here: https://github.com/temporalio/temporal/pull/5724.

If you have any other questions, please let me know.

dhiaayachi commented 1 week ago

Thanks for reporting this issue.

This is a known issue related to the CompleteByID API and retryable errors. You can find more information in the following documentation:

alexseedkou commented 3 days ago

My understanding here is you are trying to ask for a separate API to handle the case to fail an activity due to an error.

During this async call, the client will pass in any error if they would let the server know during the client call here. Then the request will be converted to the corresponding request to fail the activity (for instance here). In this case, the server would behavior accordingly.

dhiaayachi / temporal