dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License
0 stars 0 forks source link

Allow update timeout/retry config for started activities with retry #398

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe

There are two use cases that this feature will be super helpful for Temporal applications:

Use case 1: Making async activity completion more reliable and efficient.

Async activity completion is very handy/powerful/useful in certain scenario, when a workflow want to invoke an external action, and then wait for a response from external. This could be done by activity + signal, however, async activity completion could be more natural, less code to write, and more efficient (less events in history).

However, the current async activity completion has a potential reliability issue.

The activity needs to pass down the activity taskToken(or at least the activityId) to external system so that the activity task can be completed by sending the token/taskId to Temporal service. So the async activity has be be implemented this below sequence. (see doc )

activityMethodForAsyncCompletion(ctx Context ){
    // Retrieve the activity information needed to asynchronously complete the activity.
    activityInfo := cadence.GetActivityInfo(ctx)
    taskToken := activityInfo.TaskToken

    // Send the taskToken to the external service that will complete the activity.
    ...

    // Return from the activity a function indicating that Cadence should wait for an async completion
    // message.
    return "", activity.ErrResultPending
}

But anything could happen in Send the taskToken to the external service that will complete the activity. There could be a timeout, or a worker sudden disruption. Note that right now SDK doesn't do anything to activity.ErrResultPending (just not respond anything to server )

When the failure happens, if the start to close timeout is very long, then the external system will not receive the token until next retry.

As a workaround, application has to set very short start to close timeout, and then keep retrying on the activity. This leads to issues: a. If the external system uses a async mechanism like message queue to receive the token, then it will receive a lot of duplicates b. If the external system use a database to store it, then additional logic must be implemented with atomic/traditional operation for adding the token. The workaround not along cause a lot of complication, but also very in-efficient to do. Because in reality , this is a very extremely rare case (but we can't ignore it for some critical system, and it's against the Temporal's strong consistency guarantee).

One solution to this is to let activity set a smaller heartbeat timeout initially, and then update the heartbeat timeout to infinite()) after successfully sending the token to external system.

Use case 2: Allow using different backoff retry/timeout strategy for different errors

Right now the backoff retry and policy is set as a global for all types of errors. This is inconvenient for some use cases that some retriable error can do a much bigger backoff than others.

If allowing update the timeout config, then SDK can return back a different config based on different errors returned from the application(activity results).

Describe the solution you'd like**

The started activities with retry have the timeout/retry config stored in mutable state, it should be straightforward to introduce an API to allow updating it. Also, the activity retry won't write down started even until closing so it should cause any problem for history replay.

Describe alternatives you've considered**

For use case 1(async activity), an alternative could be letting SDK send back the activity.ErrResultPending to server, so that server can stop retry for the activity. Right now SDK doesn't do anything to activity.ErrResultPending (just not respond anything to server )

Additional context**

https://temporalio.slack.com/archives/CTQU95E84/p1681159067543939