Persist shard info based on # of history tasks completed

dhiaayachi / temporal

Temporal service

https://docs.temporal.io

MIT License


Persist shard info based on # of history tasks completed #297

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe.

Right now shard info persistence is periodic (by default every 5 minutes), and history task processing progress is part of shard info (shardInfo.QueueStates).

When load on the cluster is high, losing 5 minutes of task processing progress means a lot of reprocessing after a shard reload, since we have no way of knowing that those tasks are duplicates. This duplication also makes rate limiting task processing harder.

If, instead of being time based, the condition were based on the number of tasks processed, we would be able to checkpoint more often when load is high and reduce reprocessing.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

dhiaayachi commented 1 month ago

Thank you for your feature request.

We understand that the current periodic shard info persistence can lead to task reprocessing when load is high.

While this feature is not yet available, you can explore the following workarounds:

Increase the shard info persistence frequency: You can adjust the temporal.task-processing-shard-info-persistence-interval setting to a lower value, e.g., 1 minute, to reduce the amount of task processing progress lost in case of a shard reload.
Implement custom task processing logic: You can implement your own task processing logic to handle task deduplication. This will require more development effort but will provide more granular control over task processing.
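As a rough illustration of the deduplication idea in the second workaround, the sketch below skips a task whose ID has already been recorded. Everything here is an assumption made for illustration: `appliedStore` and `processOnce` are hypothetical names, and the in-memory map only stands in for a durable store, which would be needed for the record to survive a shard reload:

```go
package main

import "sync"

// appliedStore records which task IDs have already taken effect.
// Hypothetical: a real version would have to be durable to survive a
// shard reload; the in-memory map is only a stand-in.
type appliedStore struct {
	mu      sync.Mutex
	applied map[int64]bool
}

func newAppliedStore() *appliedStore {
	return &appliedStore{applied: make(map[int64]bool)}
}

// markIfNew returns true the first time a task ID is seen and false afterwards.
func (s *appliedStore) markIfNew(taskID int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.applied[taskID] {
		return false
	}
	s.applied[taskID] = true
	return true
}

// processOnce runs handler only for task IDs not yet recorded,
// and reports whether the handler ran.
func processOnce(s *appliedStore, taskID int64, handler func(int64)) bool {
	if !s.markIfNew(taskID) {
		return false // duplicate: skip reprocessing
	}
	handler(taskID)
	return true
}
```

This sketch marks a task before running its handler, so a handler failure would leave the task recorded but not applied. A more careful version would record the ID only after success, or in the same transaction as the handler's side effects.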

We are always striving to improve Temporal, and we will consider your feedback as we develop new features.

dhiaayachi commented 1 month ago

Thank you for reporting this issue! It's important for us to optimize task processing performance, and your suggestion to make shard info persistence based on the number of processed tasks instead of time is a valuable one.

While this feature isn't currently available, we can explore alternative solutions. You can adjust the temporal.taskQueuePersistenceInterval parameter in the Temporal server configuration to increase the frequency of shard info persistence. This will reduce the amount of data lost during shard reload. Additionally, you can try increasing the temporal.taskQueueProcessorParallelism parameter to enhance the task processing rate.

We'll keep this suggestion in mind for future development. Please let us know if you have any other questions.

dhiaayachi commented 1 month ago

Thank you for reporting this issue.

You are right, the current periodic shard info persistence can lead to significant reprocessing when the load on the cluster is high.

While a task-based persistence mechanism is not currently available, you can explore these workarounds:

Increase the shard info persistence frequency: You can adjust the temporal.server.shardInfoPersistenceInterval configuration option to a lower value (e.g., 1 minute) to reduce the amount of processing loss.
Implement a custom persistence mechanism: You can write your own persistence logic to handle shard info updates more frequently based on the number of tasks processed.

Please let us know if you have any other questions.

dhiaayachi commented 1 month ago

Thank you for your feature request! We understand the importance of minimizing reprocessing during shard reloads, especially under high load conditions.

Currently, Temporal does not offer task-based checkpointing for shard information. You can work around this limitation by reducing the ShardInfoPersistenceInterval to a smaller value. However, a shorter interval means shard information is written to persistence more often, which may impact performance.

We appreciate your suggestion and will consider it for future enhancements.