dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Persist shard info based on # of history tasks completed #297

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe. Right now, shard info persistence is periodic (every 5 minutes by default), and history task processing progress is part of shard info (shardInfo.QueueStates).

When load on the cluster is high, losing 5 minutes of task processing progress means a lot of reprocessing after a shard reload, since we have no way of knowing that those tasks are duplicates. This duplication also makes rate limiting of task processing harder.

If the condition were based on the number of tasks processed instead of on elapsed time, we could checkpoint more often when load is high and reduce reprocessing.

dhiaayachi commented 1 month ago

Thank you for your feature request.

We understand that the current periodic shard info persistence can lead to task reprocessing when load is high.

While this feature is not yet available, you can explore the following workarounds:

We are always striving to improve Temporal, and we will consider your feedback as we develop new features.

dhiaayachi commented 1 month ago

Thank you for reporting this issue! It's important for us to optimize task processing performance, and your suggestion to make shard info persistence based on the number of processed tasks instead of time is a valuable one.

While this feature isn't currently available, there are alternatives to explore. You can lower the temporal.taskQueuePersistenceInterval parameter in the Temporal server configuration so that shard info is persisted more frequently. This reduces the amount of task processing progress lost during a shard reload. Additionally, you can try increasing the temporal.taskQueueProcessorParallelism parameter to raise the task processing rate.

We'll keep this suggestion in mind for future development. Please let us know if you have any other questions.

dhiaayachi commented 1 month ago

Thank you for reporting this issue.

You are right, the current periodic shard info persistence can lead to significant reprocessing when the load on the cluster is high.

While a task-based persistence mechanism is not currently available, you can explore these workarounds:

Please let us know if you have any other questions.

dhiaayachi commented 1 month ago

Thank you for your feature request! We understand the importance of minimizing reprocessing during shard reloads, especially under high load conditions.

Currently, Temporal does not offer task-based checkpointing for shard information. You can work around this limitation by setting ShardInfoPersistenceInterval to a smaller value. However, this increases how often shard information is written to persistence, which may impact performance.

We appreciate your suggestion and will consider it for future enhancements.