dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Automatically split history event batches when the size of reapplied events is too large #190

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe. When a workflow is reset, signals and updates after the reset point are reapplied (cherry-picked) to the new run. However, all of those reapplied events are grouped into one batch. Our persistence layer validates that each event batch does not exceed the 4MB size limit (each batch is a separate call to persistence). This means that if the total size of the reapplied events is larger than 4MB, the reset cannot be done.
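To make the failure mode concrete, here is a minimal Go sketch of a per-batch size check like the one described above. It is an illustration only: the `event` type, `validateBatch`, and `ErrBatchTooLarge` are hypothetical names, not the server's actual persistence code, and payload length stands in for the real serialized size.

```go
package main

import (
	"errors"
	"fmt"
)

// maxBatchBytes mirrors the 4MB per-batch limit described in this issue.
const maxBatchBytes = 4 * 1024 * 1024

// ErrBatchTooLarge is a hypothetical error; the real server error differs.
var ErrBatchTooLarge = errors.New("event batch exceeds size limit")

// event is a simplified stand-in for a serialized history event.
type event struct {
	ID      int64
	Payload []byte
}

// batchSize returns the total number of payload bytes in the batch.
func batchSize(batch []event) int {
	total := 0
	for _, e := range batch {
		total += len(e.Payload)
	}
	return total
}

// validateBatch rejects a batch whose total size is larger than limit.
func validateBatch(batch []event, limit int) error {
	if size := batchSize(batch); size > limit {
		return fmt.Errorf("%w: %d bytes > %d bytes", ErrBatchTooLarge, size, limit)
	}
	return nil
}

func main() {
	// Three 2MB signals reapplied as one batch: 6MB in total, over the limit,
	// even though each signal would fit in a batch of its own.
	reapplied := []event{
		{ID: 1, Payload: make([]byte, 2*1024*1024)},
		{ID: 2, Payload: make([]byte, 2*1024*1024)},
		{ID: 3, Payload: make([]byte, 2*1024*1024)},
	}
	fmt.Println(validateBatch(reapplied, maxBatchBytes))
}
```

The point of the example is that every individual event is valid, yet grouping them into a single batch makes the whole reset fail.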

From my understanding, this issue applies only to reset, where events from more than one event batch in the base workflow can be picked. The event reapply logic during conflict resolution is triggered by the replication task of a single batch of events, so we won't run into this situation there.

Describe the solution you'd like Reset with more than 4MB reapplied events should be supported.
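One way the requested behavior could work is a greedy, order-preserving split of the reapplied events into batches that each stay under the limit. The Go sketch below is a hypothetical illustration of that idea, not the approach chosen for the server: `event`, `splitBySize`, and the use of a precomputed `Size` field are all assumptions.

```go
package main

import "fmt"

// event is a simplified stand-in for a history event with a known
// serialized size in bytes.
type event struct {
	ID   int64
	Size int
}

// splitBySize packs events, in their original order, into batches whose
// total size stays at or below limit. A single event larger than limit
// cannot fit in any batch, so it is reported as an error.
func splitBySize(events []event, limit int) ([][]event, error) {
	var batches [][]event
	var current []event
	currentSize := 0
	for _, e := range events {
		if e.Size > limit {
			return nil, fmt.Errorf("event %d is %d bytes, larger than the %d-byte limit", e.ID, e.Size, limit)
		}
		// Close the current batch if adding this event would overflow it.
		if len(current) > 0 && currentSize+e.Size > limit {
			batches = append(batches, current)
			current = nil
			currentSize = 0
		}
		current = append(current, e)
		currentSize += e.Size
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches, nil
}

func main() {
	const mb = 1024 * 1024
	events := []event{
		{ID: 1, Size: 2 * mb},
		{ID: 2, Size: 2 * mb},
		{ID: 3, Size: 1 * mb},
		{ID: 4, Size: 3 * mb},
	}
	batches, err := splitBySize(events, 4*mb)
	if err != nil {
		fmt.Println(err)
		return
	}
	for i, b := range batches {
		fmt.Printf("batch %d: %d events\n", i, len(b))
	}
}
```

The sketch keeps event order intact, which matters for reapplied signals and updates. Whether separate batches can still be persisted as one atomic reset is outside its scope.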

Approach 1:

Approach 2:

Describe alternatives you've considered

Additional context

dhiaayachi commented 1 month ago

Feature Request: Support Large Event Batches During Reset

Is your feature request related to a problem? Please describe.

Currently, resetting a workflow with a large number of reapplied signals and updates after the reset point can fail due to the persistence layer's 4MB event batch size limit. This issue occurs because the reapplied events are grouped into a single batch, potentially exceeding the limit.

Describe the solution you'd like

To address this, we propose two possible solutions:

Approach 1: Automatic Batch Creation

Approach 2: Reset-Specific Batching

Describe alternatives you've considered

Additional context

This issue only affects resets, where events from multiple batches in the original workflow can be reapplied. The event reapply logic during conflict resolution, triggered by replication tasks, operates on a single event batch, so this limit is not encountered in that scenario.

References:

Benefits of Implementing This Feature:

dhiaayachi commented 1 month ago

Thank you for reporting this issue. This is a known issue with Temporal's DefaultTransactionSizeLimit, which is 4 MB. Unfortunately, this limit applies to every batch of events persisted to the history, including events that are reapplied during a reset. As a result, workflows whose reapplied events exceed the limit cannot be reset.

To work around this issue, you can consider decreasing the size of your reapplied events, or breaking your workflows into smaller units that would keep the events persisted during a reset under 4 MB.

We appreciate you raising this issue, and we will consider solutions like implementing automatic batch splitting in future versions of Temporal. You can track the progress on this issue in our GitHub repository.

dhiaayachi commented 1 month ago

Thank you for reporting this issue. The 4MB limit for the event batch size during reset is a known limitation of the Temporal service. You can find more information about DefaultTransactionSizeLimit in the Temporal documentation.

There are a few approaches that might work to mitigate the issue while we explore a solution:

  1. Reduce the size of the reapplied events: This could involve optimizing the data being passed in the events or potentially reducing the frequency of updates during the workflow.
  2. Adjust the DefaultTransactionSizeLimit: You can raise the effective limit by setting the system.transactionSizeLimit dynamic config key in your Temporal service. Note that increasing this limit might have performance implications.
  3. Consider batching events: If feasible, you could try to batch the events before applying them during the reset process.
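For the second option, a file-based dynamic config override might look like the fragment below. This is a sketch under assumptions: the key name `system.transactionSizeLimit` and the 8MB value are illustrative and should be checked against the dynamic config constants of the server version in use. The underlying database must also be able to accept writes of that size.

```yaml
# Hypothetical override: raise the per-transaction limit from 4MB to 8MB.
system.transactionSizeLimit:
  - value: 8388608
    constraints: {}
```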

We appreciate your understanding and will work to find a more comprehensive solution.

dhiaayachi commented 1 month ago

Thank you for reporting this issue.

It seems like you are experiencing an issue with reset functionality when the size of reapplied events exceeds the DefaultTransactionSizeLimit (4MB) of the persistence layer.

The DefaultTransactionSizeLimit is an enforced per-batch limit that keeps each call to the persistence layer at a size it can handle.

You are correct that the issue only applies to reset, not conflict resolution.

Here are a few things you can try:

  1. Reduce the size of the events being reapplied: If possible, try to reduce the size of the events being reapplied during the reset process. This might involve optimizing the data being sent in the events.
  2. Increase the DefaultTransactionSizeLimit: You can increase the DefaultTransactionSizeLimit by adjusting the configuration of the Temporal server.
  3. Break the reapplied events into multiple batches: If the size of the events is too large to fit in a single batch, you might need to break the reapplied events into multiple batches and apply them individually.
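The third option, applying batches individually, could be sketched as follows in Go. Everything here is hypothetical: `persistFunc` stands in for one call to the persistence layer, and `applyBatches` and `errStoreUnavailable` are names invented for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a simplified stand-in for a history event.
type event struct{ ID int64 }

// persistFunc stands in for a single call to the persistence layer.
type persistFunc func(batch []event) error

// errStoreUnavailable is a hypothetical failure used only in the demo.
var errStoreUnavailable = errors.New("store unavailable")

// applyBatches persists each batch with its own call, in order, and stops
// at the first failure. It returns how many batches were persisted.
func applyBatches(batches [][]event, persist persistFunc) (int, error) {
	for i, b := range batches {
		if err := persist(b); err != nil {
			return i, fmt.Errorf("batch %d of %d: %w", i+1, len(batches), err)
		}
	}
	return len(batches), nil
}

func main() {
	batches := [][]event{{{ID: 1}, {ID: 2}}, {{ID: 3}}}
	calls := 0
	// Simulate a store that fails on the second call.
	n, err := applyBatches(batches, func(b []event) error {
		calls++
		if calls == 2 {
			return errStoreUnavailable
		}
		return nil
	})
	fmt.Println(n, err)
}
```

The returned count makes one consequence of this option visible: if a later batch fails, earlier batches have already been written, so the caller needs a way to retry or roll back.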

Here are some relevant resources:

Let me know if you have any further questions.

dhiaayachi commented 1 month ago

Thank you for reporting this issue.

As you mentioned, this issue only applies to reset, and is not related to the replication task that handles conflict resolution.

The DefaultTransactionSizeLimit for persistence is indeed 4 MB, and you are experiencing this limit during reset because all the reapplied events are grouped into one batch.

There are a few potential workarounds you could consider:

Please let us know if you have any other questions or if you need further assistance.

dhiaayachi commented 1 month ago

Thank you for reporting this issue.

You are correct that the current implementation of ResetWorkflowExecution does not handle batches larger than 4 MB.

The documentation regarding the "Event batch size" limit can be found here: https://docs.temporal.io/self-hosted-guide/defaults.

We are actively working on a solution to address this issue. We will update the documentation with the details of the solution once it is available.

In the meantime, you may consider using smaller batches for your reset operations. This can be achieved by adjusting your workflow logic to generate smaller event batches.