coinbase / temporal-ruby

Ruby SDK for Temporal
Apache License 2.0
230 stars 86 forks source link

Timer timing issue #201

Open drewhoskins-stripe opened 1 year ago

drewhoskins-stripe commented 1 year ago

temporal-ruby Timer timing issue

Scenario:

The workflow thus receives two TIMER_FIRED events in its history at once. Now what should happen?

What’s happening in temporal-ruby today

It processes the timers in order. Sleep first, then the other timer. The workflow completes first, and it’s not allowed to do anything after the workflow has completed. Thus, if the timer issues a command, the task will fail with Temporal::WorkflowAlreadyCompletingError.

Attempted solution

My first attempt was to early-out in state_manager.rb

        history_window.events.each do |event|
          apply_event(event)
          break if if workflow_finished?
        end

This fixed my problem and is logical; the workflow will execute the same as it would have if the worker had never gone down. But, it will silently drop events. I asked Maxim what other SDKs do…

Other SDKs’ behaviors

According to Maxim:

Temporal processes new events in batches. One batch per workflow task. It applies all the events before running workflow code. So in case of Java or Go, for example the Futures that correspond to timer will be set to ready before running any workflow threads. This way workflow can check if any of these Futures are ready to decide what to do next.

Signals have similar properties, but they use callbacks in Java and some other SDKs. Temporal SDK schedules callback threads before running workflow threads. This way workflow cannot complete before the callback threads were called.

Then in case of Java SDK it is going to make the main workflow thread eligible to run. But it still will run only after all other threads executed and all other Futures resolved. Without this functionality you guaranteed to lose signals if they are delivered as callbacks.

Rust core does the same

Java implementation - we can a see that callbacks (timer in this case) have a higher priority than the main thread (which is awoken by the sleep in this case).

What now?

Changing temporal-ruby's behavior is obviously somewhat tricky and would cause version changes if not done in a version-safe way. We might take this fix on but want to work with the community.

antstorm commented 1 year ago

Thanks for bringing this up!

I think there might be a relatively painless fix to this. All the concurrent execution threads in Ruby are represented by fibers, so if we can distinguish the callback fibers from the main one in which the workflow was executed and skip resuming it until the very end of the history window. We already do something very similar with event handlers ordering.

A simpler fix that you can do straight away — is to check yourself whether a workflow has finished or not from your time callback using workflow.completed?. Not pretty, but at least will save you from failing workflow tasks.

In term of version safety — I don't think it should affect anyone. At least I don't think there's a non-failure scenario where anyone would depend on this kind of behaviour. It they did they would already be seeing failures