capistrano / sshkit

A toolkit for deploying code and assets to servers in a repeatable, testable, reliable way.
MIT License
1.15k stars 253 forks source link

Wait for all threads to finish on parallel runner #535

Open djmb opened 6 months ago

djmb commented 6 months ago

If there's an error in one thread, we wait for earlier hosts to complete but not for later ones.

Ensure consistent behaviour across all hosts by saving the exception and joining each thread in turn before returning.

mattbrictson commented 6 months ago

Hi @djmb , could you give me a little more backstory on this proposal?

My interpretation is that SSHKit's parallel runner, in its current form, is effectively "fail-fast". In other words, as soon a host fails, an exception is raised right away. If the exception is not rescued, then the Ruby process exits and any subsequent hosts are not allowed to run to completion.

The change being proposed in this PR is to remove "fail-fast" for in: :parallel execution and instead explicitly wait for all hosts to run to completion. Once all hosts are allowed to complete, then an exception is raised corresponding to the first failure. (However, if in: :groups is used, the first group that fails would still short-circuit subsequent groups and prevent them from running.)

I hesitate to accept this PR as-is because this seems like a significant change that could be considered breaking. Although I was not present for the original creation of SSHKit, I have to assume that the fail-fast behavior was an intentional design decision.

Are there other solutions that wouldn't involve changing the default behavior? Could you register a custom runner for your particular use case? For example, you should be able to set the default using:

SSHKit.config.default_runner = MyRunner

Or as a one-off:

on(hosts, in: MyRunner) do
  # ...
end

Would either of those work?

djmb commented 5 months ago

HI @mattbrictson,

My reasoning for the change is that the current behaviour is not really fail fast - the parallel runner joins the threads in turn so if the second host thread has an exception, we wait for the first host thread, but not the third.

Since we sometimes wait and sometimes do not, it seems more consistent to always wait.

However always waiting is also the behaviour that I'm looking for, so maybe there's some motivated reasoning going on here 😅.

Using the default runner doesn't work completely in my case, as the group runner uses the parallel runner internally so I need create a custom parallel runner and a custom group runner. But I can certainly work around this if we want to keep the current behaviour.