envato / event_sourcery

A library for building event sourced applications in Ruby
MIT License

ESPRunner shuts down after a child process terminates #215

Closed orien closed 5 years ago

orien commented 5 years ago

We're encountering a problem when running our event processors via the ESPRunner. When an event processor running in a child process fails and terminates prematurely, the ESPRunner ignores the failure and keeps running. Eventually, our event processor lag monitor alerts the on-call developer, who can then restart the ESPRunner manually.

The process status list looks something like this:

> ps aux
USER     PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
app        1  0.0  1.6  90992 65204 ?        Ssl  03:40   0:02 ruby /app/script/processors
app        2  0.1  1.7  93624 68692 ?        Sl   03:40   0:04 MyApp::Queries::Projector1
app        3  0.0  0.0      0     0 ?        Z    03:40   0:03 [ruby] <defunct>
app        4  0.1  1.6  93640 67660 ?        Sl   03:40   0:05 MyApp::Queries::Projector3

Change

I propose that instead of ignoring the terminated child process, the ESPRunner could optionally consider this a catastrophic failure and (gracefully) stop the remaining child processes, before exiting itself with a status code indicating error. This would allow us to identify the problem earlier, and automatically resolve the problem by restarting the ESPRunner.
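As a rough illustration of the proposed behaviour, here is a minimal sketch in plain Ruby. It is not the ESPRunner implementation: `run_supervised` and its `stop_on_failure:` option are hypothetical names chosen for this example, and each "processor" is just a callable run in a forked child.

```ruby
# Hypothetical sketch, not EventSourcery's API.
# Forks one child per job and reaps children as they exit. When a child
# exits unsuccessfully and stop_on_failure is true, the remaining children
# are sent TERM. Returns 1 if any child failed, otherwise 0.
def run_supervised(jobs, stop_on_failure: true)
  pids = jobs.map { |job| Process.fork { job.call } }
  failed = false

  until pids.empty?
    pid, status = Process.wait2(-1)   # block until any child exits, then reap it
    next unless pids.delete(pid)      # ignore children this method did not start
    next if status.success?           # nil (killed by a signal) counts as failure

    already_failed = failed
    failed = true
    next if already_failed || !stop_on_failure

    # First failure: ask the remaining children to stop gracefully.
    pids.each do |remaining|
      begin
        Process.kill(:TERM, remaining)
      rescue Errno::ESRCH
        # The child has already gone away; nothing to signal.
      end
    end
  end

  failed ? 1 : 0
end
```

A script could then finish with something like `exit run_supervised(jobs)`, so that whatever supervises the container or service sees a non-zero status and restarts it. Because every child is reaped with `Process.wait2`, a failed child does not linger as a `<defunct>` entry in the process list.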

Test Documentation

EventSourcery::EventProcessing::ESPRunner
  start!
    starts ESP processes
    graceful shutdown
      upon receiving a TERM signal
        it starts to shutdown
      upon receiving a INT signal
        it starts to shutdown
      sends processes the TERM signal
      exits indicating success
      given shutdown has been requested
        but the processes failed before shutdown
          doesn't send processes the TERM, or KILL signal to the failed process
          exits indicating failure
      given shutdown has not been requested
        and we've requested shutdown if a child process fails
          and the processes fail
            starts the shutdown process after being notified of the failure
            doesn't send processes the TERM, or KILL signal to the failed process
            exits indicating failure
      given the process exits just before sending signal
        doesn't send the signal more than once
        exits indicating failure
      given the process does not terminate until killed
        sends processes the KILL signal
        exits indicating failure
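The TERM-then-KILL escalation described in the last group of examples could look roughly like the following. This is a sketch under assumptions: `stop_child` and `grace_seconds` are hypothetical names, not part of the ESPRunner, and the polling interval is arbitrary.

```ruby
# Hypothetical helper, not EventSourcery's API.
# Sends TERM once, polls for up to grace_seconds for the child to exit,
# then escalates to KILL. Returns the child's Process::Status.
def stop_child(pid, grace_seconds: 5)
  Process.kill(:TERM, pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace_seconds

  loop do
    # With WNOHANG, wait2 returns nil while the child is still running.
    reaped, status = Process.wait2(pid, Process::WNOHANG)
    return status if reaped
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep 0.05
  end

  # The child ignored TERM for the whole grace period; KILL cannot be trapped.
  Process.kill(:KILL, pid)
  _, status = Process.wait2(pid)
  status
end
```

The returned status lets the caller tell a clean exit (`status.success?`) from a forced one (`status.signaled?`), which is what would drive the "exits indicating failure" outcome.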

Considerations

I'm not convinced the EventSourcery gem should be responsible for such process management. We should probably be moving this functionality to a gem like Forked.

joesustaric commented 5 years ago

This is an interesting proposal, @orien.

To me this could be a little context-dependent. If a child process fails, do you want or need the main process to shut down, affecting the other running child processes? I can think of reasons for both sides of that argument.

Could be useful to have an option to toggle this behaviour rather than imposing this as a blanket rule.

I certainly agree that process management feels a little out of scope of the event_sourcery gem as well.

orien commented 5 years ago

That's a good point. I'll make the behaviour optional. That way teams can choose based on their circumstances.

mjward commented 5 years ago

> I certainly agree that process management feels a little out of scope of the event_sourcery gem as well.

To my knowledge, literally all of ~our~ Envato's ES apps handle this a little differently.

mjward commented 5 years ago

@orien do you know if any other teams/services have adopted a grouping similar to what RSS has?

orien commented 5 years ago

Closed in favour of #216.