Supervision strategies and Error Recovery of Graph Actors

jan-g commented 7 years ago

There are some external considerations that may lead to a graph actor panicking: persistence DB connectivity problems in particular, which are transient (but we still want to know about them).

The supervisor should be able to schedule actor rematerialisations in the future (perhaps this is better off handled by a helper actor, if a plugin for this isn't already available). It should receive "try waking this up again" messages for graph actors with some jittery exponential backoff (up to a maximum limit). We need some notion of "stability period" after which we consider an actor to have been successfully restarted. Behaviour at the maximum backoff should be to retry indefinitely(? maybe - tbd).

We should expose counters for any actors that we are rematerialising via prometheus - those numbers going up may be a signal that we need some operator oversight.

Finally, there's a gotcha in the current architecture if we do this. Currently a panicked actor will fail any stages that were under execution in fn; we assume we're recovering on system restart. For a transiently panicking graph, there may be an extant executor goroutine out there (with the PID of the panicked actor?) which still holds open a connection to fn. TBD: Should we (a) deal with failing those fired stages differently? (b) arrange it so that an executor goroutine can route a message to the new graph actor?

gviedma-zz commented 7 years ago

There's a few different failure cases that could lead to a graph actor panicking:

Actor fails to persist a message due to a DB connection error
Actor fails to persist a message due to an event marshalling error
An unexpected runtime error in the actor or its associated graph data structure.

The main issue I see with implementing a backoff strategy for 1) is that it breaks transactional semantics with respect to the flow client on the runtime side. Our model so far has been that flow graph topology changes are transactional, so that when the client appends a new stage the request either succeeds or fails after persisting the event. Implementing a backoff policy for actor materialization could result in the server's graph actor and client's flow manifestations drifting away from the state persisted to storage, leading to inconsistencies if the graph actor were to die. Also, it is unclear who would be responsible for retrying failed writes to the database (this is not currently implemented by the persistence plugin infrastructure).

I think there is value in maintaining the current transactional semantics since it provides consistency guarantees and makes it easy to reason about the state of flows. I therefore think it might be safest to fail-fast in case of transient DB connection issues and stop the actor. This would result in the client-side flow request timing out and the error bubbling up to the runtime to deal with in an application-specific way. It also means that we would need to mark (and persist) a graph/flow as failed/inconsistent, to prevent further mutations at a later point. A way to do this would be to have the persistence layer callback to the failed actor analogously to Akka's onPersistFailure.

Regarding 2) and 3) I think again these indicate an inconsistent state of the actor after which further processing would be unsafe. Stopping the actor seems to me the safest way to deal with these cases as well.

gviedma-zz commented 6 years ago

Also related to this is https://github.com/fnproject/flow/issues/115

fnproject / flow

Supervision strategies and Error Recovery of Graph Actors #109