envato / event_sourcery

A library for building event sourced applications in Ruby
MIT License

Processor foo died with Unable to get a lock on foo 3. #221

Open berkes opened 4 years ago

berkes commented 4 years ago

By default, when a projector encounters an error and crashes, the processor hangs and keeps retrying.

It is then often impossible to fix and rerun because of stale locks when using the Postgres addon.

I assume this is mostly by design, and an effect of how PostgreSQL handles locks and times them out.

In any case, the error is not very helpful and nothing hints at how to fix this. Three problems:

  1. Unable to get a lock on foo 3.

What should I do? Wait for the lock to time out? This seems undocumented. Can I force a lock to disappear? Is this only happening with Postgres?

Maybe it's just a case of documenting this somewhere?
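For the documentation, something like the following might help with finding out who holds the lock. This is only a sketch, and it rests on an assumption I have not verified in the code: that the processor lock is a session-level PostgreSQL advisory lock, with the number in the error message (3 here) as the lock key. The `advisory_lock_holders` helper and its `connection` argument are hypothetical; `pg_locks` and `pg_stat_activity` are standard PostgreSQL views.

```ruby
# Assumption: the processor lock is a session-level PostgreSQL advisory lock.
# For a small single-bigint key such as 3, pg_locks reports the key in `objid`.
ADVISORY_LOCK_HOLDERS_SQL = <<~SQL
  SELECT l.pid, l.objid, a.application_name, a.state
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.locktype = 'advisory' AND l.granted
SQL

# Hypothetical helper. `connection` is anything that responds to #exec(sql)
# and yields rows as hashes with string keys (PG::Connection does this).
def advisory_lock_holders(connection)
  connection.exec(ADVISORY_LOCK_HOLDERS_SQL).map do |row|
    {
      pid: row["pid"].to_i,        # backend process holding the lock
      lock_id: row["objid"].to_i,  # advisory lock key
      client: row["application_name"]
    }
  end
end
```

If that assumption holds, the lock does not time out by itself: a session-level advisory lock lives until it is unlocked or the session ends. Running `SELECT pg_terminate_backend(<pid>)` with a `pid` from the result ends that session and releases its advisory locks.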

  2. The unconfigured runner keeps repeating this.

And therefore keeps refreshing the lock. It appears that closing down all Postgres clients (including the processor) and waiting a while often fixes it, but I cannot reproduce this consistently.

This is probably solved by a smartly placed ensure that releases locks on projector errors. I guess.
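A minimal sketch of what I mean, in plain Ruby. `ProcessorLock` and `run_with_lock` are hypothetical stand-ins and not event_sourcery's API; the point is only where the `ensure` sits.

```ruby
# Hypothetical in-memory stand-in for the processor lock.
class ProcessorLock
  def initialize
    @held = false
  end

  def held?
    @held
  end

  def acquire
    raise "Unable to get a lock" if @held
    @held = true
  end

  def release
    @held = false
  end
end

# Acquire first, then guard only the work with `ensure`, so a failed
# acquire never releases a lock that someone else is holding.
def run_with_lock(lock)
  lock.acquire
  begin
    yield
  ensure
    lock.release # runs on success and when the projector raises
  end
end
```

With this shape, a projector that raises still propagates its error, but the lock is free again for the next run.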

  3. The endless loop from 2. buries the error deep down under (tens of) thousands of backtraces.

The log gets really noisy when stuck in the endless loop. Within minutes, many thousands of similar backtraces accumulate. Finding the original error in there is a matter of being fast, e.g. before the disk fills up or the on-screen logging is rotated.

In this state, hitting ^C to terminate the worker often does not register: the loop handling the worker runs so fast that it frequently misses the shutdown signal. It sometimes takes a minute or more before the workers are actually shut down, causing several thousand more log entries if unlucky.

This is probably solved by configuring the ESP runner with a proper error handler and having it not retry by default?
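To illustrate the behaviour I would expect by default, here is a rough sketch in plain Ruby. `run_processor` and its parameters are hypothetical and not the library's runner; it logs the original error with a short backtrace once, retries at most `max_retries` times, and then re-raises instead of looping forever.

```ruby
require "logger"

# Hypothetical runner loop, for illustration only.
def run_processor(max_retries:, logger:, retry_delay: 0.0)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    if attempts == 1
      # Log the original error once, with only the top of the backtrace.
      trace = Array(e.backtrace).first(5).join("\n")
      logger.error("#{e.class}: #{e.message}\n#{trace}")
    else
      logger.warn("attempt #{attempts} failed again")
    end

    if attempts <= max_retries
      sleep(retry_delay) # a pause here also keeps the loop from spinning
      retry
    end

    raise # give up and surface the error to the caller
  end
end
```

With `max_retries: 0` the processor fails on the first error, so the log holds a single entry for the original failure rather than thousands of repeated backtraces.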