akka / akka-projection

Akka Projections is intended for building systems with the CQRS pattern and facilitates event-based service-to-service communication.
https://doc.akka.io/docs/akka-projection/current/

Projection status #173

Closed jroper closed 4 years ago

jroper commented 4 years ago

There's no direct way of knowing whether a projection is currently in a failed or running state. Of course, you can kind of work it out by tracking failure exceptions in your log aggregator and seeing whether the last exception occurred within the max backoff time, but this won't always give you the right answer, just a guess. It would be nice to be able to simply ask "is this projection currently failing?" and get a straight answer. This functionality could be achieved by adding a failure column to the offset store: if zero, it would indicate that the offset was processed successfully; if one or more, it would mean that processing the offset has failed that many times. When loading the offset, care would need to be taken to return the offset before the one in the store if it was failing. A nice-to-have feature would be to also store details of the exception in the offset store in the case of failure.
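A minimal sketch of the failure-column idea, to make the load/save rules concrete. Everything here is hypothetical: `OffsetRow`, its `failures` and `lastError` fields, and the helper functions are illustration names, not part of akka-projection. It also assumes offsets are plain sequence numbers, so "the offset before" is simply `offset - 1`; other offset types would need the previous offset stored explicitly.

```scala
object OffsetStoreSketch {
  // Hypothetical offset-table row; `failures` is the proposed extra column.
  final case class OffsetRow(offset: Long, failures: Int, lastError: Option[String])

  // Record a failed attempt at `offset`. Repeated failures at the same
  // offset increment the count; a failure at a new offset starts at 1.
  def recordFailure(current: Option[OffsetRow], offset: Long, error: String): OffsetRow =
    current match {
      case Some(row) if row.offset == offset && row.failures > 0 =>
        row.copy(failures = row.failures + 1, lastError = Some(error))
      case _ =>
        OffsetRow(offset, failures = 1, lastError = Some(error))
    }

  // Saving a successfully processed offset also clears the failure state.
  def recordSuccess(offset: Long): OffsetRow =
    OffsetRow(offset, failures = 0, lastError = None)

  // The offset to resume after. A failing row holds an offset that was NOT
  // processed, so the last processed offset is the one before it
  // (assumes sequence-number offsets).
  def lastProcessed(row: OffsetRow): Long =
    if (row.failures > 0) row.offset - 1 else row.offset

  // The "is this projection currently failing?" query.
  def isFailing(row: OffsetRow): Boolean = row.failures > 0
}
```

With this shape, a status query only needs to read one row per projection.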

patriknw commented 4 years ago

I don't think this belongs in the Offset table. I understand that this originates from the needs of an admin UI, but for serious troubleshooting I think you would need more than just the current status. The fact that the status would be overwritten also prevents post mortem analysis.

Would be better as a separate table, like a failure log table.

For Akka Projections I think it would be best to have a pluggable SPI for what to do with failed/retried events. The default implementation can be to just log them with ordinary logging. Others can replace the implementation and, for example, write them to a table that can be used by an admin UI.
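As a rough illustration of what such a pluggable SPI could look like, here is a sketch with a logging default and a replaceable implementation. All names (`FailureHandler`, `FailureRecord`, `process`) are hypothetical and not taken from akka-projection; `println` stands in for ordinary logging, and the in-memory recorder stands in for a failure log table.

```scala
object FailureHandlerSketch {
  // Hypothetical description of one failed attempt.
  final case class FailureRecord(projectionName: String, offset: Long, attempt: Int, message: String)

  // Hypothetical SPI: what to do with a failed/retried event.
  trait FailureHandler {
    def onFailure(record: FailureRecord): Unit
  }

  // Default implementation: just log (println stands in for a logger).
  object LoggingFailureHandler extends FailureHandler {
    def onFailure(record: FailureRecord): Unit =
      println(
        s"Projection ${record.projectionName} failed at offset ${record.offset} " +
          s"(attempt ${record.attempt}): ${record.message}")
  }

  // Replacement implementation: keep every failure, as a failure log table would.
  final class RecordingFailureHandler extends FailureHandler {
    private var records = Vector.empty[FailureRecord]
    def onFailure(record: FailureRecord): Unit = synchronized { records = records :+ record }
    def history: Vector[FailureRecord] = synchronized(records)
  }

  // Runs one processing step, reports a failure to the handler, then rethrows
  // so that the usual restart/backoff behaviour is unaffected.
  def process(projectionName: String, offset: Long, attempt: Int, handler: FailureHandler)(step: => Unit): Unit =
    try step
    catch {
      case scala.util.control.NonFatal(e) =>
        handler.onFailure(FailureRecord(projectionName, offset, attempt, String.valueOf(e.getMessage)))
        throw e
    }
}
```

Because the recorder never overwrites earlier entries, it keeps the history needed for post mortem analysis.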

jroper commented 4 years ago

Of course the logs are necessary for post mortem; this wouldn't replace that. The point is not so much reporting whether a failure happened (that's what logs are for), but rather saying whether there is currently an event that is failing to be processed (something that the logs can't tell you). Just as important as storing the failure is storing the absence of a failure when it's resolved, i.e., we need somewhere where the current state of the projection is recorded.

The count of errors and the last error are only conveniences: they allow getting the current error and count of errors without going to the logs, but that's not the core of the feature being requested here.

patriknw commented 4 years ago

Ok, but would it be a problem to store that in a separate table to keep the offset table clean of this concern?

Also note that database connectivity failures will be the most common failure for projections and when using the same db as the offset db it will not be possible to update the fail count.

patriknw commented 4 years ago

but it’s not a big deal, we can add the count and err msg to the table

jroper commented 4 years ago

> Also note that database connectivity failures will be the most common failure for projections and when using the same db as the offset db it will not be possible to update the fail count.

Yes, this is true - though in those scenarios, you're likely to have triggered other monitoring/alerts, and you're not going to be looking at this status. I think this is more useful for non-transient failures, i.e., the presence of a failed projection when everything else is succeeding tells you a lot. In our system we just find it convenient to be able to instantly query the status of the projections when a user says "X isn't working". That is much quicker than going to the logs and looking for the absence of specific errors in the last 10 minutes etc.

jroper commented 4 years ago

> Ok, but would it be a problem to store that in a separate table to keep the offset table clean of this concern?

That could be done, though the problem is in how to clear a failure. You'll need to read it every time you start the projection, in addition to reading the offset table, so that you know whether you're starting processing from an event that failed on the last attempt; if you are, you can then clear the error status when that event succeeds. If you put it in the main offset table, then the query that saves the offset can also clear the failure status, without having to execute a second query.

patriknw commented 4 years ago

My thought was to not clear it from that table; instead, the UI would have to compare that offset with the latest offset from the offset table. Anyway, it's probably better to keep it simple and add some error columns to the offset table.
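The comparison described here, where failures are never cleared and the current status is derived by comparing offsets, could be sketched as below. The types and the `status` function are hypothetical, and offsets are assumed to be comparable sequence numbers.

```scala
object StatusSketch {
  sealed trait Status
  case object Running extends Status
  final case class Failing(offset: Long, error: String) extends Status

  // Hypothetical entry in a separate, append-only failure table.
  final case class FailureEntry(offset: Long, error: String)

  // latestStoredOffset: last successfully processed offset from the offset table.
  // lastFailure: most recent entry in the failure table (never cleared).
  // The projection is still failing only if the offset table has not yet
  // reached the offset that last failed.
  def status(latestStoredOffset: Option[Long], lastFailure: Option[FailureEntry]): Status =
    lastFailure match {
      case Some(f) if latestStoredOffset.forall(_ < f.offset) => Failing(f.offset, f.error)
      case _                                                  => Running
    }
}
```

For example, a last failure at offset 10 with a stored offset of 9 reads as failing, while a stored offset of 10 or more means the failure has since been resolved.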

jroper commented 4 years ago

Good point with the two tables, that sounds like a reasonable design to me.

Another alternative could be holding the state in memory - though I'm not sure how straightforward that would be in the current design, since the projection itself is managed by a backoff supervisor, so that state would only be known by the supervisor, and so I guess not easily queried. I suppose an additional cluster singleton could be created that serves no purpose other than to hold the last error that was encountered. When the singleton moves, there is no need for the new singleton to recover its state: if the projection is persistently failing, it will learn soon enough - although the count of failures would become inaccurate, so you'd have to store it in a CRDT. And that would be another alternative: storing the last error in a LWWRegister (the count of errors would be best effort, based on the node's existing value in the register; you couldn't use a counter because there's no way to reset them, so you'd have to have a separate counter for each offset in an ORMap or something, which would accumulate garbage). All these approaches would require comparing the last failed offset with the current offset in the offset tracking table, which I think is fine.

Maybe the best approach is a pluggable error handler. It requires very low API and behavior commitment and no schema commitment, allows the greatest flexibility to users, and allows us to provide out-of-the-box implementations in future when we have more user experience/feedback to base a decision on. For my use cases, I'd be quite happy with that approach.

seglo commented 4 years ago

> Maybe the best approach is a pluggable error handler, it requires very low API and behavior commitment, no schema commitment, allows the greatest flexibility to users, and allows us to provide out of the box implementations in future when we have more user experience/feedback to base a decision on.

+1 on the pluggable error handler approach. Since this would be useful as a health indicator for monitoring purposes, maybe it's something that can be built upon the Telemetry SPI in #53 (issue: #52) in a subsequent PR. It already exposes an onFailure hook to catch stream failure exceptions.

ignasi35 commented 4 years ago

A non-durable alternative could be similar to Lagom's ProjectionRegistry, which keeps track of (and provides access to) the desired state of each projection: Started/Stopped.

The state is a CRDT as @jroper hinted above.

A ProjectionRegistry would be a useful addition in akka-projection and I think it'd be the right place for the information requested by James: is this projection healthy?


As @seglo points out, some information could already be gathered and managed by a third party via the telemetry SPI. The only issue is that each projection implementation is responsible for calling the methods of the telemetry SPI; it's not a transparent feature for the projection developers (it is for the final user).

For failures, the SPI currently includes:

def onFailure(projectionId: ProjectionId, cause: Throwable, systemProvider: ClassicActorSystemProvider): Unit

The telemetry for Lagom projections uses a (very) similar SPI and the default dashboards for Lagom projections telemetry include a failure rate indicator. It's not accessible programmatically but may be good enough for human operators.
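To show how a failure hook like the one above could feed a programmatic "is this projection healthy?" query, here is a self-contained, non-durable sketch. It is an assumption-laden illustration, not the akka-projection API: `ProjectionId` is a local stand-in, the `systemProvider` parameter is omitted, and `onSuccess` is a hypothetical hook, needed because a failure hook alone cannot tell when a failure has been resolved.

```scala
object HealthRegistrySketch {
  // Local stand-in for the library's projection identifier.
  final case class ProjectionId(name: String, key: String)

  // Failure count and last cause for a projection that is currently failing.
  final case class FailureState(count: Int, lastCause: Throwable)

  // In-memory, single-node registry; state is lost on restart.
  final class HealthRegistry {
    private var failing = Map.empty[ProjectionId, FailureState]

    // Would be called from a failure hook such as onFailure.
    def onFailure(projectionId: ProjectionId, cause: Throwable): Unit = synchronized {
      val count = failing.get(projectionId).fold(1)(_.count + 1)
      failing = failing.updated(projectionId, FailureState(count, cause))
    }

    // Hypothetical hook: called once an event has been processed successfully.
    def onSuccess(projectionId: ProjectionId): Unit = synchronized {
      failing = failing - projectionId
    }

    def failureOf(projectionId: ProjectionId): Option[FailureState] =
      synchronized(failing.get(projectionId))

    def isHealthy(projectionId: ProjectionId): Boolean =
      failureOf(projectionId).isEmpty
  }
}
```

In a cluster, this state would live on whichever node runs the projection, so a real registry would still need some way to share it, such as the CRDT approach mentioned earlier in the thread.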