AbsaOSS / spline

Data Lineage Tracking And Visualization Solution
https://absaoss.github.io/spline/
Apache License 2.0
588 stars 154 forks source link

Kafka :: message failure handling #1279

Open wajda opened 9 months ago

wajda commented 9 months ago

Problem

At the moment the Kafka consumer is implemented in a way that all messages are automatically acknowledged regardless if the processing was successful or not. This would lead to data loss in case of database failure, or a processing timeout due to the message size, for example.

Goal

We need to implement a robust error handling strategy.

Solution proposal

Based on the error type, we can decide if the error is persistent or temporary and apply different logic for each type:

Persistent errors

The error is considered persistent if it's evident or very likely that the error would re-occur on any retry processing attempt in the future. For example, the message is malformed or invalid, or it contradicts the current state of the database (e.g. constraint violation) . In such cases we would acknowledge the message and send it to the dead letters topic, where it would sit and wait manual intervention.

Temporary errors

It the database is experiencing issues, an unexpected error happens in the app code, or a message dependency aren't fulfilled, we know that the situation that lead to the error is rather temporary and might be fixed soon. In that case we would acknowledge the message and send it to a retry topic, from where the messaged would be automatically pulled and retried periodically.

cerveada commented 9 months ago

At the moment the Kafka consumer is implemented in a way that all messages are automatically acknowledged regardless if the processing was successful or not.

Are you sure? I thought it will not acknowledge the message if the underlying layer throws an error.

wajda commented 9 months ago

Yes, I'm sure. I specifically verified it yesterday.