dterror-zz closed 6 years ago
`camus-errors` is generic enough, since we don't actually check what the error was. Just a heads-up: this would also catch things like the "max pull time" error, which we already alert on and which is not fatal. It might be a bit chatty, but there's no harm in trying it out.
Dang, that's true. It will get chatty. I wonder if I can filter those out (I'll try it).
I've added instrumentation for `close-to-earliest-offset` as well.
That's a great idea, but maybe we can change the alert condition? The email is very chatty, and ideally we would filter out things like `The current offset is too close to the earliest offset, Camus might be falling behind: trekkie.overseer.clickstream uri:tcp://kafka1.data-chi.shopifydc.com:9092 leader:1 partition:1 earliest_offset:187 offset:187 latest_offset:189 avg_msg_size:2906 estimated_size:5812`. What do you think?
I've tried to reduce the noise by adding an absolute threshold to the check behind the "offset too close to earliest offset" log/email. It's not perfect, but it will exclude cases where processing is not falling behind and the topic is simply very sparse. Thoughts?
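For illustration, here is a rough sketch of how an absolute threshold could suppress the warning on sparse topics. It is not the code in this PR: the function name, the `min_backlog` and `closeness_ratio` parameters, and their default values are all hypothetical, and it is written in Python only to keep it short.

```python
def should_warn_close_to_earliest(earliest, offset, latest,
                                  min_backlog=1000, closeness_ratio=0.1):
    """Decide whether to log/email the 'too close to earliest offset' warning.

    min_backlog and closeness_ratio are hypothetical knobs, not values
    taken from the actual change.
    """
    retained = latest - earliest      # messages still retained by the broker
    if retained <= 0:
        return False                  # empty partition, nothing to fall behind on

    backlog = latest - offset         # messages not yet consumed
    if backlog < min_backlog:
        return False                  # absolute threshold: sparse topic, stay quiet

    # Relative check: the consumer sits in the oldest slice of retained data.
    return (offset - earliest) <= closeness_ratio * retained


# The partition from the noisy email above: only 2 messages are pending.
print(should_warn_close_to_earliest(187, 187, 189))
# A busy partition where the consumer is near the earliest offset.
print(should_warn_close_to_earliest(0, 50, 100_000))
```

With these assumed defaults, the first call stays quiet and the second one warns.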
File errors and ETL errors that get rescued by the system are logged at the end of the execution. We also want to emit an event to StatsD so that we can track them properly.
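As a minimal sketch of what emitting that event could look like, the snippet below sends one StatsD counter using the plain `name:value|c` datagram format over UDP. The helper names, the host and port defaults, and the idea of sending the number of rescued errors as the counter value are assumptions for illustration. The real change would go through whatever StatsD client the job already uses.

```python
import socket


def format_counter(name, value=1):
    # Plain StatsD counter datagram: "<name>:<value>|c"
    return f"{name}:{value}|c".encode("ascii")


def emit_counter(name, value=1, host="127.0.0.1", port=8125):
    # host/port are assumed defaults; StatsD conventionally listens on UDP 8125.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))


def report_rescued_errors(errors, emit=emit_counter):
    """Emit one counter for the errors rescued during a run (hypothetical helper)."""
    if errors:
        emit("camus-errors", len(errors))
    return len(errors)


# Example: inspect the datagram that would be sent for three rescued errors.
print(format_counter("camus-errors", 3))
```

Passing `emit` as a parameter keeps the reporting step testable without a running StatsD daemon.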
Mmm, so I was able to test it locally by reproducing a Kafka read exception.
@olessia is `camus-errors` too generic a metric name?