Sending StatsD events for non-fatal errors

Shopify / camus

Kafka->HDFS pipeline from LinkedIn. It is a MapReduce job that does distributed data loads out of Kafka.

Sending StatsD events for non-fatal errors #135

Closed. dterror-zz closed this 6 years ago

dterror-zz commented 6 years ago

File errors and ETL errors that get rescued by the system are logged at the end of the execution. We also want to emit an event to StatsD so we can track them properly.
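For readers unfamiliar with StatsD, here is a minimal sketch of what emitting such an event can look like. It is not the code from this pull request: the class name, method names, host, and port are hypothetical, and it talks to StatsD directly over UDP rather than through whatever client library Camus uses. The only part taken from the StatsD protocol itself is the counter line format, `<metric>:<value>|c`.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Camus.
final class ErrorCounterSketch {

    // Builds a StatsD counter line: <metric>:<value>|c
    static String counterLine(String metric, long count) {
        return metric + ":" + count + "|c";
    }

    // Fire-and-forget UDP send. Any failure is swallowed and reported as
    // false, so a metrics problem never fails the job itself.
    static boolean send(String host, int port, String line) {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(host);
            socket.send(new DatagramPacket(payload, payload.length, address, port));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

At the end of a run, a job that rescued three errors could call `ErrorCounterSketch.send("localhost", 8125, ErrorCounterSketch.counterLine("camus-errors", 3))`, assuming a StatsD daemon listening on the conventional port 8125.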

Mmm, so I was able to test it locally by reproducing a Kafka read exception.

@olessia is camus-errors too generic a metric name?

olessia commented 6 years ago

camus-errors is generic enough, since we don't actually check what the error was. Now, just a heads up: this would also catch things like "max pull time", which we already alert on and which is not fatal. It might be a bit chatty, but there's no harm in trying it out.

dterror-zz commented 6 years ago

"max pull time": dang, that's true. It will get chatty. I wonder if I can filter those (I'll try it out).
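One way such filtering could look, as a rough sketch: skip the known non-fatal messages before counting. Everything here is an assumption, including the class name, the method names, and especially the marker substring, since the thread does not show the exact text Camus logs when the max pull time is reached.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical filter for illustration only; not part of Camus.
final class ErrorFilterSketch {

    // Assumed marker substring; the real Camus message text may differ.
    static final String MAX_PULL_TIME_MARKER = "max pull time";

    // An error is counted unless its message carries the marker.
    // Errors without a message are still counted.
    static boolean shouldCount(String message) {
        if (message == null) {
            return true;
        }
        return !message.toLowerCase(Locale.ROOT).contains(MAX_PULL_TIME_MARKER);
    }

    // Number of rescued errors that should be reported to the metric.
    static long countReportable(List<String> messages) {
        long count = 0;
        for (String message : messages) {
            if (shouldCount(message)) {
                count++;
            }
        }
        return count;
    }
}
```

Matching on message text is brittle; if the rescued errors carry distinct exception types, filtering on the type would be the sturdier choice.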

dterror-zz commented 6 years ago

I've added instrumentation for close-to-earliest-offset as well.

olessia commented 6 years ago

That's a great idea, but maybe we can change the alert condition? The email is very chatty, and ideally we would filter things like:

"The current offset is too close to the earliest offset, Camus might be falling behind: trekkie.overseer.clickstream uri:tcp://kafka1.data-chi.shopifydc.com:9092 leader:1 partition:1 earliest_offset:187 offset:187 latest_offset:189 avg_msg_size:2906 estimated_size:5812"

What do you think?

dterror-zz commented 6 years ago

I've tried to reduce the noise by adding an absolute threshold to the check that triggers the "offset too close to earliest offset" log/email. It's not perfect, but it will exclude cases where processing is not falling behind and the topic is just very sparse. Thoughts?
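To illustrate the idea, here is a sketch of a check with an absolute threshold. It is not the code from this pull request: the class name, the parameter names, and the relative "too close" rule are all assumptions, since the thread does not show how Camus decides that an offset is too close to the earliest one.

```java
// Hypothetical check for illustration only; not part of Camus.
final class OffsetCheckSketch {

    // Warn only when the partition retains at least minRetainedMessages
    // (the absolute threshold) AND the current offset sits within
    // closeFraction of the retained range, measured from the earliest offset.
    static boolean shouldWarn(long earliestOffset, long currentOffset, long latestOffset,
                              double closeFraction, long minRetainedMessages) {
        long retained = latestOffset - earliestOffset;
        if (retained < minRetainedMessages) {
            // Sparse topic: being near the earliest offset says nothing
            // about falling behind, so stay quiet.
            return false;
        }
        long headroom = currentOffset - earliestOffset;
        return headroom < closeFraction * retained;
    }
}
```

With the values from the log line quoted earlier in the thread (earliest_offset:187, offset:187, latest_offset:189), only 2 messages are retained, so any threshold above 2 suppresses the warning.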