ZipRecruiter / cloudwatching

Exporter no longer emitting metrics after eu-central-1 event #6

Closed · justin-watkinson-sp closed 4 years ago

justin-watkinson-sp commented 4 years ago

Hey there!

Wanted to get this on the radar. I haven't had a chance to dig, but I got a report today from one of our folks that our RDS data was suddenly missing from Prometheus. Looking back at the timeline, it coincides with the eu-central-1 event on Tuesday.

Restarting the exporter container restored service, so I guess the takeaway here is to either fail a health check or somehow convince the exporter to exit when it gets stuck. We run multiple configs in different exporters, and the impacted one contains AWS ES, ELB, ALB, and RDS in its config. I'm guessing the APIs were unresponsive and got the exporter into a strange state.

frioux commented 4 years ago

It might be good to surface the last-updated time and suggest people alert if that is too far in the past. I can add that; if you find more info, please pass it along. Maybe we are somehow missing some timeouts?
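
The alert side could look something like this once that metric exists (rough sketch only; the metric name below is a placeholder, since nothing like it is exported today):

```yaml
groups:
  - name: cloudwatching-staleness
    rules:
      - alert: CloudwatchingStale
        # Placeholder metric name for whatever "last updated" gauge ends up being exported.
        expr: time() - cloudwatching_last_update_timestamp_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "cloudwatching has not refreshed its CloudWatch data in over 15 minutes"
```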

justin-watkinson-sp commented 4 years ago

Yeah, I was just reading up on the absent function, which could also be used to drive an alarm for this (without any extra instrumentation). If nothing else, that could live in the README as a best practice.
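
For example, something along these lines (untested sketch; substitute a metric name the exporter actually emits for your config):

```yaml
groups:
  - name: cloudwatching-absent
    rules:
      - alert: CloudwatchMetricsAbsent
        # Placeholder metric name; use any series this exporter normally produces
        # for your resources (RDS, ELB, ALB, ES, etc.).
        expr: absent(aws_rds_free_storage_space_average)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Expected CloudWatch metrics are missing from Prometheus"
```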

I don't have debug logs enabled in production, and when I looked, all I found was the start-up message about listening.

justin-watkinson-sp commented 4 years ago

Wanted to give an update. After staring at the code for a bit, I didn't see any obvious way this could have happened: the Fatal log on a refresh error should catch pretty much everything, and had we hit that, I would have seen the container bounce several times, which I didn't.

After studying the missing AWS metrics alongside things like the up metric, it became apparent that the application did exit correctly; when it was spun back up, our Prometheus discovery mechanism (which is file based) timed out, likely due to the AWS event. Its retries were exceeded, so the exporter was probably producing viable metrics with no Prometheus aware of it, which explains the missing data.
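
For context, the discovery side is roughly shaped like this (paths, job name, and interval here are illustrative, not our actual config):

```yaml
scrape_configs:
  - job_name: cloudwatching
    file_sd_configs:
      # Targets come from JSON files on disk; if the process that writes
      # these files stops, Prometheus quietly loses the target.
      - files:
          - /etc/prometheus/targets/cloudwatching/*.json
        refresh_interval: 5m
```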

That said, all evidence points to this not being an issue with cloudwatching. Apologies for the false alarm.

frioux commented 4 years ago

Thanks for the in-depth report! I really appreciate it.
