fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Add visibility to cron failures in production #22368

Open mostlikelee opened 4 days ago

mostlikelee commented 4 days ago

Goal

User story
As a Fleet engineer,
I want to be alerted when failures occur in Fleet server cron jobs
so that I can be proactive in resolving customer issues before they are reported.

Context

Cron jobs tend to fail silently for different reasons in different environments, with varying impact. The following issues are recent examples of customer-impacting cron failures that we were not alerted to: https://github.com/fleetdm/fleet/issues/22366, https://github.com/fleetdm/fleet/issues/22364, and https://github.com/fleetdm/fleet/issues/21292

Ideally, we would have visibility into failures in both cloud-hosted and self-hosted environments.

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

lukeheath commented 4 days ago

@mostlikelee Thanks for filing this. I am prioritizing this to the drafting board for estimation and assigning it to @sharon-fdm.

Where are you thinking we should report failures? #help-p2?

This is definitely something we need visibility into.

mostlikelee commented 4 days ago

#help-p2 seems reasonable, or possibly Datadog metrics, or a combination of both.