fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Alert #help-p1 or #help-p2 when there are cron failures in managed cloud #19930

Open roperzh opened 4 months ago

roperzh commented 4 months ago

Problem

Cron failures in managed cloud do not report to the #help-p1 or #help-p2 channels, which means we're not aware of them unless issues are reported by the customer.

Solution

  1. Define the cron jobs we want to be alerted about when they fail, and whether each should alert #help-p1 or #help-p2.
    • We probably want to know when any cron fails, at least in #help-p2.
    • For #help-p1, we want the critical things like SCEP certificate renewals.
  2. Ensure those cron jobs are writing a standardized, formatted message to the logs (a rough sketch follows this list).
  3. Set up CloudWatch notifications (with support from @rfairburn as needed) to trigger alerts in the #help-p1 or #help-p2 channel.
    • This will need to be defined in Terraform. We'll want something like CloudWatch Logs > Lambda > SNS topic.
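
To make step 2 concrete, here's a minimal sketch of what a standardized, machine-matchable failure line could look like. This is illustrative only; the `cron_failure` marker and field names are hypothetical, not Fleet's actual log schema:

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
)

// cronFailure is a hypothetical standardized shape for cron failure log lines,
// so a CloudWatch subscription filter can match on a fixed marker like "cron_failure".
type cronFailure struct {
	Event   string `json:"event"`    // fixed marker to filter on
	CronJob string `json:"cron_job"` // name of the failing job
	Error   string `json:"error"`    // failure reason
}

// logCronFailure emits one JSON log line per failed cron run.
func logCronFailure(jobName string, err error) {
	b, _ := json.Marshal(cronFailure{
		Event:   "cron_failure",
		CronJob: jobName,
		Error:   err.Error(),
	})
	log.Println(string(b))
}

func main() {
	logCronFailure("vulnerabilities", errors.New("NVD feed download timed out"))
}
```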
roperzh commented 4 months ago

cc: @lukeheath @georgekarrv

lukeheath commented 4 months ago

@roperzh Thanks for submitting this. Good idea, and it seems necessary. I'm prioritizing it to the drafting board for estimation.

lukeheath commented 4 months ago

@georgekarrv I'm prioritizing this for estimation. If it requires assistance from infra, please loop in the necessary folks.

@sharon-fdm Do y'all have any cron job alerts that should be generating #help-p1 alerts? If so, please add them to this issue description so they can be included.

@rfairburn Heads up that you may get some questions about this. My primary goal is just to make sure that any cron alerts that should generate #help-p1 alerts (like SCEP certificate) do generate alerts.

Thanks all!

rfairburn commented 4 months ago

This should be possible, but an entirely different mechanism will need to be added to the Terraform monitoring module.

I am thinking the pattern would look like this:

CloudWatch subscription filter -> Lambda function (to process) -> SNS topic (alert #help-p1 Slack)

I'd need to flesh out the specifics, but I think it's definitely possible as long as we have easy-to-match patterns without ambiguity in what I'm matching.

This would be separate from our existing cron monitoring which just checks specific cron jobs for their completed status in the db.
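
To sketch what that middle Lambda stage might look like in Go (not a finished implementation; the topic ARN is a placeholder, and the real matching logic would depend on whatever log format we standardize on):

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
)

// handler receives log events forwarded by a CloudWatch subscription filter,
// formats each one, and publishes it to an SNS topic wired to Slack.
func handler(ctx context.Context, ev events.CloudwatchLogsEvent) error {
	data, err := ev.AWSLogs.Parse() // gunzips and decodes the log payload
	if err != nil {
		return err
	}
	client := sns.New(session.Must(session.NewSession()))
	for _, le := range data.LogEvents {
		msg := fmt.Sprintf("cron failure in %s: %s", data.LogGroup, le.Message)
		if _, err := client.Publish(&sns.PublishInput{
			TopicArn: aws.String("arn:aws:sns:us-east-1:000000000000:help-p1"), // placeholder ARN
			Message:  aws.String(msg),
		}); err != nil {
			return err
		}
	}
	return nil
}

func main() { lambda.Start(handler) }
```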

sharon-fdm commented 4 months ago

@lukeheath I am not aware of any alerting mechanism for any of our cron jobs. @getvictor @mostlikelee @lucasmrod do you know otherwise?

Also, @mostlikelee do we have any failure notification when one of our vuln repos fails to do its job?

lukeheath commented 4 months ago

@sharon-fdm Yeah, I expect there isn't any hooked up yet. MDM ran into a case where the Fleet server knew the SCEP certificate was expired, but there was no mechanism to alert beyond server logs (from my 50k' view). Since that broke MDM functionality, we would have wanted to know about it in #help-p1.

I'm wondering if EO has anything like that, where an error that would otherwise be logged to the server is something that we'd actually like to know about ASAP in #help-p1. There may not be anything like that for EO since y'all don't deal with as many certs as MDM.

getvictor commented 4 months ago

@sharon-fdm Normal GitHub Actions workflows can do a Slack notification on fail, like: https://github.com/fleetdm/fleet/blob/223e1f23620df9f12bc523ed0ce0ab6cb0d29ae2/.github/workflows/test-go.yaml#L147

roperzh commented 4 months ago

Hey folks, sorry for not being clear (I created the issue in a hurry).

I've updated the title. As Luke mentions, this is to surface cron errors that happen in cloud, as they currently don't have proper visibility (so we can't use GitHub Actions, etc.).

If you have any crons that are mission-critical and want to surface alerts, this is the place to ask for them :)

mostlikelee commented 4 months ago

> Also, @mostlikelee do we have any failure notification when one of our vuln repos fails to do its job?

Yes, this alerts to #help-p2.

I can't think of anything critical in EO, but I think it would be a good metric to monitor cron failure counts.

georgekarrv commented 4 months ago

Hey team! Please add your planning poker estimate with Zenhub @dantecatalfamo @ghernandez345 @gillespi314 @mna @roperzh @jahzielv

iansltx commented 3 months ago

Based on @ksatter in https://fleetdm.slack.com/archives/C051QJU3D0V/p1723414521461399?thread_ts=1723413693.587369&cid=C051QJU3D0V, we need to mark jobs as something other than "completed" (e.g. "failed") when they fail, at which point we can filter on that + cron in CloudWatch Logs Insights to get something that we can trigger alerting on.

If we had had that alerting on the vulnerabilities cron (including marking as failed), I think we would've caught #21239 Friday afternoon. If that's enough reason to pull this into the sprint, happy to take it as a way to learn more about that part of the codebase.
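
For illustration, the kind of Logs Insights filtering described above might be run like this from Go; the log group name and query string here are assumptions, not our actual configuration:

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

func main() {
	sess := session.Must(session.NewSession())
	cw := cloudwatchlogs.New(sess)

	// Start a Logs Insights query over the last 24 hours for cron log lines
	// whose status was recorded as "failed" (per the proposal above).
	out, err := cw.StartQuery(&cloudwatchlogs.StartQueryInput{
		LogGroupName: aws.String("/fleet/cloud/example"), // hypothetical log group
		StartTime:    aws.Int64(time.Now().Add(-24 * time.Hour).Unix()),
		EndTime:      aws.Int64(time.Now().Unix()),
		QueryString: aws.String(
			`fields @timestamp, @message | filter @message like /cron/ and @message like /failed/`),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("query id:", *out.QueryId) // poll GetQueryResults with this ID
}
```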

sgress454 commented 6 days ago

After reading the comments so far and looking at the current implementation, I have a slightly different proposal. I like the idea of a "failed" status for the cron jobs, but instead of trying to correlate a failed job with a log message, how about adding a new field to the cron_stats table (e.g. "failed_reason" or "notes")?

Then we can update the existing monitoring Lambda to check for new failed jobs, and send an SNS message directly from the Lambda using the persisted failure reason. This Lambda already reads from the table and sends SNS messages, so the new functionality would be in keeping with the existing use.

The only gap I see is that the current setup only provides for configuring a single SNS topic, which is directed to #help-p1, so if we wanted failures from certain jobs to go to #help-p2 we'd need to set up a new topic in Terraform and account for it in the Lambda code.
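
A rough sketch of that approach, with the proposed `failed_reason` column and `failed` status, plus a placeholder DSN and topic ARN (the real monitoring Lambda's schema and per-job topic routing would differ):

```go
package main

import (
	"database/sql"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; the real Lambda would pull credentials from its environment.
	db, err := sql.Open("mysql", "user:pass@tcp(db-host:3306)/fleet")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "failed_reason" is the proposed new column; "failed" the proposed status.
	rows, err := db.Query(
		`SELECT name, failed_reason FROM cron_stats
		 WHERE status = 'failed' AND updated_at > NOW() - INTERVAL 10 MINUTE`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	client := sns.New(session.Must(session.NewSession()))
	for rows.Next() {
		var name, reason string
		if err := rows.Scan(&name, &reason); err != nil {
			log.Fatal(err)
		}
		// Per-job routing to the p1 vs. p2 topic would go here; a single
		// placeholder topic keeps the sketch short.
		if _, err := client.Publish(&sns.PublishInput{
			TopicArn: aws.String("arn:aws:sns:us-east-1:000000000000:help-p1"), // placeholder
			Message:  aws.String("cron job " + name + " failed: " + reason),
		}); err != nil {
			log.Fatal(err)
		}
	}
}
```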

rfairburn commented 5 days ago

I like the idea of the DB table recording the status. It would make processing in the Lambda much easier.

We should still *also* include failures in the logs for those outside of cloud/AWS who aren't running the Lambda, however.