fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Maintenance windows every week #19031

Closed lukeheath closed 3 months ago

lukeheath commented 6 months ago

Goal

User story
As an IT admin,
I want to see maintenance windows weekly
so that I can resolve all high and critical vulnerabilities within 15 days.

Context

Many Fleet instances must resolve all high and critical vulnerabilities within 15 days. The current patch schedule of once per month does not meet these requirements unless a patch is issued within 15 days of the last Tuesday of the month. That means in order to be compliant, these Fleet instances cannot use the calendar feature and must rely on notifying and forcing the end user.

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Enable calendar integration and a failing policy that creates calendar event(s).
  2. Check to make sure the event was created on the next Tuesday.

Testing notes

Confirmation

  1. [x] Engineer (@getvictor): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

noahtalerman commented 6 months ago

@lukeheath thanks for tracking this.

Once per week, every two weeks, every three, or last Tuesday (four weeks).

If these were the options, which do you think Fleet would choose when dogfooding?

lukeheath commented 6 months ago

@noahtalerman I would choose one week. Reasons:

  1. We remediate all vulnerabilities within 15 days of detection.
  2. If we choose two weeks, and all vulnerabilities are not resolved on all devices during the maintenance window, we may not have time to manually resolve them before 15 days pass if the vulnerabilities were discovered on the first day of the maintenance cycle.

The only way we can confidently remediate within 15 days is to schedule maintenance windows weekly.

@spokanemac We welcome your input! Do you agree with my thoughts above?

spokanemac commented 6 months ago

@lukeheath Yes, agreed on selecting one week as the interval.

@noahtalerman I would add that with weekly maintenance windows, we have the opportunity for two windows on a user calendar to remediate over 15 days, with a few days to intervene manually.

This also helps account for potential OOO situations where the host may be offline for a week.
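The interval reasoning in this thread can be checked with a little arithmetic. The Go sketch below is illustrative only and is not Fleet code: the function names `guaranteedWindows` and `slackAfterFirstWindow` are hypothetical, and it assumes the worst case where a vulnerability is detected just after a maintenance window has passed, so the next window is a full interval away.

```go
package main

import "fmt"

// guaranteedWindows returns how many maintenance windows are certain to fall
// within deadlineDays of detection, assuming the worst case: detection happens
// just after a window, so windows land on day interval, 2*interval, and so on.
// (Hypothetical helper for illustration.)
func guaranteedWindows(intervalDays, deadlineDays int) int {
	if intervalDays <= 0 {
		return 0
	}
	return deadlineDays / intervalDays
}

// slackAfterFirstWindow returns the days left before the deadline once the
// first worst-case window has passed. A negative value means the first window
// arrives after the deadline. (Hypothetical helper for illustration.)
func slackAfterFirstWindow(intervalDays, deadlineDays int) int {
	return deadlineDays - intervalDays
}

func main() {
	const deadline = 15 // remediation deadline in days, from the user story
	// Candidate intervals from the discussion: 1, 2, 3, and 4 weeks.
	for _, interval := range []int{7, 14, 21, 28} {
		fmt.Printf("every %d days: %d guaranteed window(s), %d day(s) of slack after the first\n",
			interval,
			guaranteedWindows(interval, deadline),
			slackAfterFirstWindow(interval, deadline))
	}
}
```

Under these assumptions, only the weekly interval guarantees two windows inside the 15-day deadline, and a two-week interval leaves a single day for manual remediation after its one window.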

noahtalerman commented 3 months ago

Hey @lukeheath, @Drew-P-drawers, and @spokanemac heads up that we shipped this maintenance windows improvement.

It looks like we have an article about the feature here, but there's no mention of the old timing (every 3rd Tuesday).

Are there any other guides/articles that need to be updated?

TODO @noahtalerman:

spokanemac commented 3 months ago

@noahtalerman No other guides at this time. My dogfooding article is still a WIP.

noahtalerman commented 3 months ago

Update the maintenance windows diagram in Figma here so it says every week instead of every 3rd Tuesday

@spokanemac and @lukeheath, instead of updating the existing flow chart, I created a v2 of the flow chart and linked to it in this issue's description.

Here's my understanding of the behavior after shipping this story:

Screenshot 2024-07-24 at 9 16 42 AM

@getvictor is that accurate? If that's right can you please close this issue? Thanks :)

getvictor commented 3 months ago

Should be:

If it’s Tuesday and it’s past the last slot, schedule the event for the next business day.

If the webhook already fired but policy is still failing, schedule the event for the next Tuesday. Grace period of 1 day after the webhook fires before scheduling another calendar event.

But what should happen if host was offline during the event? In that case, we try to reschedule the event for the same day. This is how we end up with an event every hour. image
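The date-selection rules described in this comment can be sketched in Go as follows. This is a minimal illustration, not Fleet's implementation: `eventDate`, `nextWeekday`, `nextBusinessDay`, and `day` are hypothetical names, the precedence between the two rules is an assumption, "business day" is assumed to mean Monday through Friday with no holiday handling, and the 1-day grace period and the offline-host rescheduling are not modeled.

```go
package main

import (
	"fmt"
	"time"
)

// day builds a UTC date; a small helper for the examples below.
func day(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

// nextWeekday returns the first date falling on wd that is on or after from
// (inclusive == true) or strictly after from (inclusive == false).
func nextWeekday(from time.Time, wd time.Weekday, inclusive bool) time.Time {
	days := (int(wd) - int(from.Weekday()) + 7) % 7
	if days == 0 && !inclusive {
		days = 7
	}
	return from.AddDate(0, 0, days)
}

// nextBusinessDay returns the first Monday-Friday date strictly after from.
// Assumption: holidays are ignored.
func nextBusinessDay(from time.Time) time.Time {
	d := from.AddDate(0, 0, 1)
	for d.Weekday() == time.Saturday || d.Weekday() == time.Sunday {
		d = d.AddDate(0, 0, 1)
	}
	return d
}

// eventDate picks the day for the next calendar event.
// Assumed precedence: the "webhook already fired" rule wins over the
// "Tuesday past the last slot" rule; otherwise use the upcoming Tuesday.
func eventDate(today time.Time, pastLastSlot, webhookFiredStillFailing bool) time.Time {
	switch {
	case webhookFiredStillFailing:
		// Policy still failing after remediation ran: wait for next week's Tuesday.
		return nextWeekday(today, time.Tuesday, false)
	case today.Weekday() == time.Tuesday && pastLastSlot:
		// No slots left today: fall through to the next business day.
		return nextBusinessDay(today)
	default:
		return nextWeekday(today, time.Tuesday, true)
	}
}

func main() {
	tuesday := day(2024, time.July, 23)
	fmt.Println(eventDate(tuesday, true, false).Format("Mon 2006-01-02"))
	fmt.Println(eventDate(tuesday, false, true).Format("Mon 2006-01-02"))
}
```

Keeping the decision in one pure function of the current date and two flags makes the flowchart branches easy to test in isolation, independent of how often the cron job calls it.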

noahtalerman commented 3 months ago

@getvictor thanks! I updated the behavior. Please let me know if that looks right:

Screenshot 2024-07-25 at 9 34 19 AM

But what should happen if host was offline during the event?

I think we decided to exit instead of scheduling more events for that day. From the flowchart in Figma here:

Screenshot 2024-07-25 at 9 33 31 AM

noahtalerman commented 3 months ago

@getvictor giving you an extra ping^ :)

I think we decided to exit instead of scheduling more events for that day. From the flowchart in Figma here

Did we change this behavior as part of a story I'm forgetting? (I definitely could be). If not, then I think we can track a bug for this.

getvictor commented 3 months ago

@getvictor giving you an extra ping^ :)

I think we decided to exit instead of scheduling more events for that day. From the flowchart in Figma here

Did we change this behavior as part of a story I'm forgetting? (I definitely could be). If not, then I think we can track a bug for this.

We do exit. But 5 minutes later we enter this flowchart again with another cron run.

noahtalerman commented 3 months ago

We do exit. But 5 minutes later we enter this flowchart again with another cron run.

Ah, makes sense. I think that's ok for now.

I updated the flowchart here to make sure that it's documented somewhere.

FYI @sharon-fdm, because you're working on the guide for scheduled maintenance. I think it makes sense to call this behavior out in the guide (along with a summary of the flowchart).

fleet-release commented 3 months ago

Weekly window comes, Vulnerabilities addressed, Safe in the cloud's arms.

sharon-fdm commented 3 months ago

@noahtalerman your flowchart link in this msg is broken.

noahtalerman commented 3 months ago

Thanks for the heads up @sharon-fdm! I fixed the link: https://www.figma.com/design/AeCMzgaSqN4DXzTrKxvdYh/%2319031-Maintenance-windows-every-week?node-id=2-130&t=QQNKwOc7xgnvqx1v-1