fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Allow critical policies to be recalculated on a different cadence #11492

Open · zhumo opened this issue 1 year ago

zhumo commented 1 year ago

Goal

User story
As a Fleet admin,
I want to recalculate a critical policy more frequently than normal policies
so that my security system can keep a closer eye on the critical policies, while minimizing overhead.

Changes

This issue's estimation includes completing:

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

zayhanlon commented 1 year ago

Docs: https://fleetdm.com/docs/using-fleet/configuration-files#failing-policies-webhook

Interval: [screenshot]

zayhanlon commented 1 year ago

Interval change here and in config? [screenshot]

zayhanlon commented 1 year ago

Hey @ksatter. We're deprioritizing this issue as we won't be able to deliver it in the next 6 weeks. Please bring this back to the PFR call if it surfaces again so we can reprioritize.

noahtalerman commented 8 months ago

Zay: High/medium priority

noahtalerman commented 8 months ago

Air guitar this one.

noahtalerman commented 8 months ago

@zayhanlon heads up, we brought this into the upcoming design sprint as an air guitar.

rachaelshaw commented 7 months ago

@zayhanlon this didn't make it into the sprint, bringing back to Feature Fest

noahtalerman commented 6 months ago

@noahtalerman need to chat w/ customer-schur and customer-ufa to better understand the problem.

@zayhanlon, heads up, this didn't make the 3-week drafting timeline so we're removing it from the drafting board. Bringing back to feature fest.

noahtalerman commented 6 months ago

Heads up @zayhanlon this request was discussed during feature fest last week and didn't make it into the current design sprint.

nonpunctual commented 4 months ago

@noahtalerman @marko-lisica I'd like to discuss this one, if you get a chance, to understand why it was deprioritized. Too hard? Not enough viable use cases?

noahtalerman commented 4 months ago

@nonpunctual sounds good! Can you please add this as an agenda item to the next product office hours?

noahtalerman commented 4 months ago

Hey @ksatter and @nonpunctual, heads up, we didn't have room to take this one in the current design sprint (4.48).

nonpunctual commented 3 months ago

[Screenshot 2024-03-22 at 2 15 28 PM]

noahtalerman commented 3 months ago

@nonpunctual is that Slack thread regarding the customer that runs live queries to detect device health?

If so, I think they want to get fresh results every time the user tries to log in to Okta. If this is the case, and using live queries for this is painful, then I don't know if recalculating policies at a faster interval is the right solution.

Instead, maybe the device health API should fetch fresh results each time.

nonpunctual commented 3 months ago

@noahtalerman I am not sure. There are 6 customers attached to this issue all with various reasons for wanting the ability to set an "execution frequency" for policies. This is mostly because the customers know that running large queries on the same interval as everything else is potentially "painful" & that pain could be relieved by running their large queries less frequently.

This need is tied to having fresh data in Fleet instead of stopping after collecting data from the 1st 1000 Hosts Fleet sees. https://github.com/fleetdm/fleet/issues/397

That issue has been open for 2y.

nonpunctual commented 3 months ago

@noahtalerman @alexmitchelliii @ksatter I could be wrong in how I am understanding these customer requests but I don't think this is about getting faster results in Fleet.

It's about setting non-critical policies that collect "static" data to run on a slower interval which will reserve space for the cached data of critical policies that collect "dynamic" data.

Current workarounds are to manually run the policy, run a live query, or put all policies on a shorter interval.

Admins want Host data to be up-to-date & easily accessible in the Fleet UI. That is the relationship between this ticket & #397 Expansion of Host Vitals. Customer-ufa said the UI for looking at the info for an individual host is OK, but it's not useful for their Help Desk Fleet UI users because Fleet only caches data for the 1st 1000 hosts seen.

If something like a separate cadence for critical & non-critical policies were implemented, the data for the critical policies could be fresh & could be limited to, e.g., 25 or 50 critical policies defined by the Fleet admin, which would serve as a cap on the amount of cached data.
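To make the two-cadence idea above concrete, here is a minimal Python sketch. It is not Fleet's implementation: the `Policy` class, `validate`, `due_policies`, and the 5-minute / 1-hour intervals are hypothetical. Only the notion of a separate cadence for critical policies and an admin-defined cap (e.g. 25 or 50) comes from the comment above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical settings (not real Fleet config keys): a short cadence for
# critical policies, a slower one for everything else, and an admin-defined
# cap on how many policies may be marked critical.
CRITICAL_INTERVAL = timedelta(minutes=5)
DEFAULT_INTERVAL = timedelta(hours=1)
MAX_CRITICAL = 50


@dataclass
class Policy:
    name: str
    critical: bool
    last_run: Optional[datetime] = None  # None: never evaluated on this host


def validate(policies: List[Policy]) -> None:
    """Reject configurations that exceed the critical-policy cap."""
    count = sum(p.critical for p in policies)
    if count > MAX_CRITICAL:
        raise ValueError(f"{count} critical policies exceeds the cap of {MAX_CRITICAL}")


def interval_for(policy: Policy) -> timedelta:
    """Critical policies use the short interval; all others use the default."""
    return CRITICAL_INTERVAL if policy.critical else DEFAULT_INTERVAL


def due_policies(policies: List[Policy], now: datetime) -> List[Policy]:
    """Return the policies whose interval has elapsed since their last run."""
    return [
        p for p in policies
        if p.last_run is None or now - p.last_run >= interval_for(p)
    ]


# Ten minutes after the last run: the critical policy is due again, the
# non-critical one is not, and the never-evaluated policy is due.
now = datetime(2024, 3, 22, 14, 0)
policies = [
    Policy("disk-encryption", critical=True, last_run=now - timedelta(minutes=10)),
    Policy("installed-fonts", critical=False, last_run=now - timedelta(minutes=10)),
    Policy("firewall-enabled", critical=True),
]
validate(policies)
print([p.name for p in due_policies(policies, now)])
# ['disk-encryption', 'firewall-enabled']
```

In this sketch, non-critical policies cost nothing extra; shortening `CRITICAL_INTERVAL` or raising `MAX_CRITICAL` trades host overhead for fresher data.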

noahtalerman commented 3 months ago

> Customer-ufa said the UI for looking at the info for an individual host is ok but it's not useful for their Help Desk Fleet UI users because Fleet only caches data for the 1st 1000 hosts seen.

@nonpunctual this is great feedback. Thanks.

I think we're getting the cached query results and policy features mixed up here.

With the cached query results (what the customer is interested in), the Fleet admin can already set the frequency on a per query basis. Some queries can run every 5 minutes and other queries can run every hour.

This way, the Fleet admin can protect the performance of their devices (only run intensive queries every so often).

> It's about setting non-critical policies that collect "static" data to run on a slower interval which will reserve space for the cached data of critical policies that collect "dynamic" data.

If I'm understanding the above correctly, we just want Fleet to be able to get more data (more results) faster w/o having to worry about filling up the Fleet DB.

Ideally, the user doesn't have to worry about the "filling up the Fleet DB" part. I think we should try hard to make that Fleet's job. The customer can collect as much data as they need and as frequently as they need it.

nonpunctual commented 3 months ago

Not saying I am not confused. The thing I am trying to solve for is this:

Fleet caches data for the 1st 1000 Hosts it sees & stops caching after that.

What would be a lot more useful based on experience & lots of customer feedback is:

Fleet caches data for the most recently seen 1000 or 2000 or 5000 Hosts & lets the oldest data fall off the edge, like a Time Machine backup.
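The difference between the current behavior described in this thread (cache the first N hosts, then stop) and the proposed one (keep the N most recently seen hosts) can be sketched in Python. Both classes are hypothetical illustrations, not Fleet code. The proposed behavior is an ordinary least-recently-seen (LRU-style) eviction built on `collections.OrderedDict`, assuming results are recorded in the order hosts check in.

```python
from collections import OrderedDict


class FirstSeenCache:
    """Hypothetical model of the behavior described above: keep results for
    the first `capacity` hosts and drop results from any later host."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.results = {}

    def record(self, host_id, rows):
        # Known hosts are refreshed; new hosts are accepted only below the cap.
        if host_id in self.results or len(self.results) < self.capacity:
            self.results[host_id] = rows


class RecentlySeenCache:
    """Hypothetical model of the proposal: keep results for the `capacity`
    most recently seen hosts; the least recently seen one falls off."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.results = OrderedDict()

    def record(self, host_id, rows):
        if host_id in self.results:
            self.results.move_to_end(host_id)  # mark as most recently seen
        self.results[host_id] = rows
        if len(self.results) > self.capacity:
            self.results.popitem(last=False)  # evict least recently seen host


first, recent = FirstSeenCache(2), RecentlySeenCache(2)
for host in ["host-a", "host-b", "host-c"]:
    first.record(host, [{"ok": 1}])
    recent.record(host, [{"ok": 1}])
print(list(first.results))   # ['host-a', 'host-b']
print(list(recent.results))  # ['host-b', 'host-c']
```

Both caches use the same bounded amount of storage; the proposed one simply spends it on the hosts that checked in most recently.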

noahtalerman commented 1 month ago

Hey @pintomi1989, any new info on the problem we're trying to solve w/ this one? And for what customer?

pintomi1989 commented 1 month ago

Hey @noahtalerman,

This was brought up as a nice-to-have feature by customer-starchik during this week's meeting. It's not a current blocker, but it would be nice to see critical policies updated on a more frequent cadence.