Allow critical policies to be recalculated on a different cadence

zhumo commented 1 year ago

Goal

User story
As a Fleet admin,
I want to recalculate a critical policy more frequently than normal policies
so that my security system can keep a closer eye on the critical policies, while minimizing overhead.

Allow users to set the recalculation frequency of critical policies separately from normal policies.
If not specified, critical policies use the same recalculation frequency as normal policies.
Allow setting this on a per-team basis.
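
A minimal sketch of the fallback rule above, not an actual Fleet implementation. The setting names (`policy_interval_s`, `critical_policy_interval_s`) and the precedence order (team critical, then global critical, then the normal interval) are assumptions made for illustration; the issue does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyIntervals:
    """Hypothetical settings; these names are illustrative, not Fleet config keys."""
    policy_interval_s: Optional[int] = None           # normal policies
    critical_policy_interval_s: Optional[int] = None  # critical policies


def effective_interval(
    is_critical: bool,
    global_cfg: PolicyIntervals,
    team_cfg: Optional[PolicyIntervals] = None,
) -> int:
    """Return the recalculation interval (seconds) for a policy."""
    team = team_cfg or PolicyIntervals()

    # Normal interval: a team value overrides the global one.
    normal = (
        team.policy_interval_s
        if team.policy_interval_s is not None
        else global_cfg.policy_interval_s
    )
    if normal is None:
        raise ValueError("a normal policy interval must be configured")
    if not is_critical:
        return normal

    # Critical interval: assumed precedence is team, then global.
    for candidate in (
        team.critical_policy_interval_s,
        global_cfg.critical_policy_interval_s,
    ):
        if candidate is not None:
            return candidate

    # Not specified anywhere: critical policies take the normal value.
    return normal
```

For example, with a global normal interval of 3600 seconds and a team that sets only a critical interval of 300 seconds, that team's critical policies resolve to 300 and its normal policies stay at 3600.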

Changes

This issue's estimation includes completing:

[ ] UI changes: TODO
[ ] CLI usage changes: TODO
[ ] REST API changes: TODO
[ ] Permissions changes: TODO
[ ] Database schema migrations: TODO
[ ] Outdated documentation changes: TODO
[ ] Scope transparency changes? TODO
[ ] Breaking changes requiring major version bump? TODO
[ ] Changes to paid features or tiers? TODO
[ ] QA complete?
[ ] ...

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

Requestor(s): _____

zayhanlon commented 1 year ago

Docs: https://fleetdm.com/docs/using-fleet/configuration-files#failing-policies-webhook

Interval:

zayhanlon commented 1 year ago

Interval change here and in config?

zayhanlon commented 1 year ago

Hey @ksatter. We're deprioritizing this issue as we won't be able to deliver it in the next 6 weeks. Please bring this back to the PFR call if it surfaces again so we can re-prioritize.

noahtalerman commented 1 year ago

Zay: High/medium priority

noahtalerman commented 1 year ago

Air guitar this one.

noahtalerman commented 1 year ago

@zayhanlon heads up, we brought this into the upcoming design sprint as an air guitar.

rachaelshaw commented 11 months ago

@zayhanlon this didn't make it into the sprint, bringing back to Feature Fest

noahtalerman commented 10 months ago

@noahtalerman need to chat w/ customer-schur and customer-ufa to better understand the problem.

@zayhanlon, heads up, this didn't make the 3-week drafting timeline so we're removing it from the drafting board. Bringing back to feature fest.

noahtalerman commented 10 months ago

Heads up @zayhanlon this request was discussed during feature fest last week and didn't make it into the current design sprint.

nonpunctual commented 8 months ago

@noahtalerman @marko-lisica would like to discuss this one if you get a chance to see why it was deprioritized. Too hard? Not enough viable use cases?

noahtalerman commented 8 months ago

@nonpunctual sounds good! Can you please add this as an agenda item to the next product office hours?

noahtalerman commented 8 months ago

Hey @ksatter and @nonpunctual, heads up, we didn't have room to take this one in the current design sprint (4.48).

nonpunctual commented 7 months ago

noahtalerman commented 7 months ago

@nonpunctual is that Slack thread regarding the customer that runs live queries to detect device health?

If so, I think they want to get fresh results every time the user tries to log in to Okta. If this is the case, and using live queries for this is painful, then I don't know if recalculating policies at a faster interval is the right solution.

Instead, maybe the device health API should fetch fresh results each time.

nonpunctual commented 7 months ago

@noahtalerman I am not sure. There are 6 customers attached to this issue all with various reasons for wanting the ability to set an "execution frequency" for policies. This is mostly because the customers know that running large queries on the same interval as everything else is potentially "painful" & that pain could be relieved by running their large queries less frequently.

This need is tied to having fresh data in Fleet instead of stopping after collecting data from the 1st 1000 Hosts Fleet sees. https://github.com/fleetdm/fleet/issues/397

That issue has been open for 2y.

nonpunctual commented 7 months ago

@noahtalerman @alexmitchelliii @ksatter I could be wrong in how I am understanding these customer requests but I don't think this is about getting faster results in Fleet.

It's about setting non-critical policies that collect "static" data to run on a slower interval which will reserve space for the cached data of critical policies that collect "dynamic" data.

Currently, the workarounds are to manually run the policy's query, run a live query, or put all policies on a shorter interval.

Admins want Host data to be up-to-date & easily accessible in the Fleet UI. That is the relationship between this ticket & #397 Expansion of Host Vitals. Customer-ufa said the UI for looking at the info for an individual host is ok but it's not useful for their Help Desk Fleet UI users because Fleet only caches data for the 1st 1000 hosts seen.

If something like a separate cadence for critical & non-critical policies were implemented the data for the critical policies could be fresh & could be limited to, e.g., 25 or 50 critical policies defined by the Fleet admin which would serve as a cap on the amount of cached data.

noahtalerman commented 7 months ago

> Customer-ufa said the UI for looking at the info for an individual host is ok but it's not useful for their Help Desk Fleet UI users because Fleet only caches data for the 1st 1000 hosts seen.

@nonpunctual this is great feedback. Thanks.

I think we're getting the cached query results and policy features mixed up here.

With the cached query results (what the customer is interested in), the Fleet admin can already set the frequency on a per query basis. Some queries can run every 5 minutes and other queries can run every hour.

This way, the Fleet admin can protect the performance of their devices (only run intensive queries every so often).

> It's about setting non-critical policies that collect "static" data to run on a slower interval which will reserve space for the cached data of critical policies that collect "dynamic" data.

If I'm understanding the above correctly, we just want Fleet to be able to get more data (more results) faster w/o having to worry about filling up the Fleet DB.

Ideally, the user doesn't have to worry about filling up the Fleet DB. I think we should try hard to make that Fleet's job. The customer can collect as much data as they need and as frequently as they need it.

nonpunctual commented 7 months ago

Not saying I am not confused. The thing I am trying to solve for is this:

Fleet caches data for the 1st 1000 Hosts it sees & stops caching after that.

What would be a lot more useful based on experience & lots of customer feedback is:

Fleet caches data for the most recently seen 1000 or 2000 or 5000 Hosts & lets the old data fall off the edge, like a Time Machine backup, e.g.
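
To make the difference concrete, here is a rough sketch of the "most recently seen N hosts" behavior described above. It is not how Fleet stores query results; the class and method names are hypothetical, and an in-memory `OrderedDict` stands in for whatever storage would actually be used.

```python
from collections import OrderedDict


class RecentHostCache:
    """Keeps results for the N most recently seen hosts (hypothetical sketch)."""

    def __init__(self, max_hosts: int) -> None:
        if max_hosts < 1:
            raise ValueError("max_hosts must be at least 1")
        self.max_hosts = max_hosts
        # host_id -> latest result, ordered from least to most recently seen
        self._results = OrderedDict()

    def record(self, host_id: str, result: dict) -> None:
        """Store a host's latest result and mark it as most recently seen."""
        self._results[host_id] = result
        self._results.move_to_end(host_id)
        # Let the oldest entries fall off once the cap is exceeded.
        while len(self._results) > self.max_hosts:
            self._results.popitem(last=False)

    def hosts(self) -> list:
        """Cached host IDs, least recently seen first."""
        return list(self._results)
```

With a cap of 2, recording hosts `a`, `b`, `a` again, then `c` leaves `a` and `c` cached: `b` is the least recently seen and drops out, instead of `c` being rejected because the cache is already full.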

noahtalerman commented 5 months ago

Hey @pintomi1989, any new info on the problem we're trying to solve w/ this one? And for what customer?

pintomi1989 commented 5 months ago

Hey @noahtalerman,

This was brought up as a nice-to-have feature by customer-starchik during this week's meeting. Not a current blocker, but it would be nice to see critical policies updated on a more prioritized cadence.
