fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Spread out calculation of policies on agents to improve performance when there are lots of policies #12429

Open sharon-fdm opened 1 year ago

sharon-fdm commented 1 year ago

This issue's remaining effort can be completed in ≤1 sprint. It will be valuable even if nothing else ships.

Goal

As a Fleet customer who uses Fleet policies, I would like Fleet to trigger policy queries on my agents in a way that uses each agent's resources sustainably, so that the agents stay healthy and do not crash.

Changes

This issue's estimation includes completing:

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

QA

Risk assessment

Risk level: Low / High TODO

Risk description: TODO

Automated:

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

sharon-fdm commented 1 year ago

cc: @zayhanlon @marcosd4h @lucasmrod @zhumo @zwass

zwass commented 1 year ago

One approach to look at might be making osquery take a break between executing distributed queries. IIRC there may actually be a flag that does something like this already?

lucasmrod commented 1 year ago

Yes: --table_delay. It applies to all queries, though, not just distributed ones, and the delay is added between table scans, so a query that uses multiple tables will be delayed longer.

https://github.com/osquery/osquery/blob/2e3495837d4fe4db3554c6ef76494bc86c74c099/osquery/sql/virtual_table.cpp#L28C1-L31

(PS: I found it while troubleshooting watchdog killing some of the macOS CIS queries. So this is on my list of things to try as a workaround.)
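
To make the trade-off above concrete, here is a rough Python sketch of how a per-scan delay could add up across policy queries. It is a simplified model and not osquery's implementation: it assumes one delay per table scan, the function name `added_delay` and the policy names are hypothetical, and the delay value is in whatever unit the flag uses.

```python
def added_delay(table_scans: int, delay_per_scan: float) -> float:
    """Extra wait for one query, ASSUMING one delay per table scan.

    Simplified model for illustration only; the real behavior of
    --table_delay is defined in osquery's virtual_table.cpp.
    """
    if table_scans < 0 or delay_per_scan < 0:
        raise ValueError("table_scans and delay_per_scan must be non-negative")
    return table_scans * delay_per_scan


# Hypothetical policy queries, mapped to how many table scans each performs.
policies = {
    "single_table_policy": 1,
    "join_policy": 3,
}

# Arbitrary example value, in the flag's own unit (an assumption here).
delay = 200

# Per-policy added delay, and the total if the policies run back to back.
totals = {name: added_delay(scans, delay) for name, scans in policies.items()}
total_added = sum(totals.values())
```

Under this model, the policy that scans three tables waits three times as long as the single-table one, which matches the point that multi-table queries are delayed longer.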