
Windows CIS benchmarks - Loadtest Windows Host #11939

Closed. marcosd4h closed this issue 1 year ago.

marcosd4h commented 1 year ago

Goal

Measure the performance of the orbit and osqueryd processes while they execute the policy queries required by the Windows CIS benchmarks.

The Fleet server should have the Windows CIS policies applied, and the Windows host should be enrolled in the fleet with Orbit installed.

KPIs to measure: CPU and memory usage of the orbit and osqueryd processes.

The procedure used to collect these results should be documented for future use.

xpkoala commented 1 year ago

@marcosd4h Have you run into any issues with the Windows host breaking under a large number of policies? https://github.com/fleetdm/fleet/issues/12031

marcosd4h commented 1 year ago

@xpkoala Not really; I ran the test with no issues. I'm looking into the logs provided in #12031 to see if I can find anything that explains the behavior you are seeing.

marcosd4h commented 1 year ago

I profiled the orbit and osqueryd processes using the tool added in PR #12090 and found that the CPU and memory KPIs are not affected while the Fleet server runs the policy queries. See the usage charts below.

[Chart: process_profile_cpu_orbit]

[Chart: process_profile_cpu_osqueryd]

[Chart: process_profile_memory_orbit]

[Chart: process_profile_memory_osqueryd]

marcosd4h commented 1 year ago

Below is a zoomed-in view of the worker osqueryd process's CPU consumption during policy execution. CPU consumption is about 12% for roughly 10 seconds.

[Chart: process_profile_cpu_11864_issue]

[Chart: process_profile_cpu_33068_issue]

marcosd4h commented 1 year ago

It is worth mentioning that the osquery watchdog was disabled so that the osqueryd and orbit processes could be profiled without any restarts (see issue #11797 for more details). The watchdog was disabled by adding --watchdog_level=-1 to the osquery.flags file.
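
For reference, here is how that line would sit in the flags file (a minimal sketch; any other flags in the real file are omitted):

```
# osquery.flags (sketch)
# A watchdog level of -1 disables the watchdog entirely, so the worker
# process is never restarted for exceeding CPU or memory limits.
--watchdog_level=-1
```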

zhumo commented 1 year ago

I took the estimate off and updated the description a bit.

We want to re-run the test, but this time identify which CIS queries are causing the high usage.

RachelElysia commented 1 year ago

@lucasmrod "Hard to tell with @marcosd4h tools to find which queries are the expensive ones, the script lucas wrote is for macos, see if it's able to use on windows which might be helpful to see the expensive ones"

Estimate: 3 pts (2 pts + 1 pt for updating the script so it can find expensive queries).

Potential engineers to assign this to: @juan-fdz-hawa (has a real computer instead of just a VM), @marcosd4h, @xpkoala (has a real computer instead of just a VM; needs a guide or an already-created script to help).

marcosd4h commented 1 year ago

@lucasmrod @RachelElysia @xpkoala @juan-fdz-hawa

Guys, here are two important comments about this issue:

  1. The customer reported that they are NOT using the CIS queries. The issue manifested with just the queries they have built over time. Please keep this in mind, as finding the expensive CIS queries might not be helpful here.
  2. The issue, IMO, is caused by the restrictive osquery watchdog constraints. These can be tweaked at installation time (I'm not sure whether they can also be tweaked at runtime). The watchdog triggers if CPU consumption stays above a predefined threshold for a period of time. osquery executes queries one after the other as they are received, and since we don't control the queries a customer might run on the endpoint, it would be good if these queries were scheduled at different times. See the flags sketch after this list.
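
If relaxing rather than disabling the watchdog is the goal, osquery exposes flags for these thresholds. The flags below are standard osquery options, but the values are illustrative assumptions, not ones taken from this test:

```
# osquery.flags (sketch; values are examples only)
--watchdog_level=0                 # normal watchdog behavior
--watchdog_utilization_limit=30    # allow higher sustained CPU before killing the worker
--watchdog_memory_limit=350        # worker memory ceiling, in MB
--watchdog_delay=120               # seconds after startup before limits are enforced
```
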
erikng commented 1 year ago

Chiming in, as I'm watching this one. We experienced the issues both with the CIS benchmarks and with our own queries on the macOS platform, though that was before the CIS query enhancements that I think are coming in the next release.

sharon-fdm commented 1 year ago

Sharon will modify the test code to print the highest resource-consuming policy (or policies); @xpkoala will run the tests on real Windows machines.

zhumo commented 1 year ago

@sharon-fdm Is this one something Reed would be working on now? If so, I'll put it in the estimated column.

sharon-fdm commented 1 year ago

@zhumo I still have some fixes to make before @xpkoala can run the tests. I may need to consult with @marcosd4h on that.

zhumo commented 1 year ago

OK, thanks. It was in "confirm and celebrate," but I suspected it shouldn't be, because there is still work to be done. So I've moved it to estimated.

sharon-fdm commented 1 year ago

@zhumo In order to test the load of individual queries, it was significantly easier to measure their run time, on the assumption that a high run time means a high load.
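
As a low-cost complement (an assumption on my part, not the script referenced above), osquery's built-in osquery_schedule table tracks cumulative runtime per scheduled query and can rank the expensive ones. Note that Fleet policies run as distributed queries rather than scheduled ones, so this covers only the schedule; timing policy queries still needs an external measurement like the one described here.

```sql
-- Sketch: rank scheduled queries by cumulative CPU time to find expensive ones.
-- osquery_schedule is a built-in osquery table; run via osqueryi or a live query.
SELECT name, executions, wall_time, user_time, system_time, average_memory
FROM osquery_schedule
ORDER BY user_time DESC
LIMIT 10;
```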

The results will be published by @xpkoala soon.

xpkoala commented 1 year ago

Because there might be sensitive information in the output, the files have been put into our drive account. I'm still trying to produce a cleaned-up file that will contain just each query that was run and the time it took to execute, but after a few perusals I do not believe any query took over 600 ms (0.6 seconds).

https://drive.google.com/file/d/1GU81JQ-Himebo51U4XGklQPmQYgidkME/view?usp=sharing, https://drive.google.com/file/d/1T2zwv5R271gEtjEma1qyy5YN_DNvEgaW/view?usp=sharing

sharon-fdm commented 1 year ago

Conclusion: there are no specifically problematic queries.

fleet-release commented 1 year ago

Orbit's dance with CIS,
In glass city memory, flows.
Performance blossoms.

erikng commented 1 year ago

Can this be re-opened until a sanitized version of the output is put here? I'd like to audit this personally before we enable this feature at scale.

zhumo commented 1 year ago

Hi @sharon-fdm @xpkoala, could you provide the sanitized data to Erik?

sharon-fdm commented 1 year ago

My apologies. We will provide more information soon and keep this open until then.

xpkoala commented 1 year ago

I apologize for closing this one out; I had mistaken it for another ticket. I'll attach a new set of logs here in the next 24 hours.

xpkoala commented 1 year ago

@erikng I appreciate your patience on this and I apologize that I don't have the final results for you with this update.

@lucasmrod and I are continuing to work on a test methodology that will more accurately represent your environment and provide detailed information to defend the results of the test.

We are planning to start a 24+ hour run tomorrow, which will allow us to give you results either late Friday or early next week. I'll keep this thread updated with our progress as we move forward.

erikng commented 1 year ago

Thanks Reed.

noahtalerman commented 1 year ago

Confirm and celebrate: @zhumo do we want docs for this? Does this PR cover it? https://github.com/fleetdm/fleet/pull/13799/files

zhumo commented 1 year ago

Yes, it does! Let's close this at the next C&C.

fleet-release commented 1 year ago

Benchmarks twirl like leaves,
Windows host in cloud city,
Memory breathes free.