marcosd4h closed this issue 1 year ago.
@marcosd4h Have you run into any issues with the Windows host breaking with large amounts of policies? https://github.com/fleetdm/fleet/issues/12031
@xpkoala Not really; I ran the test with no issues. I'm looking into the logs provided in #12031 to see if I can find anything that explains the behavior you are seeing.
I profiled the orbit and osqueryd processes using the tool added in PR12090 and found that the CPU and memory KPIs are not affected while the policy queries requested by the Fleet server are running. See the usage charts below.
[Charts: process_profile_cpu_orbit, process_profile_cpu_osqueryd, process_profile_memory_orbit, process_profile_memory_osqueryd]
Below is a zoomed-in view of the worker osqueryd's CPU consumption during policy execution. CPU consumption is about 12% for roughly 10 seconds.
It is worth mentioning that the osquery watchdog was disabled so that the osqueryd and orbit processes could be profiled without any restarts (see issue #11797 for more details). Below is the flag used in the osquery.flags file to disable the watchdog:
--watchdog_level=-1
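For reference, a minimal `osquery.flags` fragment for such a profiling run might look like the sketch below; only `--watchdog_level=-1` comes from this thread, and the comment is explanatory (gflags flagfiles accept `#` comments):

```
# Disable the osquery watchdog so worker processes are never restarted
# for exceeding resource limits while being profiled (see issue #11797).
--watchdog_level=-1
```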
I took the estimation off and updated the description a bit.
We want to re-run the test, but this time identify which CIS queries are causing the high usage.
@lucasmrod "It's hard to tell with @marcosd4h's tools which queries are the expensive ones. The script Lucas wrote is for macOS; see if it can be used on Windows, which might be helpful for finding the expensive ones."
Estimate: 3 pts (2 pts + 1 pt for updating the script so it can find expensive queries)
Potential engineers to assign this to: @juan-fdz-hawa (has a real computer instead of just a VM), @marcosd4h, @xpkoala (has a real computer instead of just a VM; needs a guide, or an already-created script, to help)
@lucasmrod @RachelElysia @xpkoala @juan-fdz-hawa
Guys, here are two important comments about this issue
Chiming in as I'm watching this one. We experience the issues both in the CIS benchmarks and in our own queries on the macOS platform. But that was before the CIS query enhancements that I think are coming next release.
Sharon will modify the test code to print the highest resource-consuming policy/policies; @xpkoala will run the tests on real Windows machines.
@sharon-fdm Is this one something Reed would be working on now? If so, I put it in the estimated column.
@zhumo I still have some fixes to make before @xpkoala can run the tests. I may need to consult with @marcosd4h for that.
OK, thanks. It was in "confirm and celebrate," but I suspected it should not be, because there is still work to be done. So I've moved it to estimated.
@zhumo In order to test the load of individual queries, it was significantly easier to measure each query's run time, assuming that a high run time would mean a high load.
The results will be published by @xpkoala soon.
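The timing approach described above can be sketched as a small harness; this is a hypothetical illustration, not the actual test code (the `rank_queries` name and the `osqueryi` invocation are assumptions):

```python
import subprocess
import time
from typing import Callable, Dict, List, Tuple

def rank_queries(queries: Dict[str, str],
                 run_query: Callable[[str], None],
                 repeats: int = 3) -> List[Tuple[str, float]]:
    """Time each named query and return (name, avg_seconds), slowest first."""
    timings = []
    for name, sql in queries.items():
        start = time.perf_counter()
        for _ in range(repeats):
            run_query(sql)
        timings.append((name, (time.perf_counter() - start) / repeats))
    return sorted(timings, key=lambda t: t[1], reverse=True)

def run_with_osqueryi(sql: str) -> None:
    """One way to execute a query: shell out to osqueryi (illustrative)."""
    subprocess.run(["osqueryi", "--json", sql], capture_output=True, check=True)
```

Any query whose average time stands out in the ranked output would be a candidate for optimization; per the results above, none exceeded roughly 600 ms.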
Because there might be sensitive information in the output, the files are stored in our Drive account. I'm still trying to produce a cleaned-up file containing only each query that was run and its execution time, but after a few passes through the data I do not believe any query took over 600 ms (0.6 seconds).
https://drive.google.com/file/d/1GU81JQ-Himebo51U4XGklQPmQYgidkME/view?usp=sharing, https://drive.google.com/file/d/1T2zwv5R271gEtjEma1qyy5YN_DNvEgaW/view?usp=sharing
Conclusion: there are no specifically problematic queries.
Orbit's dance with CIS, In glass city memory, flows. Performance blossoms.
Can this be re-opened until a sanitized version of the output is put here? I'd like to audit this personally before we enable this feature at scale.
Hi @sharon-fdm @xpkoala could you provide the sanitized data to Erik?
My apologies. We will provide more information soon and keep this open until then.
I apologize for closing this one out, I had mistaken it for another ticket. I'll attach a new set of logs here in the next 24 hours.
@erikng I appreciate your patience on this and I apologize that I don't have the final results for you with this update.
@lucasmrod and I are continuing to work on a test methodology that will more accurately represent your environment and provide detailed information to defend the results of the test.
We are planning on starting a 24+ hour run tomorrow which will allow us to give you results either late Friday or early next week. I'll keep this thread updated with our progress as we move forward.
Thanks Reed.
Confirm and celebrate: @zhumo do we want docs for this? Does this PR cover it? https://github.com/fleetdm/fleet/pull/13799/files
Yes it does! Let's close at the next C&C
Benchmarks twirl like leaves, Windows host in cloud city, Memory breathes free.
Goal
Measure the application performance of orbit/osqueryd while executing the policy queries required by the Windows CIS benchmarks.

The Fleet server should have the Windows CIS policies applied, and the Windows host should be enrolled in the fleet with Orbit installed.

KPIs to measure:
- CPU performance
- Memory usage

Identify which CIS benchmark queries consume significant memory and require optimization.
The procedure used to collect these results should be documented for future use.
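As part of documenting the procedure, the per-process sampling can be sketched as below. This is a minimal Linux-only illustration reading `/proc` (the actual PR12090 tool is not shown here); on a Windows host the same KPIs would instead come from performance counters (e.g. `Get-Counter` in PowerShell):

```python
import os
import time

def sample_proc(pid: int) -> dict:
    """Read cumulative CPU ticks and resident memory for one process (Linux)."""
    with open(f"/proc/{pid}/stat") as f:
        # Fields after the "(comm)" entry; utime/stime are at offsets 11/12.
        fields = f.read().rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    with open(f"/proc/{pid}/statm") as f:
        rss_pages = int(f.read().split()[1])  # resident set size, in pages
    return {
        "cpu_ticks": utime + stime,
        "rss_bytes": rss_pages * os.sysconf("SC_PAGE_SIZE"),
    }

def cpu_percent(pid: int, interval: float = 1.0) -> float:
    """Approximate CPU %% over an interval from two /proc samples."""
    before = sample_proc(pid)["cpu_ticks"]
    time.sleep(interval)
    after = sample_proc(pid)["cpu_ticks"]
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    return 100.0 * (after - before) / (interval * hz)
```

Sampling the orbit and osqueryd PIDs in a loop with this approach would yield the CPU and memory time series shown in the charts above.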