Open xpkoala opened 2 months ago
After removing some of the large saved queries the cpu usage did begin to drop.
Discussed during standup on Sept 5th. Since we currently have no users (as far as we are aware) with a similarly set up environment, we are reducing the priority of this as an ~unreleased bug.
This does warrant further investigation.
@xpkoala todo: scale up saved queries once more to achieve 100% cpu utilization and view results when:
remove all large queries and add 100-200 smaller queries (scheduled : 1hr; query reporting enabled) and view results when:
Recommendation from a customer when debugging a profile in their environment and seeing high encoding/json cpu time:
have you tried swapping encoding/json for goccy/go-json?
Also a separate recommendation for general performance from the same customer:
ahhh, and prom hooked into http funcs. have you tried setting GOGC really high and setting GOMEMLIMIT? we had an issue with prom metrics and it constantly alloc-ing to the point it hurt performance, GOGC=2000 and GOMEMLIMIT (to the level I expected) basically fixed the problem
Should we consider these settings for environments and auto-allocation in our terraform modules/cloud deployments?
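For reference, the two settings from the recommendation can be set through the `GOGC` and `GOMEMLIMIT` environment variables (for example `GOGC=2000 GOMEMLIMIT=4GiB`) or programmatically via `runtime/debug`. A minimal sketch of the programmatic form follows; the 4 GiB limit and the `applyGCTuning` helper are assumptions for illustration, and the right limit depends on the memory actually allocated to each instance.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCTuning (hypothetical helper) sets the GC target percentage and the
// soft memory limit, returning the previous values so they can be restored.
func applyGCTuning(gcPercent int, memLimitBytes int64) (prevPercent int, prevLimit int64) {
	// Equivalent to GOGC: a higher value makes the collector run less often.
	prevPercent = debug.SetGCPercent(gcPercent)
	// Equivalent to GOMEMLIMIT: a soft cap that makes the collector work
	// harder as the heap approaches the limit.
	prevLimit = debug.SetMemoryLimit(memLimitBytes)
	return prevPercent, prevLimit
}

func main() {
	const memLimit = 4 << 30 // assumed 4 GiB; size this to the instance

	prevPercent, prevLimit := applyGCTuning(2000, memLimit)
	fmt.Printf("previous GC percent: %d, previous memory limit: %d bytes\n", prevPercent, prevLimit)
}
```

Setting a high `GOGC` without a memory limit lets the heap grow much larger between collections, which is why the two are usually set together.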
@rfairburn Any chance this is contributing to the load issues you're seeing today? If so, I'll prioritize so it gets looked at.
After running through the above scenarios it seems like the main culprit is the total character counts across all queries. A large number of queries with small character counts caused similar performance degradation to a small group of queries with very high character counts.
Adding 10 queries that have large character counts (1000+ characters) caused average cpu utilization to jump from ~75% >> ~85%. Performance across the entire app is sluggish and adding new queries can cause loadtest instances to crash and restart.
Turning off query reporting and removing scheduling from these queries had no effect on the performance degradation. And removing this set of 10 queries caused average cpu utilization to return to the ~75% mark.
The loadtest environment was started with roughly 500 small queries, most in the form of `select * from osquery_info`. After removing 200 of these queries cpu utilization dropped a further 10% from ~75% >> ~65%.
Toggling query reports and adding and removing scheduling from these queries did not appear to have any effect on the cpu utilization.
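The observation about total character count can be illustrated with a rough sketch: the size of a JSON-encoded query list grows with the sum of the query text lengths, whether that text is spread over many small queries or a few large ones. The `savedQuery` struct and both helpers are hypothetical stand-ins and not Fleet's actual types; the query text is placeholder filler.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// savedQuery is a hypothetical stand-in for a saved query record.
type savedQuery struct {
	ID    int    `json:"id"`
	Name  string `json:"name"`
	Query string `json:"query"`
}

// makeQueries builds n queries whose text is sqlLen characters of filler.
func makeQueries(n, sqlLen int) []savedQuery {
	qs := make([]savedQuery, n)
	for i := range qs {
		qs[i] = savedQuery{
			ID:    i,
			Name:  fmt.Sprintf("q%d", i),
			Query: strings.Repeat("x", sqlLen),
		}
	}
	return qs
}

// encodedSize returns the number of bytes produced by JSON-encoding qs.
func encodedSize(qs []savedQuery) (int, error) {
	b, err := json.Marshal(qs)
	if err != nil {
		return 0, err
	}
	return len(b), nil
}

func main() {
	// Both sets hold 13,000 characters of query text in total.
	many, _ := encodedSize(makeQueries(500, 26)) // many small queries
	few, _ := encodedSize(makeQueries(10, 1300)) // a few large queries
	fmt.Printf("500 small queries: %d bytes, 10 large queries: %d bytes\n", many, few)
}
```

This only measures payload size, not cpu time, so it is a proxy for the encode cost seen in the profile rather than a reproduction of it.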
cc @sharon-fdm
Thanks @xpkoala. What would be the conclusion here then?
We should investigate optimizations when hitting the queries api.
Since all queries are currently being returned, this is most likely what causes the performance issues when there is a large amount of content to load (whether from many smaller queries or a handful of larger ones).
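One possible direction, sketched here under assumptions and not as a description of how the queries api currently behaves: return a page of lightweight summaries for list views and load the full query text only when a single query is opened. The `querySummary` type and `pageOf` helper are hypothetical.

```go
package main

import "fmt"

// querySummary is a hypothetical list-view record that leaves out the SQL text.
type querySummary struct {
	ID   int
	Name string
}

// pageOf returns the zero-indexed page of items with at most perPage entries,
// or nil when the arguments are invalid or the page is past the end.
func pageOf[T any](items []T, page, perPage int) []T {
	if page < 0 || perPage <= 0 {
		return nil
	}
	start := page * perPage
	if start >= len(items) {
		return nil
	}
	end := start + perPage
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

func main() {
	all := make([]querySummary, 0, 500)
	for i := 0; i < 500; i++ {
		all = append(all, querySummary{ID: i, Name: fmt.Sprintf("query-%d", i)})
	}

	// Only 20 summaries are encoded per request instead of all 500 records.
	first := pageOf(all, 0, 20)
	fmt.Println(len(first), "summaries on the first page")
}
```

In practice the slicing would more likely happen in the database query than in memory; the sketch only shows the shape of the response.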
This is a performance-improvement ticket. It could be picked up by on-call engineers or prioritized.
Fleet version:
minor-fleet-v4.56.0
Web browser and operating system: n/a
💥 Actual behavior
Seeing cpu usage maxed out on the loadtest environment. The web interface is still accessible, though a little sluggish. Queries of this size were not present on previous loadtest environments.
🧑‍💻 Steps to reproduce
🕯️ More info (optional)
Debug archive for the issue can be found here
pprof chart showing the long json encode status