fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Investigate json encode operation on large loadtest environment #21847

Open xpkoala opened 2 months ago

xpkoala commented 2 months ago

Fleet version: minor-fleet-v4.56.0

Web browser and operating system: n/a


💥  Actual behavior

CPU usage is maxed out in the loadtest environment. The web interface is still accessible, though a little sluggish. Queries of this size were not present in previous loadtest environments.

🧑‍💻  Steps to reproduce

  1. Default container size (per loadtest instructions) for fleet and loadtest containers
  2. 30 fleet containers
  3. 200 loadtest containers
  4. Create 10 queries with ~15k characters in each statement; set them to run once an hour; logging collected (a rough sketch of creating these via the API follows this list)
  5. View health metrics for the fleet containers
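For reference, a rough sketch of how step 4 could be scripted against Fleet's REST API. The endpoint path, auth header, and body fields below are assumptions for illustration (check the API docs for the exact shape of the create-query request), and the server URL and token are placeholders:

```go
package main

// Sketch only: creates 10 queries whose SQL statements are ~15k characters each.
// Endpoint path and request fields are assumed, not verified against this Fleet version.
import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	const fleetURL = "https://fleet.example.com" // hypothetical server
	const apiToken = "<api token>"               // hypothetical token

	// Pad a trivial statement out to ~15k characters with a SQL comment.
	padding := "/* " + strings.Repeat("x", 15_000) + " */"

	for i := 0; i < 10; i++ {
		body, _ := json.Marshal(map[string]any{
			"name":     fmt.Sprintf("loadtest-large-query-%d", i),
			"query":    "select * from osquery_info " + padding,
			"interval": 3600, // run once an hour
			"logging":  "snapshot",
		})
		req, _ := http.NewRequest(http.MethodPost, fleetURL+"/api/v1/fleet/queries", bytes.NewReader(body))
		req.Header.Set("Authorization", "Bearer "+apiToken)
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
}
```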

🕯️ More info (optional)

Debug archive for the issue can be found here

pprof chart showing the long json encode time:

xpkoala commented 2 months ago

After removing some of the large saved queries the cpu usage did begin to drop.

Discussed during standup on Sept 5th. Since we currently have no users (as far as we are aware) with a similarly set up environment, we are reducing the priority of this as an ~unreleased bug.

This does warrant further investigation.

@xpkoala todo: scale up saved queries once more to achieve 100% cpu utilization, then view results in the following scenarios:

  1. Saved queries scaled up once more to reach 100% cpu utilization
  2. All large queries removed and 100-200 smaller queries added (scheduled: 1hr; query reporting enabled)

rfairburn commented 2 months ago

Recommendation from a customer when debugging a profile in their environment and seeing high encoding/json cpu time:

have you tried swapping encoding/json for goccy/go-json?
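For what it's worth, goccy/go-json advertises an API compatible with encoding/json, so the swap is mostly an import change. A minimal sketch of what that could look like in a handler (the types and route below are simplified stand-ins, not Fleet's actual code):

```go
package main

import (
	"log"
	"net/http"

	json "github.com/goccy/go-json" // drop-in alias; was "encoding/json"
)

// query is a simplified stand-in for the objects returned by the queries API.
type query struct {
	ID    uint   `json:"id"`
	Name  string `json:"name"`
	Query string `json:"query"`
}

func listQueries(w http.ResponseWriter, r *http.Request) {
	queries := []query{{ID: 1, Name: "example", Query: "select * from osquery_info"}}
	w.Header().Set("Content-Type", "application/json")
	// Same Encoder/Marshal API as encoding/json, so call sites stay unchanged.
	if err := json.NewEncoder(w).Encode(queries); err != nil {
		log.Println("encode:", err)
	}
}

func main() {
	http.HandleFunc("/queries", listQueries)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```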

Also a separate recommendation for general performance from the same customer:

ahhh, and prom hooked into http funcs. have you tried setting GOGC really high and setting GOMEMLIMIT? we had an issue with prom metrics and it constantly alloc-ing to the point it hurt performance, GOGC=2000 and GOMEMLIMIT (to the level I expected) basically fixed the problem
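For context, those two knobs can be set as environment variables on the Fleet containers (e.g. GOGC=2000 and GOMEMLIMIT=4GiB) or programmatically at startup. The sketch below shows the runtime/debug equivalents; the 4 GiB limit is an arbitrary example value, not a recommendation:

```go
package main

import "runtime/debug"

func init() {
	// Equivalent of GOGC=2000: let the heap grow to roughly 20x the live set
	// between collections, trading memory for far fewer GC cycles and less GC CPU time.
	debug.SetGCPercent(2000)

	// Equivalent of GOMEMLIMIT: a soft memory ceiling so the relaxed GC target
	// above cannot push the process past what its container actually has.
	debug.SetMemoryLimit(4 << 30) // 4 GiB, example value only
}

func main() {
	// ... rest of the server startup
}
```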

Should we consider these settings for environments and auto-allocation in our terraform modules/cloud deployments?

lukeheath commented 2 months ago

@rfairburn Any chance this is contributing to the load issues you're seeing today? If so, I'll prioritize so it gets looked at.

xpkoala commented 2 weeks ago

After running through the above scenarios it seems like the main culprit is the total character counts across all queries. A large number of queries with small character counts caused similar performance degradation to a small group of queries with very high character counts.

Adding 10 queries that have large character counts (1000+ characters) caused average cpu utilization to jump from ~75% >> ~85%. Performance across the entire app is sluggish and adding new queries can cause loadtest instances to crash and restart.

Turning off query reporting and removing scheduling from these queries had no effect on the performance degradation. Removing this set of 10 queries caused average cpu utilization to return to the ~75% mark.

The loadtest environment was started with roughly 500 small queries, most in the form of `select * from osquery_info`. After removing 200 of these queries, cpu utilization dropped a further 10% from ~75% >> ~65%.

Toggling query reports and adding and removing scheduling from these queries did not appear to have any effect on the cpu utilization.

cc @sharon-fdm

sharon-fdm commented 2 weeks ago

Thanks @xpkoala. What would be the conclusion here then?

xpkoala commented 2 weeks ago

We should investigate optimizations when hitting the queries API.

Since all queries are currently being returned, this is most likely what causes the performance issues when there is a large amount of content to load (whether from many smaller queries or a handful of larger ones).
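One way to sanity-check that conclusion would be a micro-benchmark comparing encode cost for many small queries versus a few large ones with the same total character count. The struct and sizes below are illustrative, not Fleet's actual types:

```go
package main

import (
	"encoding/json"
	"strings"
	"testing"
)

// query is an illustrative stand-in for the objects the queries API returns.
type query struct {
	ID    uint   `json:"id"`
	Name  string `json:"name"`
	Query string `json:"query"`
}

// makeQueries builds n queries whose statements are each size characters long,
// so the total payload is roughly n*size regardless of how it is split up.
func makeQueries(n, size int) []query {
	stmt := strings.Repeat("x", size)
	qs := make([]query, n)
	for i := range qs {
		qs[i] = query{ID: uint(i), Name: "q", Query: stmt}
	}
	return qs
}

func benchEncode(b *testing.B, qs []query) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(qs); err != nil {
			b.Fatal(err)
		}
	}
}

// Both cases marshal ~150k characters of SQL; if total content size is what
// matters, their per-op cost should be similar.
func BenchmarkEncodeFewLarge(b *testing.B)  { benchEncode(b, makeQueries(10, 15_000)) }
func BenchmarkEncodeManySmall(b *testing.B) { benchEncode(b, makeQueries(500, 300)) }
```

Run with `go test -bench . -benchmem` to compare per-op time and allocations between the two shapes.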

sharon-fdm commented 4 days ago

This is a performance-improvement ticket. It could be picked up by on-call engineers or prioritized.