fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Investigate json encode operation on large loadtest environment #21847

Open xpkoala opened 1 month ago

xpkoala commented 1 month ago

Fleet version: minor-fleet-v4.56.0

Web browser and operating system: n/a


💥  Actual behavior

CPU usage is maxed out on the loadtest environment. The web interface is still accessible, though a little sluggish. Queries of this size were not present in previous loadtest environments.

🧑‍💻  Steps to reproduce

  1. Default container size (per loadtest instructions) for fleet and loadtest containers
  2. 30 fleet containers
  3. 200 loadtest containers
  4. Create 10 queries whose statements include 15k characters; set them to run once an hour with logging collected
  5. View health metrics for the fleet containers

🕯️ More info (optional)

Debug archive for the issue can be found here

pprof chart showing the long json encode status

xpkoala commented 1 month ago

After removing some of the large saved queries the cpu usage did begin to drop.

Discussed during standup on Sept 5th. Since we currently have no users (as far as we are aware) with a similarly set up environment, we are reducing the priority of this and treating it as an ~unreleased bug.

This does warrant further investigation.

@xpkoala todo: scale up saved queries once more to achieve 100% cpu utilization and view results when:

remove all large queries and add 100-200 smaller queries (scheduled : 1hr; query reporting enabled) and view results when:

rfairburn commented 1 month ago

Recommendation from a customer when debugging a profile in their environment and seeing high encoding/json cpu time:

have you tried swapping encoding/json for goccy/go-json?

Also a separate recommendation for general performance from the same customer:

ahhh, and prom hooked into http funcs. have you tried setting GOGC really high and setting GOMEMLIMIT? we had an issue with prom metrics and it constantly alloc-ing to the point it hurt performance, GOGC=2000 and GOMEMLIMIT (to the level I expected) basically fixed the problem

Should we consider these settings for environments and auto-allocation in our terraform modules/cloud deployments?
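For reference, the two settings mentioned above are normally supplied as environment variables (for example `GOGC=2000` and `GOMEMLIMIT=3GiB`), but the Go runtime also exposes them programmatically through `runtime/debug`. The sketch below shows that equivalent. The helper name `applyGCTuning` and the 3 GiB limit are assumptions for illustration; an appropriate limit depends on the container's memory allocation.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCTuning sets the GC target percentage (the GOGC equivalent) and the
// soft memory limit (the GOMEMLIMIT equivalent, Go 1.19 or later). It
// returns the values that were in effect before the call.
func applyGCTuning(gcPercent int, memLimitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(gcPercent)
	prevLimit = debug.SetMemoryLimit(memLimitBytes)
	return prevPercent, prevLimit
}

func main() {
	// 3 GiB is a placeholder, not a recommendation. The limit should sit
	// below the memory actually available to the container.
	const memLimit = 3 << 30

	prevPercent, prevLimit := applyGCTuning(2000, memLimit)
	fmt.Printf("GC percent: %d -> 2000\n", prevPercent)
	fmt.Printf("memory limit: %d -> %d bytes\n", prevLimit, int64(memLimit))
}
```

With a high GC percentage the collector runs far less often, and the memory limit acts as the backstop that triggers collection before the process outgrows its allocation. Setting the environment variables in the deployment config achieves the same effect without code changes.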

lukeheath commented 1 month ago

@rfairburn Any chance this is contributing to the load issues you're seeing today? If so, I'll prioritize so it gets looked at.