Closed Alb0t closed 1 year ago
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz x2, 112 threads 756G memory
This is likely your problem. Because you have such a huge server, Go is trying to parallelize everything over every CPU on your system simultaneously. This is likely to cause a lot of CPU cache thrashing.
I would recommend setting the GOMAXPROCS=8 env var to constrain Go to only 8 CPUs at a time. This way the probe goroutines will be cooperatively scheduled onto fewer POSIX threads. This could of course bottleneck things, so you should check process_cpu_seconds_total to make sure you're not maxed out.
Changing timeout from 1s to 10s halves the CPU usage from ~1600% to ~700%.
There is no timeout in the prober, since there is no support for per-packet timeouts in the ping library yet. Did you mean interval?
Yes, interval. You nailed the cause. Setting GOMAXPROCS does the trick, and CPU util is now reported as 45%. If I set this to 1 I don't think I'm getting contention.
rate(process_cpu_seconds_total{job="smokeping",instance="bleh9374"}[$__rate_interval])*100
is returning about 44%, which matches the value ps reports.
Appreciate your help on this!
Holy crap!
Machine:
Smokeping configuration:
What I've tried so far:
GOOS=linux GOARCH=amd64 go build, copied to host, saw the same issue. Performed a new trace. Also attempted to run go get -u ./... and saw the same behavior.
Files: pprofs.tar.gz
For now I have changed the interval from 1s to 5s and set GOGC to 900, which brought CPU util from 1600% to 90%.