SuperQ / smokeping_prober

Prometheus style smokeping
Apache License 2.0

1,583% CPU Usage- Heavy lock contention #112

Closed Alb0t closed 1 year ago

Alb0t commented 1 year ago
# ps aux|grep smokep|grep -v grep
root      94120 1583  0.0 737732 14968 ?        Ssl  15:52  46:41 /usr/local/bin/smokeping_prober --config.file=/etc/smokeping-exporter/config.yml --buckets=0.0001,0.0002,0.0004

Holy crap!

# smokeping_prober --version
smokeping_prober, version 0.6.1 (branch: HEAD, revision: 8434fd2b1a5584f67f6a5efe70cd851199e7882e)
  build user:       root@b84453c710ff
  build date:       20220608-13:26:13
  go version:       go1.18.3
  platform:         linux/amd64

Machine:

Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz x2, 112 threads
756G memory
Ubuntu18.04, 5.4.0-131-generic
Bare metal.

Smokeping configuration:

---
targets:
- hosts:
    - serverca02.blahblah.net
    - serverca04.blahblah.net
    - serverca08.blahblah.net
    - serverca10.blahblah.net
    - serverca12.blahblah.net
    - serverca14.blahblah.net
    - serverca16.blahblah.net
    - serverca18.blahblah.net
    - serverca20.blahblah.net
    - serverca22.blahblah.net
    - serverca24.blahblah.net
    - serverca26.blahblah.net
    - serverca28.blahblah.net
    - serverca30.blahblah.net
    - serverca32.blahblah.net
    - serverca34.blahblah.net
    - serverca36.blahblah.net
    - servercb30.blahblah.net
    - servercb31.blahblah.net
    - servercb32.blahblah.net
    - servercb33.blahblah.net
    - servercc30.blahblah.net
    - servercc31.blahblah.net
    - servercc32.blahblah.net
    - servercc33.blahblah.net
    - servercd30.blahblah.net
    - servercd31.blahblah.net
    - servercd32.blahblah.net
    - servercd33.blahblah.net
    - serverce30.blahblah.net
    - serverce31.blahblah.net
    - serverce32.blahblah.net
    - serverce33.blahblah.net
    - servercf03.blahblah.net
    - servercf04.blahblah.net
    - servercf05.blahblah.net
    - servercf06.blahblah.net
    - servercf07.blahblah.net
    - servercf08.blahblah.net
    - servercf09.blahblah.net
    - servercf10.blahblah.net
    - servercf11.blahblah.net
    - servercf12.blahblah.net
    - servercf13.blahblah.net
    - servercf14.blahblah.net
    - servercf15.blahblah.net
    - servercf16.blahblah.net
    - servercf17.blahblah.net
    - servercf18.blahblah.net
    - servercf19.blahblah.net
    - servercf20.blahblah.net
    - servercf21.blahblah.net
    - servercf22.blahblah.net
    - servercf23.blahblah.net
    - servercf24.blahblah.net
    - servercf25.blahblah.net
    - servercf26.blahblah.net
    - servercg03.blahblah.net
    - servercg04.blahblah.net
    - servercg05.blahblah.net
    - servercg06.blahblah.net
    - servercg07.blahblah.net
    - servercg08.blahblah.net
    - servercg09.blahblah.net
    - servercg10.blahblah.net
    - servercg11.blahblah.net
    - servercg12.blahblah.net
    - servercg13.blahblah.net
    - servercg14.blahblah.net
    - servercg15.blahblah.net
    - servercg16.blahblah.net
    - servercg17.blahblah.net
    - servercg18.blahblah.net
    - servercg19.blahblah.net
    - servercg20.blahblah.net
    - servercg21.blahblah.net
    - servercg22.blahblah.net
    - servercg23.blahblah.net
    - servercg24.blahblah.net
    - servercg25.blahblah.net
    - servercg26.blahblah.net
    - serverda02.blahblah.net
    - serverda04.blahblah.net
    - serverda06.blahblah.net
    - serverda08.blahblah.net
    - serverda10.blahblah.net
    - serverda12.blahblah.net
    - serverda14.blahblah.net
    - serverda16.blahblah.net
    - serverda18.blahblah.net
    - serverda20.blahblah.net
    - serverda22.blahblah.net
    - serverda24.blahblah.net
    - serverda26.blahblah.net
    - serverda28.blahblah.net
    - serverda30.blahblah.net
    - serverda32.blahblah.net
    - serverda34.blahblah.net
    - serverda36.blahblah.net
    - serverdb02.blahblah.net
    - serverdb04.blahblah.net
    - serverdb06.blahblah.net
    - serverdb08.blahblah.net
    - serverdb10.blahblah.net
    - serverdb12.blahblah.net
    - serverdb14.blahblah.net
    - serverdb16.blahblah.net
    - serverdb18.blahblah.net
    - serverdb20.blahblah.net
    - serverdb22.blahblah.net
    - serverdb24.blahblah.net
    - serverdb26.blahblah.net
    - serverdb28.blahblah.net
    - serverdb30.blahblah.net
    - serverdb32.blahblah.net
    - serverdb34.blahblah.net
    - serverdb36.blahblah.net
    - serverdc02.blahblah.net
    - serverdc04.blahblah.net
    - serverdc06.blahblah.net
    - serverdc08.blahblah.net
    - serverdc10.blahblah.net
    - serverdc12.blahblah.net
    - serverdc14.blahblah.net
    - serverdc16.blahblah.net
    - serverdc18.blahblah.net
    - serverdc20.blahblah.net
    - serverdc22.blahblah.net
    - serverdc24.blahblah.net
    - serverdc26.blahblah.net
    - serverdc28.blahblah.net
    - serverdc30.blahblah.net
    - serverdc32.blahblah.net
    - serverdc34.blahblah.net
    - serverdc36.blahblah.net
    - serverdh02.blahblah.net
    - serverdh04.blahblah.net
    - serverdh06.blahblah.net
    - serverdh08.blahblah.net
    - serverdh10.blahblah.net
    - serverdh12.blahblah.net
    - serverdh14.blahblah.net
    - serverdh16.blahblah.net
    - serverdh18.blahblah.net
    - serverdh20.blahblah.net
    - serverdh22.blahblah.net
    - serverdh24.blahblah.net
    - serverdh26.blahblah.net
    - serverdh28.blahblah.net
    - serverdh30.blahblah.net
    - serverdh32.blahblah.net
    - serverdh34.blahblah.net
    - serverdh36.blahblah.net
    - serverdi02.blahblah.net
    - serverdi04.blahblah.net
    - serverdi06.blahblah.net
    - serverdi08.blahblah.net
    - serverdi10.blahblah.net
    - serverdi12.blahblah.net
    - serverdi14.blahblah.net
    - serverdi16.blahblah.net
    - serverdi18.blahblah.net
    - serverdi22.blahblah.net
    - serverdi24.blahblah.net
    - serverdi26.blahblah.net
    - serverdi28.blahblah.net
    - serverdi30.blahblah.net
    - serverdi32.blahblah.net
    - serverdi34.blahblah.net
    - serverdi36.blahblah.net
    - serverdj04.blahblah.net
    - serverdj06.blahblah.net
    - serverdj08.blahblah.net
    - serverdj10.blahblah.net
    - serverdj12.blahblah.net
    - serverdj14.blahblah.net
    - serverdj16.blahblah.net
    - serverdj18.blahblah.net
    - serverdj22.blahblah.net
    - serverdj24.blahblah.net
    - serverdj26.blahblah.net
    - serverdj28.blahblah.net
    - serverdj30.blahblah.net
    - serverdj32.blahblah.net
    - serverdj34.blahblah.net
    - serverdj36.blahblah.net
  interval: 1s # Duration, Default 1s.
  network: ip # One of ip, ip4, ip6. Default: ip (automatic IPv4/IPv6)
  protocol: icmp # One of icmp, udp. Default: icmp (Requires privileged operation). 
  size: 56 # Packet data size in bytes. Default 56 (Range: 24 - 65535)

What I've tried so far:

Files: pprofs.tar.gz

For now I have raised the interval from 1s to 5s and set GOGC to 900, which brought CPU utilization from 1600% to 90%.
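For readers unfamiliar with GOGC: it is the Go runtime's garbage-collection target percentage, and a value of 900 lets the heap grow to roughly ten times the live heap before the next collection, trading memory for fewer GC cycles. The sketch below shows the programmatic equivalent using `runtime/debug.SetGCPercent`. It is not taken from smokeping_prober, which simply inherits GOGC from the environment; the helper name `applyGCPercent` is made up for illustration.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCPercent is a hypothetical helper: it sets the garbage-collection
// target percentage (the same knob as the GOGC environment variable) and
// returns the value that was in effect before the call.
func applyGCPercent(percent int) int {
	return debug.SetGCPercent(percent)
}

func main() {
	// Equivalent in effect to starting the process with GOGC=900.
	prev := applyGCPercent(900)
	fmt.Printf("GC percent changed from %d to 900\n", prev)
}
```

Setting the environment variable is usually the simpler route for an existing binary, since it needs no code change.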

SuperQ commented 1 year ago

> Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz x2, 112 threads 756G memory

This is likely your problem. Because you have such a huge server, Go is trying to parallelize everything over every CPU on your system simultaneously. This is likely to cause a lot of CPU cache thrashing.

I would recommend trying something like setting the GOMAXPROCS=8 environment variable to constrain Go to only 8 CPUs at a time. This way each of the probe goroutines will use Go's cooperative multi-tasking on fewer POSIX threads. This of course could bottleneck things, so you should check process_cpu_seconds_total to make sure you're not maxed out.

> Changing timeout from 1s to 10s halves the CPU usage from ~1600% to ~700%.

There is no timeout in the prober, since there is no support for per-packet timeouts in the ping library yet. Did you mean interval?

Alb0t commented 1 year ago

Yes, interval. You nailed the cause. Setting GOMAXPROCS does the trick, and CPU utilization is now reported as 45%. If I set this to 1, I don't think I'm getting contention. rate(process_cpu_seconds_total{job="smokeping",instance="bleh9374"}[$__rate_interval])*100 returns about 44%, which matches the value reported by ps.
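The arithmetic behind that query can be sketched in Go: the per-second increase of the process_cpu_seconds_total counter, multiplied by 100, is the CPU usage as a percentage of one core. This is a simplified illustration only. The `sample` type, the `cpuPercent` function, and the numbers in `main` are made up, and the real PromQL `rate()` also handles counter resets and extrapolates to the window boundaries.

```go
package main

import "fmt"

// sample is a hypothetical scrape of process_cpu_seconds_total.
type sample struct {
	unixSeconds float64 // scrape timestamp
	cpuSeconds  float64 // counter value at that time
}

// cpuPercent returns the average CPU usage between two samples as a
// percentage of a single core (100 means one core fully busy).
// It returns 0 if the samples are not in increasing time order.
func cpuPercent(first, last sample) float64 {
	elapsed := last.unixSeconds - first.unixSeconds
	if elapsed <= 0 {
		return 0
	}
	return (last.cpuSeconds - first.cpuSeconds) / elapsed * 100
}

func main() {
	// Illustrative values: 26.5 CPU-seconds consumed over 60 wall-clock seconds.
	first := sample{unixSeconds: 0, cpuSeconds: 100}
	last := sample{unixSeconds: 60, cpuSeconds: 126.5}
	fmt.Printf("%.1f%% of one core\n", cpuPercent(first, last))
}
```

A result well below GOMAXPROCS × 100 suggests the process is not bottlenecked on the CPUs it is allowed to use.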


Appreciate your help on this!