gebn / bmc_exporter

Exposes Baseboard Management Controller data in Prometheus format.
GNU Lesser General Public License v3.0

Goroutine leak(s) #34

Closed (closed by gebn 4 years ago)

gebn commented 4 years ago

One leak is known to occur when requests are abandoned (K8s ingress restart).

Another only reveals itself when a single exporter is running.

gebn commented 4 years ago

Pprof to the rescue. There are at least two leaks.

The first was a large number of goroutines stuck in ServeHTTP(), each blocked sending a scrape request to the Target's channel. Grepping for the pointer showed that these goroutines belonged to a small number of targets whose BMCs were simply slow. Running a single exporter created more contention, exacerbating the problem. The underlying issue is that the send has no way to terminate, even if Prometheus abandons the scrape. Interestingly, no goroutine was more than 45 minutes old; either this reflects the size of the backlog, or there is some timeout (I presume the former).

This queue caused the second leak, where many goroutines were stuck in waitpoll. Because the request was still being served, the request goroutine could not terminate.

It's unclear whether the K8s ingress restart triggered this condition, or whether that is a separate problem. The ingress will be restarted again at some point after this is fixed, so we can close this and keep an eye on goroutine counts, opening another issue later if there is still a bug.