Closed · dmarkhas closed this issue 3 years ago
Fair enough, although if you're seeing that many goroutines, you're probably trying to bite off more than your connection pool can chew. I'll take care of this ASAP.
I'm still uncertain regarding the underlying cause. It started immediately after migrating our services from Go 1.13 to Go 1.15 (with no other code changes).
We have a single Aerospike client shared by 10,000 goroutines; the spawning code basically looks like this:

```go
for i := 0; i < 10000; i++ {
	go func() {
		for {
			select {
			case rec := <-recordsChan:
				if err := client.PutBins(writePolicy, rec.key, rec.bin); err != nil {
					log.Error("failure saving conversion to cache")
				}
			case <-ctx.Done():
				return
			}
		}
	}()
}
```
Then, every once in a while, we see a huge spike in goroutines spawned at `go nd.makeConnectionForPool(hint)`. It's as if all of the connections in the pool become closed for some reason, and each concurrent `PutBins` then has to create a new connection.
You can check `Client.Stats()`; the values may shed some light on the issue.
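A minimal sketch of dumping such a snapshot as JSON, assuming the client's `Stats()` returns a `map[string]interface{}` of metrics (the exact keys, such as "open-connections", vary by client version and are illustrative here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pollStats pretty-prints a stats snapshot as JSON. With a live client,
// the snapshot would come from client.Stats(); here we pass the map in
// directly so the sketch is self-contained.
func pollStats(stats map[string]interface{}) (string, error) {
	b, err := json.MarshalIndent(stats, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Stand-in snapshot; metric names are assumptions, not the real keys.
	sample := map[string]interface{}{
		"open-connections":     100,
		"connection-attempts":  3,
	}
	out, _ := pollStats(sample)
	fmt.Println(out)
}
```

Logging this periodically (e.g. once a minute) would show whether the pool is being drained and refilled around the spikes.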
Today's release (v4.5.0) should mitigate the issue. Please let me know how it goes.
Thanks, we mostly mitigated it by reducing the number of concurrent goroutines to 100 (from 10,000) without any significant impact on performance, but the latest fix will definitely help avoid it in the future as well.
We've recently run into an issue where, under high load, the number of goroutines in our services would skyrocket (from 10K to 100K within seconds) and bring the service to a halt. Looking at the stack traces, all of the rogue goroutines come from `getConnectionWithHint` in node.go:629:
As you can see, this can (and in fact does) create an unbounded number of goroutines. Even though there are safeguards against opening too many concurrent connections (i.e. `OpeningConnectionThreshold` and `LimitConnectionsToQueueSize`), they are only checked after the goroutine has been spawned. I think it would make more sense to execute these checks before starting a new goroutine (or, better yet, bound it with a simple worker pool).
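The check-before-spawn idea could look roughly like the following. This is a sketch, not the library's actual code: `connGate`, `tryOpen`, and the limit semantics are hypothetical stand-ins for the real `OpeningConnectionThreshold` check, implemented here with a buffered-channel semaphore.

```go
package main

import (
	"errors"
	"fmt"
)

// connGate caps in-flight connection-opening goroutines. The slot is
// acquired BEFORE the goroutine is spawned, which is the ordering the
// report argues for.
type connGate struct {
	sem chan struct{}
}

func newConnGate(limit int) *connGate {
	return &connGate{sem: make(chan struct{}, limit)}
}

var errTooManyOpening = errors.New("too many concurrent connection attempts")

// tryOpen starts open() in a goroutine only if we are under the limit;
// otherwise it fails fast instead of spawning an unbounded goroutine.
func (g *connGate) tryOpen(open func()) error {
	select {
	case g.sem <- struct{}{}:
		go func() {
			defer func() { <-g.sem }() // release the slot when done
			open()
		}()
		return nil
	default:
		return errTooManyOpening
	}
}

func main() {
	g := newConnGate(2)
	block := make(chan struct{})
	started := make(chan struct{})
	// Fill both slots with goroutines that block until released.
	for i := 0; i < 2; i++ {
		g.tryOpen(func() { started <- struct{}{}; <-block })
	}
	<-started
	<-started
	// The third attempt is rejected up front; no goroutine is spawned.
	fmt.Println(g.tryOpen(func() {}))
	close(block)
}
```

Because the slot is reserved synchronously inside `tryOpen`, a burst of callers can never spawn more than `limit` opening goroutines, regardless of how many `PutBins` calls race for a connection.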