aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0
429 stars 199 forks source link

Unbounded go routine spawning for new connections #347

Closed dmarkhas closed 3 years ago

dmarkhas commented 3 years ago

We've recently run into an issue where under high load, the number of gouroutines in our services would skyrocket (from 10k to 100k within seconds), and bring the service to a halt. Looking at the stack trace, all of the rogue goroutines are coming from getConnectionWithHint in node.go:629:

func (nd *Node) getConnectionWithHint(deadline time.Time, timeout time.Duration, hint byte) (conn *Connection, err error) {
...
    if conn == nil {
        go nd.makeConnectionForPool(hint)
        return nil, types.ErrConnectionPoolEmpty
    }
...
}

As you can see, this can (and in fact does) create an infinite amount of goroutines. Even though there are safe-guards against opening too many (concurrent) connections (i.e. OpeningConnectionThreshold and LimitConnectionsToQueueSize), they are only checked after the goroutine has been spawned.

I think it would make more sense to execute these checks before starting a new goroutine (or better yet, limit it as well with a simple worker pool)

khaf commented 3 years ago

Fair enough, although if you're getting that many goroutines, you're probably trying to bite way more than the number of your connections can chew. I'll take care of this ASAP.

dmarkhas commented 3 years ago

I'm still uncertain regarding the underlying cause. It started immediately after migrating our services from Go 1.13 to Go 1.15 (with no other code changes).

We have a single aerospike client that is shared by 10K go routines, it basically looks like this:

    for i := 0; i < 10000; i++ {
        go func() {
            for {
                select {
                case rec := <- recordsChan:
                    if err := client.PutBins(writePolicy, rec.key, rec.bin); err != nil {
                        log.Error("failure saving conversion to cache")
                    }
                case <-ctx.Done():
                    return
                }
            }
        }()
    }

Then, every once in a while we see a huge spike in go routines that are being spawned at go nd.makeConnectionForPool(hint). It's as if all of the connections in the pool become closed for some reason, and then each of the concurrent PutBins needs to create a new connection.

khaf commented 3 years ago

You can check the Client.Stats(). The values may shed some light on the issue.

khaf commented 3 years ago

Today's release (v4.5.0) should mitigate the issue. Please let me know how it goes.

dmarkhas commented 3 years ago

Thanks, we mostly mitigated it by reducing the # of concurrent goroutines to 100 (from 10000) without any significant impact on performance, but the latest fix will definitely help avoiding it in the future as well.