DiceDB / dice

DiceDB is a redis-compliant, reactive, scalable, highly-available, unified cache optimized for modern hardware.
https://dicedb.io/

Bug - Goroutine Leaks in Worker #1297

Closed psrvere closed 1 day ago

psrvere commented 3 days ago

What is the issue?

While running the integration tests against the RESP server, I noticed goroutine leaks. As the integration tests progress, the number of goroutines keeps increasing.

Steps to reproduce

Use pprof to inspect goroutine traces. Add it to the TestMain function like this:

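import (
    "fmt"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
    "testing"
)

func TestMain(m *testing.M) {
    // Serve the pprof endpoints in the background while the tests run.
    go func() {
        fmt.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of the existing TestMain ...
}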

The profile is now available at http://localhost:6060/debug/pprof/ at any point while the program is running. Add a time.Sleep to pause the program at a particular point, preferably at the start of a test, so that you can see the goroutine traces left over after the previous test has completed. You can run the entire test suite or a few selected tests.

If we put the test to sleep at the start of the first test, TestAPPEND, we see all the initial goroutines started by TestMain. These are the 13 goroutines:

A. 8 goroutines are spawned directly from the dicedb codebase.

Nested goroutines are child goroutines.

B. 5 other goroutines

2 goroutines related to pprof:

1 goroutine for time.Sleep

1 goroutine to handle low-level signals - Sigqueue

1 goroutine from http/server.go, probably to support HTTP servers

Ideally, the number of goroutines should fall back to these 13 once a test has finished executing. Instead, you will notice that the number keeps going up as you execute more tests.

Note: You might see 15 goroutines instead of 13, because TestMain also acquires a connection to fire the ABORT command after all tests have executed. TestMain should acquire this connection just before firing the command. I will be fixing that too as part of this issue, so I am ignoring it here.

Why is it happening?

Each test gets a connection using the getLocalConnection function. This request is received by the AcceptConnectionRequests goroutine, which creates a new worker goroutine (G1), BaseWorker.Start. This worker goroutine creates another goroutine (G2), BaseWorker.Read, to read responses from the server.

So every time a connection is requested, 2 new goroutines are created. G2 reads and G1 processes. G2 synchronises with G1 before exiting, but G1 doesn't synchronise with G2. In fact, G1 can return for various reasons without communicating this to G2, which results in leaks.

psrvere commented 3 days ago

I will raise a fix PR for this shortly.