aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0
430 stars 198 forks source link

block the goroutine #323

Closed linchuan4028 closed 3 years ago

linchuan4028 commented 3 years ago

below is the stack it's show the 361 minutes block the issue is found in production as we encounter one aerospike node down. It doesn't recovered even though the node back

goroutine 850040134 [semacquire, 361 minutes]:
sync.runtime_SemacquireMutex(0xc0003723bc, 0x1, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0003723b8)
        /usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
github.com/aerospike/aerospike-client-go.(*singleConnectionHeap).Poll(0xc000372390, 0x0)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/connection_heap.go:86 +0x103
github.com/aerospike/aerospike-client-go.(*connectionHeap).Poll(0xc000c460e0, 0x1fbb152, 0x6f68245f5)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/connection_heap.go:222 +0x8a
github.com/aerospike/aerospike-client-go.(*Node).getConnectionWithHint(0xc000c46000, 0xbfe248ca5478dcb8, 0xe44cd0191e9f, 0x1fbb160, 0x6f68245f5, 0xc009e5ee52, 0xcc1928, 0xc008e5fac0, 0xc00024c200)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/node.go:603 +0x52
github.com/aerospike/aerospike-client-go.(*singleCommand).getConnection(0xc00c7d3360, 0x16de980, 0xc00c7d32c0, 0x0, 0x0, 0x0)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/single_command.go:39 +0xd2
github.com/aerospike/aerospike-client-go.(*baseCommand).executeAt(0xc00c7d3360, 0x17054e0, 0xc00c7d3360, 0xc00c7d32c0, 0xc00c7d3200, 0xbfe248c411f75cbf, 0xe446fb79e461, 0x1fbb160, 0x0, 0x0, ...)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/command.go:1863 +0x30b
github.com/aerospike/aerospike-client-go.(*baseCommand).execute(0xc00c7d3360, 0x17054e0, 0xc00c7d3360, 0xc0065ac000, 0x1, 0x1)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/command.go:1795 +0xda
github.com/aerospike/aerospike-client-go.(*writeCommand).Execute(...)
        /root/go/pkg/mod/github.com/aerospike/aerospike-client-go@v3.1.0+incompatible/write_command.go:105
github.com/aerospike/aerospike-client-go.(*Client).PutBins(0xc000277c40, 0xc00c7d32c0, 0xc000644480, 0xc0065ac098, 0x1, 0x1, 0xc000644480, 0x0)
linchuan4028 commented 3 years ago

I guess this code might be the root cause

if the h.data is nil, there is no chance to call h.mutex.Unlock()

func (h *singleConnectionHeap) Poll() (res *Connection) {
    h.mutex.Lock()

    // the heap has been cleaned up
    if h.data == nil {
        return nil
    }

    // if heap is not empty
    if (h.tail != h.head) || h.full {
        res = h.data[h.head]
        h.data[h.head] = nil

        h.full = false
        if h.head == 0 {
            h.head = h.size - 1
        } else {
            h.head--
        }
    }

    h.mutex.Unlock()
    return res
}
linchuan4028 commented 3 years ago

https://github.com/aerospike/aerospike-client-go/pull/324/files

khaf commented 3 years ago

Thanks for your PR. I will reject it because we have been avoiding defer in hot path on purpose, but will include the fix and release it today.

khaf commented 3 years ago

The fix was released in v3.1.1 today.

xqzhang2015 commented 3 years ago

Hi @khaf I'm interested in the purpose to avoid defer here. What's the benefit of it? It seems using defer will not increase the locking duration for this function.

Thanks ahead.

khaf commented 3 years ago

Defer has its own overhead, that's why I've avoided it in the hot path. As you can see, I have used it in other parts of the code. Keep in mind that some of this may not show up in consumer CPUs. Our benchmarks were on a dual processor system with high core (64) CPUs.