aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0

Client v6 occasionally crashes with concurrent map read and map write #399

Open Gaudeamus opened 1 year ago

Gaudeamus commented 1 year ago

Hello. Working with a remote cluster over a slow network, I regularly hit a problem that crashes the entire app. Client version: v6.12.0.

fatal error: concurrent map read and map write

goroutine 87917046 [running]:
github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137

The same issue occurred with v5:

github.com/aerospike/aerospike-client-go/v5.PartitionForWrite(0x0?, 0xc0a44b3830, 0xc4d432d800)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/partition.go:52 +0x96
github.com/aerospike/aerospike-client-go/v5.newExecuteCommand(0x7?, 0xc0a44b3830, 0x3?, {0x146b1ff, _}, {_, _}, _)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/execute_command.go:35 +0x9a
github.com/aerospike/aerospike-client-go/v5.(*Client).Execute(0xc3ec36ab40, 0xc01bc20970?, 0xc3bc5f?, {0x146b1ff, 0x9}, {0x146dec9, 0xb}, {0xc4d69cecb0, 0x1, 0x1})
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/client.go:738 +0x25c
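
For context on the error itself: the Go runtime raises `fatal error: concurrent map read and map write` when it detects one goroutine reading a built-in map while another writes to it. It is a fatal error rather than a panic, so `recover` cannot intercept it and the whole process exits. The sketch below is not the client's code; `guardedMap` and its methods are hypothetical names used only to illustrate the general remedy of guarding a shared map with a `sync.RWMutex`.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedMap is a hypothetical stand-in for any map shared between goroutines.
// It is NOT the Aerospike client's partition map.
type guardedMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func newGuardedMap() *guardedMap {
	return &guardedMap{m: make(map[string]int)}
}

// get takes the read lock, so many readers may proceed in parallel.
func (g *guardedMap) get(k string) (int, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	v, ok := g.m[k]
	return v, ok
}

// set takes the write lock, excluding all readers and other writers.
func (g *guardedMap) set(k string, v int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.m[k] = v
}

func main() {
	g := newGuardedMap()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func(i int) { defer wg.Done(); g.set("ns", i) }(i)
		go func() { defer wg.Done(); g.get("ns") }()
	}
	wg.Wait()

	_, ok := g.get("ns")
	fmt.Println(ok) // true: every writer finished before wg.Wait returned
}
```

Without the lock, the same access pattern can abort with the fatal error shown in the traces above; running tests with `go test -race` usually surfaces such accesses earlier.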

khaf commented 1 year ago

Thanks for your feedback, I've been working on it since yesterday. Can you provide a bit more information regarding your cluster? How many nodes do you have, and what usually prompts the issue (nodes departing the cluster, etc.)?

khaf commented 1 year ago

Also, could you please paste the whole panic message so that I can see where the concurrent reads and writes are occurring? The current panic message does not include the part where the write is happening.

Gaudeamus commented 1 year ago

Hi @khaf!

Cluster: 4 nodes running Ubuntu 20.04, Aerospike Enterprise Edition build 6.2.0.3

Here are all 4 Aerospike-related goroutines from the panic log:

        goroutine 87917046 [running]:
        github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
        github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc11530fa54?, {0xc25e780d20?, 0x12424a0?, 0xc7f41bb2d0?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc11530fa54, 0x6}, {0xc380a96618?, 0x12?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc11530fa54?, 0x0?}, {0xc380a96618?, 0x1?})
                /src/services/.../db/aerospike.go:64 +0xea

        goroutine 88112780 [runnable]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized.func1()
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489
        created by github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489 +0xcd

        goroutine 3062 [select]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized(0xc29bc9efc0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:515 +0x13a
        github.com/aerospike/aerospike-client-go/v6.NewCluster(0xc0a23a6d00, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:144 +0x72a
        github.com/aerospike/aerospike-client-go/v6.NewClientWithPolicyAndHost(0x12ecce0?, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:88 +0xf9
        .../aerospike.(*Session).connect.func1()
                /src/aerospike/aerospike.go:198 +0xbc
        .../aerospike.(*Session).connect(0xc0999c7a40, {0xc6095e6468, 0x3, 0x3})
                /src/aerospike/aerospike.go:209 +0x119
        .../aerospike.(*Session).watch.func1()
                /src/aerospike/aerospike.go:271 +0x229
        .../aerospike.(*Session).watch
                /src/aerospike/aerospike.go:241 +0x16a

        goroutine 65471388 [sleep]:
        time.Sleep(0xf4240)
                /usr/local/go/src/runtime/time.go:195 +0x135
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).executeAt(0xc056e81e40, {0x17aadb0, 0xc056e81e40}, 0xc0999c7a90, 0x0?, {0x0?, 0x0?, 0x217c960?}, 0x0?, 0x0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2472 +0x239
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).execute(0x0?, {0x17aadb0, 0xc056e81e40}, 0x0?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2440 +0x8a
        github.com/aerospike/aerospike-client-go/v6.(*readCommand).Execute(...)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:264
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012400400?, {0xc2aecac730, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:360 +0x259
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc2ee1a42b4?, {0xc2aecac730?, 0x12424a0?, 0xc29303ee70?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc2ee1a42b4, 0x6}, {0xc65be927e0?, 0x2084400?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc2ee1a42b4?, 0x0?}, {0xc65be927e0?, 0x18?})
                /src/services/.../db/aerospike.go:64 +0xea
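
In the dump above, goroutine 3062 is inside `Session.watch` and `Session.connect` building a new client, while goroutines 87917046 and 65471388 call `Get` through the same `Session` (`0xc0999c7a40`). The thread does not state a root cause, so the following is not a diagnosis. It is only a generic sketch of one way to let a background goroutine replace a shared handle while readers keep using it: publish an immutable value through `sync/atomic`. The names `handle`, `session`, `reconnect`, and `load` are hypothetical and do not mirror the reporter's code or the client's API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// handle is a hypothetical stand-in for a client object.
// It is never mutated after it has been published.
type handle struct{ generation int }

// session is hypothetical; it does not mirror the reporter's Session type.
type session struct {
	current atomic.Value // always holds a *handle
}

func newSession() *session {
	s := &session{}
	s.current.Store(&handle{generation: 0})
	return s
}

// reconnect publishes a fresh handle in a single atomic store.
func (s *session) reconnect(gen int) {
	s.current.Store(&handle{generation: gen})
}

// load returns whichever handle was most recently published.
func (s *session) load() *handle {
	return s.current.Load().(*handle)
}

func main() {
	s := newSession()
	var wg sync.WaitGroup

	// One "watcher" goroutine keeps replacing the handle.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for gen := 1; gen <= 100; gen++ {
			s.reconnect(gen)
		}
	}()

	// Several readers use the handle concurrently without locking.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = s.load().generation
			}
		}()
	}

	wg.Wait()
	fmt.Println(s.load().generation) // prints 100
}
```

Readers always see either the old or the new handle, never a half-updated one. The sketch deliberately leaves out closing the old handle, which would need to wait until in-flight readers are done with it.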

Below are two logs with some events and timestamps surrounding the crash.

Main log:

        34311  2023/03/24 22:48:50 last logger message
        ...crash...
        41762  2023/03/24 22:49:09 New logger started

aerospike.log

        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:32 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18548->10.152.129.108:3000: i/o timeout
        2023/03/24 22:41:54 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18734->10.152.129.108:3000: i/o timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:46:50 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:52:20 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: Timeout

khaf commented 1 year ago

Just a heads up: I think I have identified the root cause of this issue, and a potential fix is coming with the next release, early next week.

Gaudeamus commented 1 year ago

Hi @khaf! Could you give an update, please?

khaf commented 1 year ago

@Gaudeamus Sorry to have been unresponsive; I thought I had replied to you, and I don't know how I managed to miss your message. We were dealing with a few other issues, and releasing the Go client fell through the cracks. I will release the fix this week.