aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0

Client v6 occasionally crashes with concurrent map read and map write #399

Open Gaudeamus opened 1 year ago

Gaudeamus commented 1 year ago

Hello, working with a remote cluster over a slow network, I'm regularly hitting a problem that crashes the entire app. Client version: v6.12.0

fatal error: concurrent map read and map write

goroutine 87917046 [running]:
github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
        /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137

The same issue occurred with v5:

github.com/aerospike/aerospike-client-go/v5.PartitionForWrite(0x0?, 0xc0a44b3830, 0xc4d432d800)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/partition.go:52 +0x96
github.com/aerospike/aerospike-client-go/v5.newExecuteCommand(0x7?, 0xc0a44b3830, 0x3?, {0x146b1ff, _}, {_, _}, _)
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/execute_command.go:35 +0x9a
github.com/aerospike/aerospike-client-go/v5.(*Client).Execute(0xc3ec36ab40, 0xc01bc20970?, 0xc3bc5f?, {0x146b1ff, 0x9}, {0x146dec9, 0xb}, {0xc4d69cecb0, 0x1, 0x1})
        /src/dsp/vendor/github.com/aerospike/aerospike-client-go/v5/client.go:738 +0x25c
khaf commented 1 year ago

Thanks for your feedback, I've been working on it since yesterday. Can you provide a bit more information regarding your cluster? How many nodes do you have, and what usually prompts the issue (nodes departing the cluster, etc.)?

khaf commented 1 year ago

Also, could you please paste the whole panic message so that I can see where the concurrent reads and writes are occurring? The current panic message does not include the part where the write is happening.

Gaudeamus commented 1 year ago

Hi @khaf!

Cluster: 4 nodes running Ubuntu 20.04, Aerospike Enterprise Edition build 6.2.0.3

Here are all 4 Aerospike-related goroutines from the panic log:

        goroutine 87917046 [running]:
        github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
        github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc11530fa54?, {0xc25e780d20?, 0x12424a0?, 0xc7f41bb2d0?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc11530fa54, 0x6}, {0xc380a96618?, 0x12?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc11530fa54?, 0x0?}, {0xc380a96618?, 0x1?})
                /src/services/.../db/aerospike.go:64 +0xea

        goroutine 88112780 [runnable]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized.func1()
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489
        created by github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489 +0xcd

        goroutine 3062 [select]:
        github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized(0xc29bc9efc0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:515 +0x13a
        github.com/aerospike/aerospike-client-go/v6.NewCluster(0xc0a23a6d00, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:144 +0x72a
        github.com/aerospike/aerospike-client-go/v6.NewClientWithPolicyAndHost(0x12ecce0?, {0xc6095e6468, 0x3, 0x3})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:88 +0xf9
        .../aerospike.(*Session).connect.func1()
                /src/aerospike/aerospike.go:198 +0xbc
        .../aerospike.(*Session).connect(0xc0999c7a40, {0xc6095e6468, 0x3, 0x3})
                /src/aerospike/aerospike.go:209 +0x119
        .../aerospike.(*Session).watch.func1()
                /src/aerospike/aerospike.go:271 +0x229
        .../aerospike.(*Session).watch
                /src/aerospike/aerospike.go:241 +0x16a

        goroutine 65471388 [sleep]:
        time.Sleep(0xf4240)
                /usr/local/go/src/runtime/time.go:195 +0x135
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).executeAt(0xc056e81e40, {0x17aadb0, 0xc056e81e40}, 0xc0999c7a90, 0x0?, {0x0?, 0x0?, 0x217c960?}, 0x0?, 0x0)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2472 +0x239
        github.com/aerospike/aerospike-client-go/v6.(*baseCommand).execute(0x0?, {0x17aadb0, 0xc056e81e40}, 0x0?)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2440 +0x8a
        github.com/aerospike/aerospike-client-go/v6.(*readCommand).Execute(...)
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:264
        github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012400400?, {0xc2aecac730, 0x5, 0x5})
                /src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:360 +0x259
        .../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc2ee1a42b4?, {0xc2aecac730?, 0x12424a0?, 0xc29303ee70?})
                /src/aerospike/client.go:126 +0x98
        .../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc2ee1a42b4, 0x6}, {0xc65be927e0?, 0x2084400?})
                /src/aerospike/data.go:58 +0x14c
        .../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc2ee1a42b4?, 0x0?}, {0xc65be927e0?, 0x18?})
                /src/services/.../db/aerospike.go:64 +0xea

Below are two logs with some events and timestamps surrounding the crash. Main log:

        34311  2023/03/24 22:48:50 last logger message
        ...crash...
        41762  2023/03/24 22:49:09 New logger started

aerospike.log

        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
        2023/03/24 22:41:32 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18548->10.152.129.108:3000: i/o timeout
        2023/03/24 22:41:54 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
        read tcp 10.152.136.52:18734->10.152.129.108:3000: i/o timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:46:50 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: network error. Checked the wrapped error for detail
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
        2023/03/24 22:52:20 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
        ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: Timeout
khaf commented 1 year ago

Just a heads up that I think I have identified the root cause of this issue, and the potential fix is coming with the next release early next week.

Gaudeamus commented 1 year ago

Hi @khaf! Could you give an update, please?

khaf commented 1 year ago

@Gaudeamus Sorry to have been unresponsive, I thought I had replied to you. I just don't know how I managed to miss your message. We were dealing with a few other issues and releasing the Go client kind of fell through the cracks. I will release the fix this week.

khaf commented 1 month ago

Sorry that this issue has fallen through the cracks. I just went through the client code, and it is not immediately clear to me how this issue can happen outside of constantly closing and opening a new client object.

@Gaudeamus Are you creating a new Client object on each request? Or are you creating one and reusing that same object for all your requests?

Gaudeamus commented 1 month ago

The Client is stored and reused after creation, although there is a watcher routine that periodically creates a new client and closes the old one. Due to poor connectivity, some nodes often become unavailable.

khaf commented 3 weeks ago

@Gaudeamus I see two issues here:

  1. It seems that you are closing the client while there are still transactions ongoing. That would cause the race condition you are seeing here.
  2. The issue of some nodes becoming unavailable is concerning to me. Is that a client issue or an infra issue? Do you mean that the client does not keep up with the state of the cluster and never connects to some new nodes?
Gaudeamus commented 3 weeks ago
  1. After your last questions it occurred to me that I need to be more careful with switching clients, so I modified this part so that the old client is closed only after a 1 minute delay from when the new one was created. I don't know if there is a better solution, but it seems to be working. Please point me to any related documentation.

  2. It's an infra issue. We have a Consul client that holds Aerospike's active hosts. If a network or other issue changes the list, we reconnect as described above. After deeper investigation I found that the majority of node-list changes are very brief, so I can rely on the Aerospike client's own node-status tracking instead of setting up a new client; the client does seem to track node states and reconnects to old nodes once a network issue resolves. I assume the client works well with a fixed list of hosts, but we still need this reconnect logic in case permanent changes are made to the cluster and the hosts end up differing from the original list. Could you please suggest a reliable way to manage this?

khaf commented 3 weeks ago
  1. So I assume the issue is resolved.
  2. People usually use a DNS round-robin and rotate the nodes in and out of it; you then use just the one seed (or two for availability) and the rest will work from there.
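The DNS round-robin approach can be sketched like this. The helper below is a hypothetical illustration covering only the resolution step (it is not part of the client library), and `aerospike.internal` stands for whatever rotating record ops maintains:

```go
package main

import (
	"fmt"
	"net"
)

// seedHostsFromDNS resolves a round-robin DNS name into "host:port"
// seed strings. The Aerospike client needs only one or two live seeds;
// it discovers the rest of the cluster from them, so ops can rotate
// nodes in and out of the DNS record instead of the application
// tearing down and rebuilding clients.
func seedHostsFromDNS(name string, port int) ([]string, error) {
	addrs, err := net.LookupHost(name) // e.g. "aerospike.internal"
	if err != nil {
		return nil, err
	}
	seeds := make([]string, 0, len(addrs))
	for _, a := range addrs {
		// JoinHostPort handles IPv6 bracketing as well.
		seeds = append(seeds, net.JoinHostPort(a, fmt.Sprint(port)))
	}
	return seeds, nil
}
```

The resulting addresses would then be turned into seed hosts for the client (e.g. via `as.NewHost` and `as.NewClientWithPolicyAndHost`, as in the stack traces above); resolving once at startup is enough, since the client tracks cluster membership from there.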