Open Gaudeamus opened 1 year ago
Thanks for your feedback, I've been working on it since yesterday. Can you provide a bit more information regarding your cluster? How many nodes do you have, and what usually prompts the issue (nodes departing the cluster, etc.)?
Also, could you please paste the whole panic message so that I can see where the concurrent reads and writes are occurring? The current panic message does not include the part where the write is happening.
Hi @khaf!
Cluster: 4 nodes, Ubuntu 20.04, Aerospike Enterprise Edition build 6.2.0.3
Here are all 4 Aerospike-related goroutines from the panic log:
goroutine 87917046 [running]:
github.com/aerospike/aerospike-client-go/v6.PartitionForRead(0xc0123d3cb8?, 0xc0999c7a90, 0xc059c39680)
/src/vendor/github.com/aerospike/aerospike-client-go/v6/partition.go:66 +0xa5
github.com/aerospike/aerospike-client-go/v6.newReadCommand(0xc317e6f200, 0xc0999c7a90, 0xc059c39680, {0xc25e780d20?, 0x5, 0x5}, 0x10?)
/src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:53 +0x90
github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012580000?, {0xc25e780d20, 0x5, 0x5})
/src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:355 +0x137
.../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc11530fa54?, {0xc25e780d20?, 0x12424a0?, 0xc7f41bb2d0?})
/src/aerospike/client.go:126 +0x98
.../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc11530fa54, 0x6}, {0xc380a96618?, 0x12?})
/src/aerospike/data.go:58 +0x14c
.../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc11530fa54?, 0x0?}, {0xc380a96618?, 0x1?})
/src/services/.../db/aerospike.go:64 +0xea
goroutine 88112780 [runnable]:
github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized.func1()
/src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489
created by github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized
/src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:489 +0xcd
goroutine 3062 [select]:
github.com/aerospike/aerospike-client-go/v6.(*Cluster).waitTillStabilized(0xc29bc9efc0)
/src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:515 +0x13a
github.com/aerospike/aerospike-client-go/v6.NewCluster(0xc0a23a6d00, {0xc6095e6468, 0x3, 0x3})
/src/vendor/github.com/aerospike/aerospike-client-go/v6/cluster.go:144 +0x72a
github.com/aerospike/aerospike-client-go/v6.NewClientWithPolicyAndHost(0x12ecce0?, {0xc6095e6468, 0x3, 0x3})
/src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:88 +0xf9
.../aerospike.(*Session).connect.func1()
/src/aerospike/aerospike.go:198 +0xbc
.../aerospike.(*Session).connect(0xc0999c7a40, {0xc6095e6468, 0x3, 0x3})
/src/aerospike/aerospike.go:209 +0x119
.../aerospike.(*Session).watch.func1()
/src/aerospike/aerospike.go:271 +0x229
.../aerospike.(*Session).watch
/src/aerospike/aerospike.go:241 +0x16a
goroutine 65471388 [sleep]:
time.Sleep(0xf4240)
/usr/local/go/src/runtime/time.go:195 +0x135
github.com/aerospike/aerospike-client-go/v6.(*baseCommand).executeAt(0xc056e81e40, {0x17aadb0, 0xc056e81e40}, 0xc0999c7a90, 0x0?, {0x0?, 0x0?, 0x217c960?}, 0x0?, 0x0)
/src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2472 +0x239
github.com/aerospike/aerospike-client-go/v6.(*baseCommand).execute(0x0?, {0x17aadb0, 0xc056e81e40}, 0x0?)
/src/vendor/github.com/aerospike/aerospike-client-go/v6/command.go:2440 +0x8a
github.com/aerospike/aerospike-client-go/v6.(*readCommand).Execute(...)
/src/vendor/github.com/aerospike/aerospike-client-go/v6/read_command.go:264
github.com/aerospike/aerospike-client-go/v6.(*Client).Get(0xc318983da0, 0x50?, 0xc012400400?, {0xc2aecac730, 0x5, 0x5})
/src/vendor/github.com/aerospike/aerospike-client-go/v6/client.go:360 +0x259
.../aerospike.(*Client).Get(0x14dced9?, 0xc?, 0xc2ee1a42b4?, {0xc2aecac730?, 0x12424a0?, 0xc29303ee70?})
/src/aerospike/client.go:126 +0x98
.../aerospike.(*Session).GetDataByClientUserId(0xc0999c7a40, {0xc2ee1a42b4, 0x6}, {0xc65be927e0?, 0x2084400?})
/src/aerospike/data.go:58 +0x14c
.../db.aerospikeService.GetClientData({{0x7f6016179038?, 0xc0999c7a40?}, 0xc0a1322230?}, {0xc2ee1a42b4?, 0x0?}, {0xc65be927e0?, 0x18?})
/src/services/.../db/aerospike.go:64 +0xea
Below are two logs with some events and timestamps surrounding the crash.

main log:
34311 2023/03/24 22:48:50 last logger message
...crash...
41762 2023/03/24 22:49:09 New logger started
aerospike.log
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:22 ResultCode: INVALID_NAMESPACE, Iteration: 0, InDoubt: false, Node: <nil>: Partition map empty
2023/03/24 22:41:32 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
read tcp 10.152.136.52:18548->10.152.129.108:3000: i/o timeout
2023/03/24 22:41:54 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: Timeout
read tcp 10.152.136.52:18734->10.152.129.108:3000: i/o timeout
2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: network error. Checked the wrapped error for detail
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9BDFE2A4B43E4 10.152.152.92:3000: network error. Checked the wrapped error for detail
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
2023/03/24 22:45:49 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB915433E4B43E4 10.152.131.236:3000: network error. Checked the wrapped error for detail
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
2023/03/24 22:46:50 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB901423E4B43E4 10.152.129.108:3000: network error. Checked the wrapped error for detail
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
2023/03/24 22:52:20 ResultCode: TIMEOUT, Iteration: 1, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: command execution timed out on client: See `Policy.Timeout`
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: BB9B1C44D4B43E4 10.152.129.148:3000: Timeout
Just a heads up that I think I have identified the root cause of this issue, and the potential fix is coming with the next release early next week.
Hi @khaf! Could you give us an update, please?
@Gaudeamus Sorry for being unresponsive, I thought I had replied to you. I just don't know how I managed to miss your message. We were dealing with a few other issues, and releasing the Go client kind of fell through the cracks. I will release the fix this week.
Sorry that this issue has fallen through the cracks. I just went through the client code, and it is not immediately clear to me how this issue can happen outside of constantly closing and opening a new client object.
@Gaudeamus Are you creating a new Client object on each request? Or are you creating one and reusing that same object for all your requests?
The Client is stored and reused after creation, although there is a watcher routine that periodically creates a new client and closes the old one. Because of poor connectivity, some nodes often become unavailable.
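To make the setup concrete, here is a minimal sketch of the swap pattern I described (illustrative only, not our actual code; the Session and watch names are simplified stand-ins):

```go
// Minimal sketch of the client-swap pattern described above (illustrative
// only, not our actual code). The shared client is replaced whenever a new
// host list arrives, and the old client is closed immediately, so requests
// still holding the old client can race with Close().
package example

import (
	"sync"

	as "github.com/aerospike/aerospike-client-go/v6"
)

type Session struct {
	mu     sync.RWMutex
	client *as.Client
}

// Client returns the currently active client for request handlers to use.
func (s *Session) Client() *as.Client {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.client
}

// watch rebuilds the client whenever a new host list arrives (from Consul
// in our case) and closes the previous one right away.
func (s *Session) watch(hostUpdates <-chan []*as.Host) {
	for hosts := range hostUpdates {
		newClient, err := as.NewClientWithPolicyAndHost(as.NewClientPolicy(), hosts...)
		if err != nil {
			continue // keep using the old client if the new one fails to connect
		}

		s.mu.Lock()
		old := s.client
		s.client = newClient
		s.mu.Unlock()

		if old != nil {
			old.Close() // in-flight requests may still be using `old` here
		}
	}
}
```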
@Gaudeamus I see two issues here:
After your last questions it occurred to me that I need to be more careful when switching clients, so I modified this part so that the old client is closed only after a one-minute delay from when the new one was created. I don't know if there is a better solution, but it seems to be working. Please point me to any related documentation.
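Roughly what the change looks like (a sketch only; it builds on the Session type from the sketch above, needs the "time" import, and the one-minute grace period is arbitrary):

```go
// Delayed-close variant (illustrative sketch, reusing the Session type from
// the earlier sketch). The old client stays alive for a grace period so that
// in-flight requests can finish before it is closed.
func (s *Session) swapClient(newClient *as.Client) {
	s.mu.Lock()
	old := s.client
	s.client = newClient
	s.mu.Unlock()

	if old != nil {
		go func(c *as.Client) {
			time.Sleep(time.Minute) // arbitrary grace period for in-flight requests
			c.Close()
		}(old)
	}
}
```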
It's an infra issue. We have a Consul client that holds Aerospike's active hosts, and if a network or other issue changes that list, we reconnect as described above. After deeper investigation I found that the majority of node-list changes are very brief, so I can rely on the Aerospike client's own node-status tracking without setting up a new client: the client appears to track node states and reconnect to old nodes once a network issue resolves. I assume the client should work well with a fixed list of hosts, but we still need this logic in case permanent changes are made to the cluster and the hosts end up differing from the original list. Could you please suggest a reliable way to manage this?
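For illustration, this is roughly what I mean by relying on the client's own node tracking (a sketch under my assumptions; hostsFromConsul is a hypothetical stand-in for however we fetch the seed list):

```go
// Sketch of creating the client once from seed hosts and letting it tend the
// cluster itself (illustrative only; hostsFromConsul is a hypothetical helper).
// Brief node-list changes should not require a new client, since the client
// keeps tracking node state in the background.
package example

import (
	"log"
	"time"

	as "github.com/aerospike/aerospike-client-go/v6"
)

// hostsFromConsul is a hypothetical stand-in for however the seed host list
// is obtained (in our case, from Consul).
func hostsFromConsul() []*as.Host {
	return []*as.Host{as.NewHost("10.152.129.108", 3000)}
}

func connectOnce() (*as.Client, error) {
	policy := as.NewClientPolicy()
	policy.Timeout = 5 * time.Second // connection timeout used for cluster tending

	// Any reachable seed is enough; the client discovers the rest of the
	// cluster and keeps its node list up to date on its own.
	client, err := as.NewClientWithPolicyAndHost(policy, hostsFromConsul()...)
	if err != nil {
		return nil, err
	}
	log.Println("connected, nodes:", client.GetNodes())
	return client, nil
}
```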
Hello, working with a remote cluster over a slow network, I'm facing a recurring problem that causes the entire app to crash. Client version: v6.12.0.
The same issue occurred with v5.