Closed xqzhang2015 closed 3 years ago
Thanks @xqzhang2015 Let me think about this.
Out of curiosity, why do you set the timeout so short? I assume you are talking about ClientPolicy.Timeout?
We don't set such a timeout in PROD. It is only used for a similar but simpler reproduction.
Hi @khaf,
I get the partition map below, which contains a nil node for some partitions. According to the cluster stats, some nodes have only 2 ConnectionsAttempts.
The debug log never shows the following message. So is there any other way for the node pointer of a partition to be set to nil?
if err := clstr.getPartitions().validate(); err != nil {
	Logger.Debug("Error validating the cluster partition map after tend: %s", err.Error())
}
BTW, the Aerospike client has not recovered from this error state after running for 5 days.
Background: the connection timed out during init
cluster.go:151] New cluster was not initialized successfully, but the client will keep trying to connect to the database. Error: Connecting to the cluster timed out.
aerospike client version: v2.9.0
(dlv) p *(*"github.com/aerospike/aerospike-client-go.Partitions")(0xc001daf3c0) github.com/aerospike/aerospike-client-go.Partitions { Replicas: [][]*github.com/aerospike/aerospike-client-go.Node len: 2, cap: 2, [ [ *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420), *nil, *nil, *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0011482c0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), 
*(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), *nil, *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0), *nil, *nil, *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000558b00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0011482c0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0), *nil, ...+4032 more ], [ *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20), 
*(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000558b00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0011482c0), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), 
*(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340), *nil, *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000558b00), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0), *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420), ...+4032 more ], ], CPMode: false, regimes: []int len: 4096, cap: 4096, [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...+4032 more],}
cluster.stats map[string]*github.com/aerospike/aerospike-client-go.nodeStats [ "10.1.1.94:3000": *{ConnectionsAttempts: 2, ConnectionsSuccessful: 2, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 1, TendsTotal: 358342, TendsSuccessful: 358341, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 1, NodeRemoved: 0}, "10.1.1.75:3000": *{ConnectionsAttempts: 519, ConnectionsSuccessful: 519, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 518, TendsTotal: 358341, TendsSuccessful: 358340, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 2, NodeRemoved: 1}, "10.1.1.79:3000": *{ConnectionsAttempts: 535, ConnectionsSuccessful: 535, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 534, TendsTotal: 358341, TendsSuccessful: 358340, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 2, NodeRemoved: 1}, "10.1.1.78:3000": *{ConnectionsAttempts: 524, ConnectionsSuccessful: 524, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 523, TendsTotal: 358341, TendsSuccessful: 358340, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 2, NodeRemoved: 1}, "10.1.1.61:3000": *{ConnectionsAttempts: 495, ConnectionsSuccessful: 495, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 494, TendsTotal: 358343, TendsSuccessful: 358342, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 1, NodeRemoved: 0}, "10.1.1.96:3000": *{ConnectionsAttempts: 2, ConnectionsSuccessful: 2, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 1, TendsTotal: 358342, TendsSuccessful: 358341, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 1, NodeRemoved: 0},
Thanks for your report. I spent some time yesterday and today on this issue, but I need to talk to some of my colleagues before any conclusions. Will report back tomorrow.
After talking to a few people, it turns out there are rare instances in which starting or restarting nodes can cause this issue. I would imagine the same applies here, since a connection timeout can look like a node restart from the point of view of a client trying to connect to the database. We addressed this issue recently in v4.1.0 by forcing all of a node's peers to also refresh upon a partition table change. I can't assert with 100% certainty that it will address your issue, but it is worth a shot. I will keep these partition map issues open, and will get back to them as soon as I release the next Go client version.
@khaf thanks for the help.
My app has 5 Aerospike clients, and only one of them encounters this issue, starting from client initialization. All other app instances work well. An Aerospike node restarted about 10 hours earlier, and the cluster had recovered hours before my app started.
What I'm curious about is when clstr.partitionWriteMap could get a nil node pointer, because the setPartitions() func validates the new partition map, and an invalid one should not be set on the cluster struct. Could you provide any related code snippet or link?
func (clstr *Cluster) setPartitions(partMap partitionMap) {
	if err := partMap.validate(); err != nil {
		Logger.Error("Partition map error: %s.", err.Error())
	}
	clstr.partitionWriteMap.Store(partMap)
}
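One thing worth noting about the setPartitions snippet as quoted above: it logs the validation error and then still reaches the Store call, since no return is visible after the log line. Whether the real client code returns early cannot be determined from this thread. The sketch below only illustrates the difference between the two control flows. Every name in it (partitionMap as a map of strings, cluster, storeAlways, storeIfValid) is a hypothetical stand-in, not the client's actual types or API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// partitionMap is a hypothetical simplification: namespace -> node name per
// partition, where an empty string stands for a nil node pointer.
type partitionMap map[string][]string

// validate returns an error for the first partition without a node.
func (pm partitionMap) validate() error {
	for ns, nodes := range pm {
		for i, n := range nodes {
			if n == "" {
				return fmt.Errorf("namespace %s: partition %d has no node", ns, i)
			}
		}
	}
	return nil
}

// cluster is a hypothetical holder for the current partition map.
type cluster struct{ current atomic.Value }

// storeAlways mirrors the control flow of the snippet as quoted:
// the validation error is reported, but the map is stored anyway.
func (c *cluster) storeAlways(pm partitionMap) error {
	err := pm.validate()
	c.current.Store(pm)
	return err
}

// storeIfValid keeps the previously stored map when validation fails.
func (c *cluster) storeIfValid(pm partitionMap) error {
	if err := pm.validate(); err != nil {
		return err
	}
	c.current.Store(pm)
	return nil
}

func main() {
	good := partitionMap{"test": {"A", "B"}}
	bad := partitionMap{"test": {"A", ""}}

	var c1, c2 cluster
	c1.current.Store(good)
	c2.current.Store(good)

	fmt.Println("storeAlways error:", c1.storeAlways(bad))
	fmt.Println("storeAlways kept a nil slot:", c1.current.Load().(partitionMap)["test"][1] == "")

	fmt.Println("storeIfValid error:", c2.storeIfValid(bad))
	fmt.Println("storeIfValid kept a nil slot:", c2.current.Load().(partitionMap)["test"][1] == "")
}
```

With the first variant, a map containing an unassigned partition becomes the current map even though an error was reported. With the second, the last valid map stays in place.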
Sorry I've been very slow responding to this and related issues. I was pushing to release a new client. I'll dedicate next week to the partitionMap issues. Please keep the feedback coming.
@khaf thanks so much for your attention.
I have added Cluster.Healthy() in v5.2.0. Let me know if that helps.
We have encountered partition map issues (a partition with a nil node) at runtime more than once, for example about 1 hour after initialization, which leads to Get/BatchGet failures.
Client.IsConnected() can't reflect whether there is a nil node for a partition. Correct me if my understanding is wrong. Is it possible to add a flag to indicate health status?
func (clstr *Cluster) Healthy() bool {
	// checking partitionWriteMap
	// checking others?
	return true
}
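To make the "checking partitionWriteMap" step concrete, here is a minimal sketch of what such a check could look like. It is written against hypothetical stand-in types: Node and Partitions below mirror only the Replicas field visible in the debugger dump, and nilPartitionCount and healthy are illustrative names. It is not the implementation that shipped in the client.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the client's node type.
type Node struct{ Name string }

// Partitions mirrors only the Replicas field seen in the debugger dump:
// one row per replica, one node pointer per partition.
type Partitions struct {
	Replicas [][]*Node
}

// nilPartitionCount returns how many partition slots have no node assigned,
// summed over all replica rows.
func nilPartitionCount(p *Partitions) int {
	count := 0
	for _, replica := range p.Replicas {
		for _, node := range replica {
			if node == nil {
				count++
			}
		}
	}
	return count
}

// healthy reports whether every partition of every namespace has a node.
func healthy(partMap map[string]*Partitions) bool {
	for _, p := range partMap {
		if p == nil || nilPartitionCount(p) > 0 {
			return false
		}
	}
	return true
}

func main() {
	a := &Node{Name: "A"}
	good := map[string]*Partitions{
		"test": &Partitions{Replicas: [][]*Node{{a, a}, {a, a}}},
	}
	bad := map[string]*Partitions{
		"test": &Partitions{Replicas: [][]*Node{{a, nil}, {nil, a}}},
	}

	fmt.Println(healthy(good), healthy(bad))    // prints: true false
	fmt.Println(nilPartitionCount(bad["test"])) // prints: 2
}
```

An application could poll a check like this and recreate the client, or raise an alert, when it stays false for longer than a few tend intervals.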