aerospike / aerospike-client-go

Aerospike Client Go
Apache License 2.0

Expose the healthy status of partitionWriteMap #334

Closed xqzhang2015 closed 3 years ago

xqzhang2015 commented 3 years ago

We have more than once run into partition map issues (a partition with a nil node) at runtime (for example, 1 hour after initialization), which leads to Get/BatchGet failures.

Client.IsConnected() doesn't reflect whether there is a nil node for a partition. Correct me if my understanding is wrong. Is it possible to add a flag that indicates the health status?

    func (clstr *Cluster) Healthy() bool {
        // checking partitionWriteMap
        // checking others?
        return true
    }
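As a rough illustration of what such a check could look at, here is a minimal, self-contained sketch. The `node` and `partitions` types and the `healthy` and `nilCount` helpers are hypothetical stand-ins written for this example, not the client's real types or API; the only assumption taken from the thread is that a partition map holds, per namespace, one slice of node pointers per replica, and that a nil entry means the partition has no known owner.

```go
package main

import "fmt"

// node and partitions are hypothetical stand-ins for the client's internal
// types; only the fields needed for this check are modelled.
type node struct{ name string }

type partitions struct {
	// replicas[r][p] is the node that owns partition p for replica r.
	replicas [][]*node
}

// nilCount returns how many replica slots have no node assigned.
func (p *partitions) nilCount() int {
	n := 0
	for _, replica := range p.replicas {
		for _, nd := range replica {
			if nd == nil {
				n++
			}
		}
	}
	return n
}

// healthy reports whether every partition of every namespace has a node
// in every replica.
func healthy(pm map[string]*partitions) bool {
	for _, p := range pm {
		if p.nilCount() > 0 {
			return false
		}
	}
	return true
}

func main() {
	a, b := &node{"A"}, &node{"B"}
	good := map[string]*partitions{"test": {replicas: [][]*node{{a, b}, {b, a}}}}
	bad := map[string]*partitions{"test": {replicas: [][]*node{{a, nil}, {nil, a}}}}
	fmt.Println(healthy(good), healthy(bad), bad["test"].nilCount())
}
```

A check like this only says that the map is complete at the moment it is read; it says nothing about whether the listed nodes are reachable.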



* Reproducing the partitionWriteMap error
  * set a very short timeout (so that fetching all nodes/peers always fails, leaving some nodes partially failed)
  * output debug logs
  * get keys continuously

khaf commented 3 years ago

Thanks @xqzhang2015. Let me think about this.

Out of curiosity, why do you set the timeout so short? I assume you are talking about ClientPolicy.Timeout?

xqzhang2015 commented 3 years ago

We don't set such a timeout in production. It is only there to reproduce a similar failure in a simple way.

xqzhang2015 commented 3 years ago

Hi @khaf,

I captured this partition map, which contains nil nodes for some partitions. According to the cluster stats, some nodes have only 2 ConnectionsAttempts.

According to the debug log, the following message is never logged. So is there any other code path that could set the node pointer of a partition to nil?

    if err := clstr.getPartitions().validate(); err != nil {
        Logger.Debug("Error validating the cluster partition map after tend: %s", err.Error())
    }

BTW, the Aerospike client still had not recovered from this erroneous state after running for 5 days.

p.cluster.partitionWriteMap
(dlv) p *(*"github.com/aerospike/aerospike-client-go.Partitions")(0xc001daf3c0)
github.com/aerospike/aerospike-client-go.Partitions {
    Replicas: [][]*github.com/aerospike/aerospike-client-go.Node len: 2, cap: 2, [
        [
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420),
            *nil,
            *nil,
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0011482c0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *nil,
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0),
            *nil,
            *nil,
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000558b00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0011482c0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0),
            *nil,
            ...+4032 more
        ],
        [
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b1d1e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000558b00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0011482c0),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000aa3080),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00025cb00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243a20),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000d14420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000c2c6e0),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc001522580),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0015226e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc0004fa420),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000a249a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000243340),
            *nil,
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000b946e0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000558b00),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc00071a9a0),
            *(*"github.com/aerospike/aerospike-client-go.Node")(0xc000f62420),
            ...+4032 more
        ],
    ],
    CPMode: false,
    regimes: []int len: 4096, cap: 4096, [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...+4032 more],}
node stats
cluster.stats
map[string]*github.com/aerospike/aerospike-client-go.nodeStats [
    "10.1.1.94:3000": *{ConnectionsAttempts: 2, ConnectionsSuccessful: 2, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 1, TendsTotal: 358342, TendsSuccessful: 358341, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 1, NodeRemoved: 0},
    "10.1.1.75:3000": *{ConnectionsAttempts: 519, ConnectionsSuccessful: 519, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 518, TendsTotal: 358341, TendsSuccessful: 358340, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 2, NodeRemoved: 1},
    "10.1.1.79:3000": *{ConnectionsAttempts: 535, ConnectionsSuccessful: 535, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 534, TendsTotal: 358341, TendsSuccessful: 358340, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 2, NodeRemoved: 1},
    "10.1.1.78:3000": *{ConnectionsAttempts: 524, ConnectionsSuccessful: 524, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 523, TendsTotal: 358341, TendsSuccessful: 358340, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 2, NodeRemoved: 1},
    "10.1.1.61:3000": *{ConnectionsAttempts: 495, ConnectionsSuccessful: 495, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 494, TendsTotal: 358343, TendsSuccessful: 358342, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 1, NodeRemoved: 0},
    "10.1.1.96:3000": *{ConnectionsAttempts: 2, ConnectionsSuccessful: 2, ConnectionsFailed: 1, ConnectionsPoolEmpty: 0, ConnectionsOpen: 0, ConnectionsClosed: 1, TendsTotal: 358342, TendsSuccessful: 358341, TendsFailed: 1, PartitionMapUpdates: 1, NodeAdded: 1, NodeRemoved: 0},
khaf commented 3 years ago

Thanks for your report. I spent some time yesterday and today on this issue, but I need to talk to some of my colleagues before any conclusions. Will report back tomorrow.

khaf commented 3 years ago

After talking to a few people, it turns out there are rare instances in which restarting nodes can cause this issue. I would imagine the same applies here, since it can look like a node restart from the point of view of a client trying to connect to the database. We addressed this issue recently in v4.1.0 by forcing all of a node's peers to also refresh upon a partition table change. I can't say with 100% certainty that it will fix your issue, but it is worth a shot. I will keep these partition map issues open and will get back to them as soon as I release the next Go client version.
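The idea of refreshing a node's peers when its partition table changes can be sketched roughly as follows. This is an illustration only, not the v4.1.0 implementation: the `node` type, its fields, and `observeGeneration` are hypothetical names, and the sketch assumes the client notices a partition table change by comparing a per-node generation counter between tends.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a cluster node as seen by the client.
type node struct {
	name           string
	partitionGen   int  // last partition generation seen from this node
	refreshPending bool // true when the partition table must be re-fetched
	peers          []*node
}

// observeGeneration records the generation reported during a tend. When it
// differs from the last one seen, the node and all of its peers are flagged
// for a partition refresh, not just the node that changed.
func (n *node) observeGeneration(gen int) {
	if gen == n.partitionGen {
		return
	}
	n.partitionGen = gen
	n.refreshPending = true
	for _, p := range n.peers {
		p.refreshPending = true
	}
}

func main() {
	a, b, c := &node{name: "A"}, &node{name: "B"}, &node{name: "C"}
	a.peers = []*node{b, c}

	a.observeGeneration(0) // unchanged generation: nothing is flagged
	fmt.Println(a.refreshPending, b.refreshPending, c.refreshPending)

	a.observeGeneration(1) // changed generation: A and its peers are flagged
	fmt.Println(a.refreshPending, b.refreshPending, c.refreshPending)
}
```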

xqzhang2015 commented 3 years ago

@khaf thanks for the help.

My app has 5 Aerospike clients, and only one of them hits this issue, starting from client initialization. All other app instances work well. An Aerospike node had restarted about 10 hours earlier, and the cluster had recovered hours before my app started.

What I'm curious about is when clstr.partitionWriteMap could get a nil node pointer, because the setPartitions() func will validate the new partition map, and an invalid one will not be set on the cluster struct. Could you provide any related code snippet or link?

setPartitions
func (clstr *Cluster) setPartitions(partMap partitionMap) {
    if err := partMap.validate(); err != nil {
        Logger.Error("Partition map error: %s.", err.Error())
    }
    clstr.partitionWriteMap.Store(partMap)
}
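Read literally, the snippet above logs a validation error and then stores the map regardless, so an invalid map would not be rejected at this point. The following self-contained sketch contrasts that control flow with a stricter variant. Everything in it is hypothetical and simplified for illustration: `partitionMap` maps a namespace to owner names, an empty string plays the role of a nil node pointer, and `setLogOnly` and `setStrict` are made-up names, not client functions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// partitionMap is a simplified stand-in: namespace -> owner per partition.
type partitionMap map[string][]string

// validate returns an error for the first partition without an owner.
func (pm partitionMap) validate() error {
	for ns, owners := range pm {
		for i, o := range owners {
			if o == "" {
				return fmt.Errorf("namespace %s: partition %d has no node", ns, i)
			}
		}
	}
	return nil
}

type cluster struct{ current atomic.Value }

// setLogOnly mirrors the control flow quoted above: report, then store.
func (c *cluster) setLogOnly(pm partitionMap) {
	if err := pm.validate(); err != nil {
		fmt.Println("partition map error:", err)
	}
	c.current.Store(pm)
}

// setStrict is a hypothetical alternative that keeps the previous map
// whenever the new one fails validation.
func (c *cluster) setStrict(pm partitionMap) error {
	if err := pm.validate(); err != nil {
		return fmt.Errorf("partition map rejected: %w", err)
	}
	c.current.Store(pm)
	return nil
}

func main() {
	good := partitionMap{"test": {"A", "B"}}
	bad := partitionMap{"test": {"A", ""}}

	var loose, strict cluster
	loose.setLogOnly(good)
	loose.setLogOnly(bad) // logged, but stored anyway

	_ = strict.setStrict(good)
	err := strict.setStrict(bad) // rejected; the good map stays in place

	fmt.Println(loose.current.Load().(partitionMap).validate() != nil)
	fmt.Println(err != nil, strict.current.Load().(partitionMap).validate() == nil)
}
```

Whether rejecting an incomplete map is actually preferable is a design question: keeping a stale but complete map can route requests to nodes that no longer own a partition, so this sketch shows the difference in behavior rather than a recommended fix.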
khaf commented 3 years ago

Sorry I've been very slow responding to this and related issues. I was pushing to release a new client. I'll dedicate next week to the partitionMap issues. Please keep the feedback coming.

xqzhang2015 commented 3 years ago

@khaf thank you so much for your attention.

khaf commented 3 years ago

I have added Cluster.Healthy() in v5.2.0. Let me know if that helps.