Closed alexanderturner closed 8 months ago
After an extended period, the nodes begin to panic with a nil pointer dereference when trying to RLock the mutex protecting getSnapshot() in kvstore.
This looks like an etcd issue rather than a raft issue? Please provide more detailed info.
Thanks @ahrtr
I'm recreating with:
func (a *App) Start(ctx context.Context) error {
	var wg sync.WaitGroup // wg was undeclared in the snippet as posted
	raftNode := make(chan raft.Raft)
	go raft.New(ctx, a.Config.Join, a.Config.Cluster, a.Config.ID, raftNode, &wg, a.Config.DataDir)
	a.Node = <-raftNode
	for {
		if a.Node.GetStatus() == "StateLeader" {
			time.Sleep(1000 * time.Second)
			fmt.Println("Leader node!")
			for i := 1; i < 200000; i++ {
				key := fmt.Sprintf("key%d", i)
				value := fmt.Sprintf("value%d", i) // was "key%d", presumably a copy-paste slip
				fmt.Printf("key added %s\n", key)  // key is a string, so %s rather than %d
				a.Node.Propose(key, value)
			}
			break
		}
	}
	return nil // Start is declared to return error
}
I'm seeing
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x922a88]
goroutine 1 [running]:
src/raft.(*RaftNode).GetStatus(0xf48180?)
/app/src/raft/raft.go:585 +0x28
src/app.(*App).Start(0xc000173db0, {0xf4e598?, 0xc00048a2d0})
/app/src/app/app.go:50 +0x2d5
main.main()
/app/main.go:133 +0x875
Where raft.go:585 is
func (rc *RaftNode) GetStatus() string {
return rc.node.Status().SoftState.RaftState.String()
}
Something in that chain is nil, causing the panic. I'm running go.etcd.io/etcd/raft/v3 v3.5.10
on Alpine (Go 1.21).
After an extended period, the nodes begin to panic with a nil pointer dereference when trying to RLock the mutex protecting getSnapshot() in kvstore.
When I saw "kvstore", I was thinking it was etcd's kvstore. It turned out to be just the user application's concept. So this is neither a raft issue nor an etcd issue; it should be an application issue.
Based on your input, rc.node is nil or an invalid pointer when (*RaftNode).GetStatus() is called. Most likely it's caused by a race condition, e.g. the raft node has somehow been stopped. Please investigate your application.
I'm using this implementation of Raft in an application which is largely based on the example. It's running across three nodes and has no issues maintaining quorum, etc.
After an extended period, the nodes begin to panic with a nil pointer dereference when trying to RLock the mutex protecting getSnapshot() in kvstore.
Strangely, this has been difficult to reproduce; the only curious indicator is the commit being 10,001, which leads me to suspect a storage issue or similar. Are there any known issues which could cause this? Apologies for the issue, just wanted to get it in front of the right eyes.