etcd-io / raft

Raft library for maintaining a replicated state machine
Apache License 2.0

Panic at commit 10,001 #119

Closed alexanderturner closed 8 months ago

alexanderturner commented 9 months ago

I'm using this implementation of Raft in an application that is largely based on the example. It's running across three nodes and has no issues maintaining quorum, etc.

After an extended period, the nodes begin to panic with a nil pointer dereference when trying to RLock the Mutex protecting getSnapshot() in kvstore.

Strangely, this has been challenging to reproduce; the only curious indicator is that the panic happens at commit 10,001, which leads me to suspect a storage issue or similar. Are there any known issues that could cause this? Apologies for the issue, just wanted to get it in front of the right eyes.

ahrtr commented 9 months ago

After an extended period, the nodes begin to panic with a nil pointer dereference when trying to RLock the Mutex protecting getSnapshot() in kvstore.

Looks like an etcd issue instead of a raft issue? Please provide more detailed info.

alexanderturner commented 8 months ago

Thanks @ahrtr

I'm recreating with:

func (a *App) Start(ctx context.Context) error {
    raftNode := make(chan raft.Raft)
    var wg sync.WaitGroup
    go raft.New(ctx, a.Config.Join, a.Config.Cluster, a.Config.ID, raftNode, &wg, a.Config.DataDir)
    a.Node = <-raftNode

    for {
        if a.Node.GetStatus() == "StateLeader" {
            time.Sleep(1000 * time.Second)
            fmt.Println("Leader node!")
            for i := 1; i < 200000; i++ {
                key := fmt.Sprintf("key%d", i)
                value := fmt.Sprintf("value%d", i)
                fmt.Printf("key added %s\n", key)
                a.Node.Propose(key, value)
            }
            break
        }
    }
    return nil
}

I'm seeing

 panic: runtime error: invalid memory address or nil pointer dereference
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x922a88]

 goroutine 1 [running]:
 src/raft.(*RaftNode).GetStatus(0xf48180?)
    /app/src/raft/raft.go:585 +0x28
 src/app.(*App).Start(0xc000173db0, {0xf4e598?, 0xc00048a2d0})
    /app/src/app/app.go:50 +0x2d5
 main.main()
    /app/main.go:133 +0x875

Where raft.go:585 is

 func (rc *RaftNode) GetStatus() string {
     return rc.node.Status().SoftState.RaftState.String()
 }

Something in that chain is nil, which causes the panic. I'm running go.etcd.io/etcd/raft/v3 v3.5.10 with Go 1.21 on Alpine.

ahrtr commented 8 months ago

After an extended period, the nodes begin to panic with a nil pointer dereference when trying to RLock the Mutex protecting getSnapshot() in kvstore.

When I saw "kvstore", I thought it was etcd's kvstore; it turned out to be a concept in your application. So this is neither a raft issue nor an etcd issue. It's an application issue.

Based on your input, rc.node is nil or an invalid pointer when (*RaftNode) GetStatus() is called. Most likely this is caused by a race condition, e.g. the raft node has somehow been stopped. Please investigate your application.
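For illustration, a minimal sketch of the kind of defensive check described above. The types here (SoftState, Status, fakeNode, and the "StateUnknown" sentinel) are stand-ins mirroring the snippet in this thread, not the real go.etcd.io/raft API:

```go
package main

import "fmt"

// Stand-in types mirroring the snippet above -- assumptions for
// illustration, not the real go.etcd.io/raft API.
type SoftState struct{ RaftState string }

type Status struct{ SoftState }

type node interface{ Status() Status }

type RaftNode struct{ node node }

// GetStatus guards the dereference that panicked at raft.go:585: if rc.node
// is nil (e.g. the raft node was stopped or never started), return a
// sentinel instead of calling Status() on a nil interface.
func (rc *RaftNode) GetStatus() string {
	if rc == nil || rc.node == nil {
		return "StateUnknown"
	}
	return rc.node.Status().SoftState.RaftState
}

// fakeNode simulates a healthy node for the example.
type fakeNode struct{}

func (fakeNode) Status() Status {
	return Status{SoftState{RaftState: "StateLeader"}}
}

func main() {
	stopped := &RaftNode{}                 // node never assigned, or torn down by a racing goroutine
	running := &RaftNode{node: fakeNode{}} // healthy node
	fmt.Println(stopped.GetStatus())       // prints "StateUnknown" instead of panicking
	fmt.Println(running.GetStatus())       // prints "StateLeader"
}
```

A guard like this only masks the symptom; the actual fix is synchronizing shutdown in the application so GetStatus() is never called after the raft node has been stopped.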