cosmos / iavl

Merkleized IAVL+ Tree implementation in Go
Apache License 2.0
423 stars 264 forks source link

IAVL crashes with "value missing hash" under high load #240

Closed p4u closed 4 years ago

p4u commented 4 years ago

When performing a benchmark under high load operation, IAVL panics. We have seen this error on different servers and many times already. Looks like there is something missing on the IAVL tree, but not sure what. After re-launching the program everything goes back to normal and IAVL works fine.

panic: Value missing for hash 5c762dd92f621bd3df4f93cdd153d5639590de63f8be117c317615faab68c0c6 corresponding to nodeKey n\v-�/b�O���S�c���c��|1v��h��

goroutine 1452118376 [running]:
github.com/tendermint/iavl.(*nodeDB).GetNode(0xc001fd0600, 0xc0724f3ea0, 0x20, 0x20, 0x0)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/nodedb.go:109 +0x7bf
github.com/tendermint/iavl.(*Node).getRightNode(...)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:417
github.com/tendermint/iavl.(*Node).get(0xc071dd45a0, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0xffffffffffffffff, 0xc0c934ad40, 0xc0475daff2)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:174 +0x2f2
github.com/tendermint/iavl.(*Node).get(0xc071dd4500, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0x1, 0x1, 0x30ab980)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:172 +0x1ac
github.com/tendermint/iavl.(*Node).get(0xc071c8bf40, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0xffffffffffffffff, 0xc017393410, 0xe8b266)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:175 +0x26f
github.com/tendermint/iavl.(*Node).get(0xc071f69cc0, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0x1, 0xf, 0xc0054bea00)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:172 +0x1ac
github.com/tendermint/iavl.(*Node).get(0xc071f69c20, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x90, 0x90, 0x81, 0xc021c80c60)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:175 +0x26f
github.com/tendermint/iavl.(*ImmutableTree).Get(0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0x90, 0xc0173934a8, 0xf6cf71)
    /go/pkg/mod/github.com/tendermint/iavl@v0.13.3/immutable_tree.go:157 +0x5a
gitlab.com/vocdoni/go-dvote/vochain.(*State).Envelope(0xc0341e7380, 0xc021c80bd0, 0x81, 0x1, 0x2, 0xc021c80bd0, 0x81)
    /src/vochain/iavl.go:477 +0xd3
gitlab.com/vocdoni/go-dvote/router.(*Router).getEnvelopeStatus(0xc0267c2690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /src/router/voteCallbacks.go:98 +0x22c
created by gitlab.com/vocdoni/go-dvote/router.(*Router).Route
    /src/router/router.go:284 +0x429
tessr commented 4 years ago

Hello! Thanks for giving us a heads up on this. We'll take a look ASAP.

p4u commented 4 years ago

Thanks for the quick reply @tessr

That could be a data race in our side. We are investigating it... we'll report more information once available.

erikgrinaker commented 4 years ago

Yeah, I don't believe IAVL is concurrency-safe (although I'm not 100% sure), so if the application is making concurrent accesses - in particular writes - then I believe you'd see problems like this under load. Are you using IAVL concurrently? Do you see the same problems if you put an RWMutex around it?

p4u commented 4 years ago

We are using a RWMutex already:

However we suspect that the fact that we do use pointers

app.State.IAppTree = app.State.AppTree.ImmutableTree

Could raise a data race when app.State.AppTree changes.

erikgrinaker commented 4 years ago

Ah - yes, I think that might be a cause. Let us know what you find, and if there's a problem with IAVL itself it would be very helpful with a minimal reproducible case.

mvdan commented 4 years ago

@erikgrinaker just to double check - if one is running write operations on the mutable tree, such as tree.Set, it's not safe to concurrently call read operations on its immutable tree, such as tree.ImmutableTree.Get. I assume this is the case because, from reading your code, I can see that a mutable tree operation can modify the immutable tree, such as modifying the immutable tree's root node.

If the above is correct, then it's very probably a data race on our side. We do have a mutex guarding the immutable tree, like @p4u mentioned, but the write operations on the mutable tree aren't guarded by the same mutex. So we might end up calling tree.Set and tree.ImmutableTree.Get concurrently.

erikgrinaker commented 4 years ago

I'm not too familiar with the details here, but that would not surprise me - for example when removing orphan nodes or pruning old versions. To be safe, I would use a mutex for all operations against the entire IAVL store.

We should really specify the exact concurrency semantics for IAVL, and maybe down the line implement some form of concurrency control.

mvdan commented 4 years ago

Thanks - sounds good with me. I was going to suggest that the docs be clarified a bit, for now. The immutable tree does say Note that this tree is not thread-safe., but we didn't understand that to also mean operations on the parent mutable tree.

erikgrinaker commented 4 years ago

For sure, we should clarify what the intended behavior here is. It is sort of unintuitive that an immutable tree is, well, mutable.

mvdan commented 4 years ago

Yes, that's a good point. I think that's what caused confusion for us - the ImmutableTree name seems to imply that it's a completely immutable and static snapshot of a tree, when it's not. It's meant for immutable use, but it itself is not immutable.

erikgrinaker commented 4 years ago

I'll close this for now, but do let us know if you still see problems even with a mutex around all operations.

p4u commented 4 years ago

So I can confirm that the issue was in our side due a data race.

It's also fixed in our side and after running several benchmarks, no Panic found.

Thanks for your time and help @erikgrinaker