cosmos / cosmos-sdk

Low level db error can cause app-hash mismatch #12012

Closed yihuang closed 11 months ago

yihuang commented 2 years ago

Summary of Bug

During tx delivery, when an error occurs in the low-level DB, for example "too many open files", tm-db turns it into a panic, which is recovered by the tx runner and treated as a failed tx execution result. This leads to state inconsistency and manifests as an app-hash mismatch error in the next block. Maybe it's better to stop processing the current block in such cases.

Edit: A new case that we think may be related to this: https://github.com/cosmos/cosmos-sdk/issues/12012#issuecomment-1308209563


faddat commented 2 years ago

Hey guys, I would not consider this a bug.

Instead, I would include directions on setting the open-files limit. In most Linux distributions the command to set it is:

ulimit -n 500000

That will set it for the length of the session.

It can also be set in systemd units and in the system file /etc/security/limits.conf.
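
As a minimal sketch of those persistent settings (the unit name and user name below are placeholders, not part of any real node setup):

# Drop-in for a hypothetical systemd unit, e.g. /etc/systemd/system/mynode.service.d/limits.conf
[Service]
LimitNOFILE=500000

# /etc/security/limits.conf entries for a hypothetical "mynode" user
mynode  soft  nofile  500000
mynode  hard  nofile  500000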

yihuang commented 2 years ago

Maybe not a bug, just an improvement, since the app-hash mismatch situation is much harder to recover from.

I'm not sure if there are other low level errors which could trigger the same issue.

faddat commented 2 years ago

Maybe we can add a check?

This is actually frequently encountered. The scenario you are describing here is very familiar to me. I run into it basically any time I have failed to raise the file limit. Some interesting feedback on this can be found in my performance branch of tm-db, which is basically a combination of Terra's performance branch and some additional tweaks I have made.

Insight:

maybe the SDK should check whether that limit is above 33,000.

We use the 500,000 value on servers that run many nodes for relaying.
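
A startup check along those lines could be a small guard; here is a minimal sketch using Go's standard syscall package (the function name and warning behavior are illustrative, not an existing SDK hook):

package main

import (
	"fmt"
	"syscall"
)

// checkOpenFileLimit returns an error if the process's soft RLIMIT_NOFILE is
// below min. The 33,000 threshold mirrors the figure suggested above.
func checkOpenFileLimit(min uint64) error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return fmt.Errorf("query RLIMIT_NOFILE: %w", err)
	}
	if lim.Cur < min {
		return fmt.Errorf("open-file limit %d is below the recommended minimum %d; raise it with ulimit -n or LimitNOFILE=", lim.Cur, min)
	}
	return nil
}

func main() {
	if err := checkOpenFileLimit(33000); err != nil {
		fmt.Println("warning:", err)
	}
}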

yihuang commented 2 years ago

Do you think there are other low-level DB errors that should not be ignored at the consensus level? I think we should separate out some system-level errors and stop the node rather than commit a wrong block state.

The error message we see:

recovered: can't get node 6953892A1BDD69D87894970AE3CD3C04B25C9F3CA7774AF1ACD8DBF58ABE4F5D: open /chain/.cronosd/data/application.db/535337124.ldb: too many open files

ValarDragon commented 2 years ago

I think the general reaction to a DB error should be to halt the node.

The particular case of too many open files is a node-operator-level error, but the general case of the many DB errors that can occur is not.

yihuang commented 2 years ago

Should we change tm-db's APIs to return errors, and have those errors abort the node?

peterbourgon commented 2 years ago

I think the general reaction to a DB error should be to halt the node.

The particular case of too many open files is a node-operator-level error, but the general case of the many DB errors that can occur is not.

Unfortunately, it's not possible for the SDK to effectively and/or safely halt the node.

Calling panic should terminate the process and halt the node, of course. But because panic is used throughout the ecosystem as a normal way to signal errors, developers necessarily must use recover as a normal way to capture and handle errors, and effectively no deferred recover re-raises the panics it catches. That means code can't make any assumptions about the effect of a panic, beyond that it exits the current call frame.

This unpredictable behavior has real consequences! I've been running some full nodes across several Cosmos-based networks recently. Stopping and re-starting those processes is always a highly stressful experience. The SIGINT isn't caught and managed in any well-defined way, and essentially triggers panics, errors, or nothing at all, across all active call stacks. If one of those call stacks happens to be in the middle of an ABCI Commit, or midway through persisting some state to disk, or any one of a dozen more fragile operations like the one identified in this issue, then the state gets corrupted, almost always in a non-recoverable way, and I have to rm -rf and start again from a snapshot.

More generally, errors encountered when operations cross an abstraction layer — e.g. DB errors, among many others — are, like most other runtime errors, totally normal. A read transaction can time out because the underlying resource is busy with other callers. A write transaction can fail because the optimistic concurrency assumptions made by the query planner don't hold. A remote read or write can fail because of arbitrary and random latency at one of hundreds of potential levels of translation. These situations aren't exceptional, and they almost never signal malice or Byzantine behavior. There's no reason for the SDK to halt the chain if a DB write times out, for example. Doing so makes the SDK fundamentally unreliable software.
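
To make the recover point concrete, a standalone sketch (not SDK code) of how a mid-stack recover swallows a panic from a lower layer, so the lower layer cannot rely on panicking to stop the process:

package main

import "fmt"

// lowLevelWrite stands in for a store operation that hits a DB fault and,
// lacking an error return, panics.
func lowLevelWrite() {
	panic("db: too many open files")
}

// runTx stands in for a tx runner that recovers panics and reports them as a
// failed tx, the pattern described in this issue.
func runTx() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("tx failed: %v", r)
		}
	}()
	lowLevelWrite()
	return nil
}

func main() {
	// The process keeps running; the DB fault has been silently downgraded
	// to an ordinary tx failure.
	fmt.Println(runTx())
}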

alexanderbez commented 2 years ago

But because panic is used throughout the ecosystem as a normal way to signal errors

What do you mean by ecosystem here? The broader Go ecosystem, or the SDK? If the latter, that's not the case. Panics are only triggered when the desired effect is to halt the node, the exception being message execution (where a panic indicates message failure); otherwise, e.g. a panic in EndBlock, the node will gracefully halt.

I've been running some full nodes across several Cosmos-based networks recently. Stopping and re-starting those processes is always a highly stressful experience.

How so? I've never had any issues restarting nodes. I typically restart nodes multiple times a day without any issues.

The SIGINT isn't caught and managed in any well-defined way, and essentially triggers panics, errors, or nothing at all, across all active call stacks.

This is not true. SIGINT is captured in a single place.

peterbourgon commented 2 years ago

What do you mean by ecosystem here? The broader Go ecosystem, or the SDK? If the latter, that's not the case. Panics are only triggered when the desired effect is to halt the node,

Panics are absolutely used as an alternative way to signal errors throughout Tendermint and the SDK. Consider the BasicKVStore interface:

// BasicKVStore is a simple interface to get/set data
type BasicKVStore interface {
    // Get returns nil if key doesn't exist. Panics on nil key.
    Get(key []byte) []byte

    // Has checks if a key exists. Panics on nil key.
    Has(key []byte) bool

    // Set sets the key. Panics on nil key or value.
    Set(key, value []byte)

    // Delete deletes the key. Panics on nil key.
    Delete(key []byte)
}

I was prepared to write a fair bit more to explain the details of this example. About how all of these operations are in fact fallible, and how fallible operations in Go are idiomatically modeled by (..., error) returns, and why this way of modeling fallible operations is important, and so on — but the comments here really do all of the work for me 😉 All of these methods can fail for arbitrary, implementation-specific reasons. They should all return errors. Providing a nil key to a Get method is perhaps a bug, but it's surely not a condition that should terminate the process. And this is one of countless examples.

the exception being message execution (where a panic indicates message failure); otherwise, e.g. a panic in EndBlock, the node will gracefully halt.

First, while the SDK can issue a panic, it can't make any assertions about the effects of that panic on the node. That's because the SDK is always encapsulated by the application, and applications routinely recover from panics as a matter of practical necessity.

Second, even if the SDK could reliably assert that a panic would actually bubble up to the main goroutine of execution, by definition that can't be assumed to result in graceful termination of the process. Just at a base level, because that panic is invisible to any goroutines spawned from outside of the panicking call stack; those goroutines will, practically, be hard-killed as a result. But in addition, because Go code in general can't make any assumptions about panic recovery beyond its API borders. Specifically, I mean that a Cosmos app is never obliged, and can't be assumed, to recover from panics thrown by the SDK. That's an invariant defined by the language. And since the application owns the lifecycle of many of the important stateful components in a node, the SDK unfortunately can't make any assertions about how panics will affect them.
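
A standalone sketch of the goroutine point (again, not SDK code): panic and recover do not cross goroutine boundaries, so a recover in one goroutine never sees a panic from another, and an unrecovered panic anywhere terminates every goroutine in the process.

package main

import (
	"fmt"
	"time"
)

func main() {
	// This recover never sees the panic below: recover only applies to
	// panics raised in the same goroutine.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()

	go func() {
		panic("panic in a spawned goroutine")
	}()

	// The unrecovered panic in the spawned goroutine terminates the entire
	// process; this sleep never completes.
	time.Sleep(time.Second)
}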

How so? I've never had any issues restarting nodes. I typically restart nodes multiple times a day without any issues . . . SIGINT is captured in a single place.

I guess I'm not sure how to respond. Which networks do you run? Would a collection of my journald logs — chock full of consensus faults, app and chain hash failures, and all manner of errors from the DB and state layers — be convincing?

Orion-9R commented 2 years ago

Any Update on this?

I am a community dev on a cosmos-sdk chain. Full nodes under load are reporting the app-hash mismatch and have to re-sync.

alexanderbez commented 2 years ago

We'll need to think carefully about how to handle various DB-related errors when executing txs -- in the meantime, I would advise bumping the file limit on your system as others have pointed out @Orion-9R :)

Orion-9R commented 2 years ago

We'll need to think carefully about how to handle various DB-related errors when executing txs -- in the meantime, I would advise bumping the file limit on your system as others have pointed out @Orion-9R :)

Keen to find a solution - it's a burden on our ecosystem to have to kick full nodes regularly. The default is 1,000,000 for nodes and they are nowhere close to that, so I think there is something else going on.

yihuang commented 2 years ago

We'll need to think carefully about how to handle various DB-related errors when executing txs -- in the meantime, I would advise bumping the file limit on your system as others have pointed out @Orion-9R :)

Keen to find a solution - it's a burden on our ecosystem to have to kick full nodes regularly. The default is 1,000,000 for nodes and they are nowhere close to that, so I think there is something else going on.

There could also be some genuinely non-deterministic logic; you can try to investigate the execution result of the block that produced the mismatched app hash.

alexanderbez commented 2 years ago

Yes, that too! My initial inclination is to type-check the error in the panic recovery logic of tx execution -- if the error is a specific DB fault, then we truly panic.
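
A rough sketch of that inclination (the isDBFault heuristic and handleTxPanic are hypothetical names, not existing SDK APIs): inside the tx runner's deferred recover, classify the recovered value and re-panic on low-level storage faults instead of reporting a failed tx.

package txrecovery

import (
	"errors"
	"os"
	"syscall"
)

// isDBFault is a heuristic: treat OS-level errors such as EMFILE ("too many
// open files") or path errors on the DB files as storage faults.
func isDBFault(recovered interface{}) bool {
	err, ok := recovered.(error)
	if !ok {
		return false
	}
	var errno syscall.Errno
	if errors.As(err, &errno) && errno == syscall.EMFILE {
		return true
	}
	var pathErr *os.PathError
	return errors.As(err, &pathErr)
}

// handleTxPanic would run inside the recovery logic of tx execution: DB
// faults escape and halt the node; anything else stays a failed tx.
func handleTxPanic(recovered interface{}) error {
	if isDBFault(recovered) {
		panic(recovered)
	}
	return errors.New("tx failed")
}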

yihuang commented 1 year ago

We found a new case that may be related to this: in some rare cases a node gets an app-hash mismatch without any error logs, and when we check the commit info we find that one of the stores committed an empty IAVL tree. It could be triggered by a low-level DB error in the set operation, where the IAVL root is reset to nil while the error is ignored by the SDK, so eventually it commits the empty root.

peterbourgon commented 1 year ago

while the error is ignored by the SDK

Neither AssertValidKey nor AssertValidValue returns an error. (They can panic, but panics aren't errors, and shouldn't be managed like errors via, e.g., recover.)

yihuang commented 1 year ago

while the error is ignored by the SDK

Neither AssertValidKey nor AssertValidValue returns an error. (They can panic, but panics aren't errors, and shouldn't be managed like errors via, e.g., recover.)

The line number changed because the link is not a permanent one; I was referring to the st.tree.Set line. An error log was added yesterday as a conservative solution, but maybe we should just panic.

peterbourgon commented 1 year ago

Can you link to what you're actually referring to?

yihuang commented 1 year ago

Can you link to what you're actually referring to?

https://github.com/cosmos/cosmos-sdk/blob/61effe82603006a7192cb311787ca8fc776a4461/store/iavl/store.go#L203

peterbourgon commented 1 year ago

Right. This goes back to my previous comments, I guess. The only viable solution here is to change the KVStore interface so that its methods return errors.

(Details!) The fundamental problem here is that the BasicKVStore interface defines methods that don't return errors. That means those methods are defined as infallible: the KVStore asserts that a Get will always successfully return a []byte, a Set will always successfully set the key to the val, etc. Of course this isn't true, as you're experiencing right now. But when an implementation has an error, the KVStore interface doesn't allow you to report that error correctly, i.e. return an error to your caller. You have to either ignore the error (always wrong) or express it as a panic. Panics are generally meant to crash the process immediately. (Some exceptions exist for packages that manage panics at API boundaries, but this pattern is non-idiomatic, broadly discouraged, and only works in narrow circumstances which are definitely not satisfied by the SDK.) Your only option is to panic, but because panics are so widely misused as ersatz errors, recovers are littered basically everywhere, which means panic has no reliable outcome, which means there's no real way to solve this problem 😞
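
For illustration, the fallible counterpart could look roughly like this (hypothetical interface, not the current SDK definition):

// FallibleKVStore is a hypothetical error-returning counterpart to BasicKVStore
type FallibleKVStore interface {
    // Get returns (nil, nil) if the key doesn't exist, and a non-nil error on any storage failure
    Get(key []byte) ([]byte, error)

    // Has reports whether a key exists
    Has(key []byte) (bool, error)

    // Set stores value under key
    Set(key, value []byte) error

    // Delete removes the key
    Delete(key []byte) error
}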

yihuang commented 1 year ago

Right. This goes back to my previous comments, I guess. The only viable solution here is to change the KVStore interface so that its methods return errors.

(Details!)

Yeah, I agree with that. And practically, stopping the consensus state machine immediately and causing an app-hash mismatch in the next block both halt the node, but the former is easier to recover from: you just fix the issue and restart.

peterbourgon commented 1 year ago

But how do you reliably stop the state machine or cause an app hash mismatch? AFAIK there's no way. Panic doesn't do it.

yihuang commented 1 year ago

But how do you reliably stop the state machine or cause an app hash mismatch? AFAIK there's no way. Panic doesn't do it.

A panic in the ABCI event handlers will cause Tendermint to output a CONSENSUS FAILURE log and stop the consensus state machine. A panic in a message handler, however, will be recovered and handled as a tx failure. And by "app hash mismatch" I mean that as long as the node ends up with a different state than the other nodes, it gets an "app hash mismatch" and halts; for DB errors, either ignoring the error or turning it into a tx failure ends up with an inconsistent state, and the node halts with an "app hash mismatch" failure.

peterbourgon commented 1 year ago

A panic in the ABCI event handlers will cause Tendermint to output a CONSENSUS FAILURE log and stop the consensus state machine.

This code is exported, so it can be called by any consumer, not just ABCI event handlers. And panics that traverse exported API boundaries like this one have undefined behavior: callers can intercept them before they reach upper layers. So basically you can't assume these things. If an implementation of KVStore.Set has an error, logging that error and continuing will put the store in an invalid state. Can't do that. Panicking is preferable, though even that doesn't get you reliable guarantees.

yihuang commented 1 year ago

A panic in the ABCI event handlers will cause Tendermint to output a CONSENSUS FAILURE log and stop the consensus state machine.

This code is exported, so it can be called by any consumer, not just ABCI event handlers. And panics that traverse exported API boundaries like this one have undefined behavior: callers can intercept them before they reach upper layers. So basically you can't assume these things. If an implementation of KVStore.Set has an error, logging that error and continuing will put the store in an invalid state. Can't do that. Panicking is preferable, though even that doesn't get you reliable guarantees.

I think it's part of the ABCI protocol that if the Commit event handler (I guess any event on the consensus connection) doesn't return successfully, it won't proceed.

peterbourgon commented 1 year ago

KVStore methods are not necessarily called from ABCI event handlers.

Also, abci.Application defines no expectations re: panics as far as I can see; I'd be happy to see some docs otherwise! In any case, if a function doesn't return an error, then it's always successful by definition; panics aren't errors.

alexanderbez commented 1 year ago

Panics triggered (and not recovered) in non-tx-execution flows, e.g. BeginBlock, halt the node and prohibit it from proceeding, allowing a debugging procedure to happen. Conversely, during the tx flow, they are recovered and thus make it difficult to debug, which is what @yihuang is alluding to.

yihuang commented 1 year ago

Panics triggered (and not recovered) in non-tx-execution flows, e.g. BeginBlock, halt the node and prohibit it from proceeding, allowing a debugging procedure to happen. Conversely, during the tx flow, they are recovered and thus make it difficult to debug, which is what @yihuang is alluding to.

Do you think we should panic on an IAVL set error?

alexanderbez commented 1 year ago

Yeah, if we're setting an invalid key/value pair, I think we should indeed panic.
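
A sketch of what that would mean at the call site flagged earlier (assuming the underlying tree's Set returns an error, as in recent iavl versions; this fragment is illustrative, not the actual store/iavl code):

// Illustrative fragment: escalate a low-level tree error instead of only
// logging it, so the node halts before committing a bad root.
func (st *Store) Set(key, value []byte) {
    types.AssertValidKey(key)
    types.AssertValidValue(value)
    if _, err := st.tree.Set(key, value); err != nil {
        panic(err)
    }
}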

peterbourgon commented 1 year ago

(Just want to observe that Set errors don't necessarily mean the key/val pair is invalid, they can also indicate invariant violations in the tree itself, problems during rebalancing, etc. But I suspect this is a meaningless distinction, as any code that's expected to panic on bad input should probably also panic on any other error, too.)

tac0turtle commented 11 months ago

Closing this since there are ways to recover from app-hash mismatches without having to resync.

yihuang commented 11 months ago

Closing this since there are ways to recover from app-hash mismatches without having to resync.

But it's still hard to diagnose the root cause of an app-hash mismatch: if it's caused by a system error, we can roll back and retry, but if it's non-deterministic logic, it's much more serious, and it's hard to tell at first glance.