Closed: yihuang closed this 11 months ago
Hey guys I would not consider this a bug.
Instead, I would include directions for setting the open-files limit. On most Linux distributions the command is:
ulimit -n 500000
That sets it for the length of the session.
It can also be set in systemd units, and system-wide in /etc/security/limits.conf.
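For persistence, the two mechanisms mentioned above look roughly like this (the 500000 value is just an example; `LimitNOFILE` is the systemd directive and `nofile` is the limits.conf item):

```
# In the systemd unit, under [Service]:
LimitNOFILE=500000

# In /etc/security/limits.conf (soft and hard limits; * = all users):
*  soft  nofile  500000
*  hard  nofile  500000
```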
Maybe not a bug, just an improvement, since the app-hash mismatch situation is much harder to recover from.
I'm not sure if there are other low level errors which could trigger the same issue.
Maybe we can add a check?
This is actually frequently encountered. The scenario you are describing here is very familiar to me; I run into it basically any time I have failed to set the file descriptor limit. Some interesting feedback on this can be found in my performance branch of tm-db, which is basically the combination of Terra's performance branch and some additional tweaks that I have made.
Insight:
Maybe the SDK should check that the limit is above 33,000.
We use the 500,000 value on servers that run many nodes for relaying.
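As a sketch of what such a startup check could look like (the 33,000 threshold, function names, and messages here are assumptions for illustration, not actual SDK behavior), assuming a Linux/Unix target:

```go
package main

import (
	"fmt"
	"syscall"
)

// minOpenFiles is the hypothetical threshold suggested above.
const minOpenFiles = 33000

// checkFDLimit reads the process's RLIMIT_NOFILE soft limit and reports
// whether it is high enough to safely run a node.
func checkFDLimit() (uint64, bool, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, false, err
	}
	return uint64(rl.Cur), uint64(rl.Cur) >= minOpenFiles, nil
}

func main() {
	cur, ok, err := checkFDLimit()
	if err != nil {
		fmt.Println("cannot read RLIMIT_NOFILE:", err)
		return
	}
	if !ok {
		fmt.Printf("open-files limit is %d; raise it (e.g. ulimit -n 500000) before starting the node\n", cur)
		return
	}
	fmt.Println("open-files limit is sufficient:", cur)
}
```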
Do you think there are other low-level DB errors that should not be ignored at the consensus level? I think we should separate out some system-level errors and stop the node rather than commit a wrong block state.
The error message we see:
recovered: can't get node 6953892A1BDD69D87894970AE3CD3C04B25C9F3CA7774AF1ACD8DBF58ABE4F5D: open /chain/.cronosd/data/application.db/535337124.ldb: too many open files
I think the general reaction to a DB error should be to halt the node.
The particular case of too many open files is a node-operator-level error, but the general case of the many DB errors that can occur is not.
Should we change tm-db's apis to return error, and those errors should abort the node?
> I think the general reaction to a DB error should be to halt the node.
> The particular case of too many open files is a node-operator-level error, but the general case of the many DB errors that can occur is not.
Unfortunately, it's not possible for the SDK to effectively and/or safely halt the node.
Calling panic should terminate the process and halt the node, of course. But because panic is used throughout the ecosystem as a normal way to signal errors, developers must use recover as a normal way to capture and handle errors, and in practice no deferred recover re-raises the panics it catches. That means code can't make any assumptions about the effect of a panic beyond that it exits the current call frame.
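The pattern described here can be sketched in a few lines; `fetch` and `safeFetch` are hypothetical stand-ins for a store read and an intermediate recovery layer, not real SDK functions:

```go
package main

import "fmt"

// fetch stands in for a store read that signals a low-level failure by
// panicking, the way tm-db-backed stores do.
func fetch(fail bool) []byte {
	if fail {
		panic(fmt.Errorf("open application.db: too many open files"))
	}
	return []byte("value")
}

// safeFetch stands in for an intermediate layer that recovers the panic
// and converts it into an ordinary error, so the panic never reaches the
// top of the process, let alone halts the node.
func safeFetch(fail bool) (v []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			v, err = nil, fmt.Errorf("recovered: %v", r)
		}
	}()
	return fetch(fail), nil
}

func main() {
	v, err := safeFetch(true)
	fmt.Println(v, err) // the fatal condition is now just a normal error
}
```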
This unpredictable behavior has real consequences! I've been running some full nodes across several Cosmos-based networks recently. Stopping and re-starting those processes is always a highly stressful experience. The SIGINT isn't caught and managed in any well-defined way, and essentially triggers panics, errors, or nothing at all, across all active call stacks. If one of those call stacks happens to be in the middle of an ABCI Commit, or midway through persisting some state to disk, or any one of a dozen more fragile operations like the one identified in this issue, then the state gets corrupted, almost always in a non-recoverable way, and I have to rm -rf and start again from a snapshot.
More generally, errors encountered when operations cross an abstraction layer — e.g. DB errors, among many others — are, like most other runtime errors, totally normal. A read transaction can timeout because the underlying resource is busy with other callers. A write transaction can fail because the optimistic concurrency assumptions made by the query planner don't hold. A remote read or write can fail because of arbitrary and random latency at one of hundreds of potential levels of translation. These situations aren't exceptional, and they almost never signal malice or Byzantine behavior. There's no reason for the SDK to halt the chain if a DB write times out, for example. Doing so makes the SDK fundamentally unreliable software.
> But because panic is used throughout the ecosystem as a normal way to signal errors
What do you mean by ecosystem here? The broader Go ecosystem, or the SDK? If the latter, that's not the case. Panics are only triggered when the desired effect is to halt the node; the exception is message execution (where a panic indicates message failure). Otherwise, e.g. on a panic in EndBlock, the node will gracefully halt.
> I've been running some full nodes across several Cosmos-based networks recently. Stopping and re-starting those processes is always a highly stressful experience.
How so? I've never had any issues restarting nodes. I typically restart nodes multiple times a day without any issues.
> The SIGINT isn't caught and managed in any well-defined way, and essentially triggers panics, errors, or nothing at all, across all active call stacks.
This is not true. SIGINT is captured in a single place.
> What do you mean by ecosystem here? The broader Go ecosystem, or the SDK? If the latter, that's not the case. Panics are only triggered when the desired effect is to halt the node,
Panics are absolutely used as an alternative way to signal errors throughout Tendermint and the SDK. Consider the BasicKVStore interface:
```go
// BasicKVStore is a simple interface to get/set data
type BasicKVStore interface {
	// Get returns nil if key doesn't exist. Panics on nil key.
	Get(key []byte) []byte
	// Has checks if a key exists. Panics on nil key.
	Has(key []byte) bool
	// Set sets the key. Panics on nil key or value.
	Set(key, value []byte)
	// Delete deletes the key. Panics on nil key.
	Delete(key []byte)
}
```
I was prepared to write a fair bit more to explain the details of this example: about how all of these operations are in fact fallible, how fallible operations in Go are idiomatically modeled by (..., error) returns, why this way of modeling fallible operations is important, and so on. But the comments here really do all of the work for me 😉 All of these methods can fail for arbitrary, implementation-specific reasons. They should all return errors. Providing a nil key to a Get method is perhaps a bug, but it's surely not a condition that should terminate the process. And this is one of countless examples.
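To make the argument concrete, here is the shape the interface would take if its methods returned errors, with a trivial in-memory implementation (this is a sketch of the proposal; `KVStore` and `memStore` here are hypothetical, not actual SDK types):

```go
package main

import (
	"errors"
	"fmt"
)

// KVStore is a hypothetical error-returning variant of BasicKVStore:
// every fallible operation reports failure through its return value.
type KVStore interface {
	Get(key []byte) ([]byte, error)
	Has(key []byte) (bool, error)
	Set(key, value []byte) error
	Delete(key []byte) error
}

var errNilKey = errors.New("nil key")

// memStore is a minimal in-memory implementation for illustration.
type memStore struct{ m map[string][]byte }

func newMemStore() *memStore { return &memStore{m: map[string][]byte{}} }

func (s *memStore) Get(key []byte) ([]byte, error) {
	if key == nil {
		// A caller bug becomes an inspectable error, not a panic.
		return nil, errNilKey
	}
	return s.m[string(key)], nil
}

func (s *memStore) Has(key []byte) (bool, error) {
	if key == nil {
		return false, errNilKey
	}
	_, ok := s.m[string(key)]
	return ok, nil
}

func (s *memStore) Set(key, value []byte) error {
	if key == nil || value == nil {
		return errors.New("nil key or value")
	}
	s.m[string(key)] = value
	return nil
}

func (s *memStore) Delete(key []byte) error {
	if key == nil {
		return errNilKey
	}
	delete(s.m, string(key))
	return nil
}

func main() {
	var s KVStore = newMemStore()
	if err := s.Set([]byte("k"), []byte("v")); err != nil {
		fmt.Println("set failed:", err)
	}
	if _, err := s.Get(nil); err != nil {
		fmt.Println("get failed:", err) // handled; the process keeps running
	}
}
```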
> the exception being message execution (which indicates message failure), otherwise, e.g. panic in EndBlock, the node will gracefully halt.
First, while the SDK can issue a panic, it can't make any assertions about the effects of that panic on the node. That's because the SDK is always encapsulated by the application, and applications routinely recover from panics as a matter of practical necessity.
Second, even if the SDK could reliably assert that a panic would actually bubble up to the main goroutine of execution, by definition that can't be assumed to result in graceful termination of the process. Just at a base level, because that panic is invisible to any goroutines spawned outside of the panicking call stack; those goroutines will, practically, be hard-killed as a result. But in addition, because Go code in general can't make any assumptions about panic recovery beyond its API borders. Specifically, I mean that a Cosmos app is never obliged, and can't be assumed, to recover from panics thrown by the SDK. That's an invariant defined by the language. And since the application owns the lifecycle of many of the important stateful components in a node, the SDK unfortunately can't make any assertions about how panics will affect them.
> How so? I've never had any issues restarting nodes. I typically restart nodes multiple times a day without any issues . . . SIGINT is captured in a single place.
I guess I'm not sure how to respond. Which networks do you run? Would a collection of my journald logs — chock full of consensus faults, app and chain hash failures, and all manner of errors from the DB and state layers — be convincing?
Any Update on this?
I am a community dev on a cosmos-sdk chain. Full nodes under load are reporting the app-hash mismatch and have to re-sync.
We'll need to think carefully about how to handle various DB-related errors when executing txs -- in the meantime, I would advise bumping the file limit on your system, as others have pointed out @Orion-9R :)
> We'll need to think carefully about how to handle various DB-related errors when executing txs -- in the meantime, I would advise bumping the file limit on your system, as others have pointed out @Orion-9R :)
Keen to find a solution - it's a burden on our ecosystem to have to kick full nodes regularly. The default is 1000000 for nodes and they are nowhere close to that, so I think there is something else going on.
> We'll need to think carefully about how to handle various DB-related errors when executing txs -- in the meantime, I would advise bumping the file limit on your system, as others have pointed out @Orion-9R :)

> Keen to find a solution - it's a burden on our ecosystem to have to kick full nodes regularly. The default is 1000000 for nodes and they are nowhere close to that, so I think there is something else going on.
There could also be some genuinely non-deterministic logic; you can try to investigate the execution result of the block that produced the mismatched app hash.
Yes, that too! My initial inclination is to type-check the error in the panic-recovery logic of tx execution -- if the error is a specific DB fault, then we truly panic.
We found a new case that may be related to this: there are some rare cases where a node gets an app-hash mismatch without any error logs. Checking the commit info, we found that one of the stores committed an empty iavl tree. It could be triggered by a low-level db error in the set operation, where the iavl root is reset to nil while the error is ignored by the sdk, so eventually it commits the empty root.
> while the error is ignored by sdk
Neither AssertValidKey nor AssertValidValue returns an error. (They can panic, but panics aren't errors, and shouldn't be managed like errors via e.g. recover.)
> > while the error is ignored by sdk
>
> Neither AssertValidKey nor AssertValidValue returns an error. (They can panic, but panics aren't errors, and shouldn't be managed like errors via e.g. recover.)
The line number changed because the link is not a permanent one; I was referring to the st.tree.Set line. An error log was added yesterday as a conservative solution; maybe we should just panic.
Can you link to what you're actually referring to?
> Can you link to what you're actually referring to?
Right. This goes back to my previous comments, I guess. The only viable solution here is to change the KVStore interface so that its methods return errors.
> Right. This goes back to my previous comments, I guess. The only viable solution here is to change the KVStore interface so that its methods return errors.
(Details!)
Yeah, I agree with that. And practically speaking, stopping the consensus state machine immediately and causing an app-hash mismatch in the next block both halt the node, but the former is easier to recover from: you just fix the issue and restart.
But how do you reliably stop the state machine or cause an app hash mismatch? AFAIK there's no way. Panic doesn't do it.
> But how do you reliably stop the state machine or cause an app hash mismatch? AFAIK there's no way. Panic doesn't do it.
A panic in the abci event handlers will cause tendermint to output a CONSENSUS FAILURE log and stop the consensus state machine, but a panic in a message handler will be recovered and handled as a tx failure.
And by "app hash mismatch" I mean that as long as the node ends up with a different state than the other nodes, it gets an "app hash mismatch" and halts. For db errors, either ignoring the error or causing a tx failure will end up with an inconsistent state, and the node halts with an "app hash mismatch" failure.
> panic in the abci event handlers will cause tendermint to output CONSENSUS FAILURE log and stop consensus state machine.
This code is exported, so it can be called by any consumer, not just ABCI event handlers. And panics that traverse exported API boundaries like this one have undefined behavior; callers can intercept them before they reach upper layers. So, basically, you can't assume these things. If an implementation of KVStore.Set encounters an error, logging that error and continuing will put the store in an invalid state. Can't do that. Panicking is preferable, though even that doesn't get you reliable guarantees.
> > panic in the abci event handlers will cause tendermint to output CONSENSUS FAILURE log and stop consensus state machine.
>
> This code is exported, so it can be called by any consumer, not just ABCI event handlers. And panics that traverse exported API boundaries like this one have undefined behavior; callers can intercept them before they reach upper layers. So, basically, you can't assume these things. If an implementation of KVStore.Set encounters an error, logging that error and continuing will put the store in an invalid state. Can't do that. Panicking is preferable, though even that doesn't get you reliable guarantees.
I think it's part of the abci protocol that if the commit event handler (I guess any handler on the consensus connection) doesn't return successfully, it won't proceed.
KVStore methods are not necessarily called from ABCI event handlers.
Also, abci.Application defines no expectations re: panics as far as I can see; I'd be happy to see some docs otherwise! In any case, if a function doesn't return an error, then it's always successful by definition; panics aren't errors.
Panics triggered (and not recovered) in non-tx-execution flows, e.g. BeginBlock, halt the node and prohibit it from proceeding, allowing a debugging procedure to happen. Conversely, during tx flow they are recovered, which makes it difficult to debug; this is what @yihuang is alluding to.
> Panics triggered (and not recovered) in non-tx-execution flows, e.g. BeginBlock, halt the node and prohibit it from proceeding, allowing a debugging procedure to happen. Conversely, during tx flow they are recovered, which makes it difficult to debug; this is what @yihuang is alluding to.
Do you think we should panic on an iavl set error?
Yeah, if we're setting an invalid key/value pair, I think we should indeed panic.
(Just want to observe that Set errors don't necessarily mean the key/val pair is invalid; they can also indicate invariant violations in the tree itself, problems during rebalancing, etc. But I suspect this is a meaningless distinction, as any code that's expected to panic on bad input should probably also panic on any other error, too.)
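A sketch of what panicking on a Set error could look like, with a stub standing in for the tree (the `tree` interface, `failingTree`, and `storeSet` are hypothetical; real iavl signatures may differ):

```go
package main

import (
	"errors"
	"fmt"
)

// tree is a hypothetical minimal view of an iavl-like tree whose Set can fail.
type tree interface {
	Set(key, value []byte) (updated bool, err error)
}

// failingTree simulates a low-level DB fault surfacing through Set.
type failingTree struct{}

func (failingTree) Set(_, _ []byte) (bool, error) {
	return false, errors.New("too many open files")
}

// storeSet treats any low-level Set error as fatal: better to panic here
// than to log, continue, and commit a wrong (possibly empty) root.
func storeSet(t tree, key, value []byte) {
	if _, err := t.Set(key, value); err != nil {
		panic(fmt.Errorf("iavl set error: %w", err))
	}
}

func main() {
	defer func() {
		fmt.Println("node would halt here:", recover())
	}()
	storeSet(failingTree{}, []byte("k"), []byte("v"))
}
```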
closing this since there are ways to recover from apphash mismatches without having to resync
> closing this since there are ways to recover from apphash mismatches without having to resync
But it's still hard to diagnose the root cause of an app-hash mismatch. If it's caused by a system error, we can roll back and retry; but if it's non-deterministic logic, it's much more serious, and it's hard to tell at first glance which case we're in.
Summary of Bug
During tx delivery, when errors happen in the low-level DB (for example, "too many open files"), tm-db turns them into a panic, which is recovered by the tx runner and treated as a failed tx execution result. This results in state inconsistency, which manifests as an app-hash mismatch error in the next block. Maybe it's better to stop processing the current block in such cases.
Edit: A new case that we think may be related to this: https://github.com/cosmos/cosmos-sdk/issues/12012#issuecomment-1308209563