cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.3k stars 3.64k forks source link

[Feature]: During an upgrade, the logger should be forced to have a minimum of info. #22522

Open SpicyLemon opened 2 weeks ago

SpicyLemon commented 2 weeks ago

Summary

As a blockchain maintainer and upgrade coordinator, I want nodes configured with a log level of "error" to also output info-level messages during an upgrade so that I can more easily identify what might be going wrong and also so that node operators are less likely to kill the process in the middle of a long upgrade because they think it's stuck. If the node is configured with a higher log level (e.g. debug), then the upgrade should still happen with debug level log output.

Problem Definition

During long-running upgrades, we've found that node operators often interrupt the upgrade thinking that it's stuck and not doing anything. This usually leaves the state store unusable and they then have to restore from a backup which they often didn't make, so they have to track one down. In our upgrades, we make sure to include plenty of logging about what's happening during an upgrade, and those are always info level. Most validators run their nodes with only error logging enabled though. So they don't get to see that the upgrade is doing things. We can tell people to switch to info level during upgrades, but then they have to kill the process and switch back when done.

Further, sometimes things go wrong during an upgrade. Without those info-level messages, it's impossible to know which area to look in while trying to troubleshoot. Even knowing which module is being processed would be a major improvement. But again, those are info-level messages that won't be seen by nodes configured for error level.

The log types used by the SDK make it impossible to detect what log level a logger has and also make it impossible to either change the log level or create a copy of an existing logger, but with a different log level.

Proposed Feature

Create something like NewLoggerWithIncreasedLogLevel(logger, level) that would return itself if the provided logger has at least the provided level, or else create a copy of the logger with the provided level. Then, SDK users could call that at the start of their upgrade handlers so that it'd all run with the desired level, but revert back to the configured one once the upgrade is done and the node is running again. You (the SDK) could even use that to update the logger prior to calling the upgrade handler.

Relatedly, it'd also be nice to see the SDK add periodic log messages during its own upgrade processes. All of the long-running module upgrades I've seen from the SDK haven't had any logging. So even when running with debug level logs, nothing is printed to the logs for more than 30 minutes in some cases. At the very least, it should log when its starting/ending a module upgrade.

julienrbrt commented 1 week ago

Hi,

Relatedly, it'd also be nice to see the SDK add periodic log messages during its own upgrade processes. All of the long-running module upgrades I've seen from the SDK haven't had any logging. So even when running with debug level logs, nothing is printed to the logs for more than 30 minutes in some cases. At the very least, it should log when its starting/ending a module upgrade.

This seems like a no-brainer yeah, I actually thought we already had this.

Concerning automatically changing the logs before an upgrade, that maybe something to do on the chain side. The SDK provides server.CreateSDKLogger methods to create the global logger for the whole app. If a chain wants to change how it behaves, they should override that call and handle custom logic. The logger, as you said, once set, cannot be changed, and this isn't something we can change easily.

The other easiest way, is simply to tell node operator to set COSMOVISOR_CUSTOM_PREUPGRADE to a bash scripts that updates the app logger to info (Editing the config.toml log_level to info).