Noisy trace logs suddenly appeared in newly-built charms

paulgear commented 5 years ago

I recently rebuilt the ntp charm, and was surprised to find this in my logs on every invocation of the update-status hook: https://pastebin.ubuntu.com/p/DtgHjTkBXS/ - the update-status hook was already a bit too noisy, IMO, and now it's far worse.

I've discussed this already with @stub42, who has explained why charms.reactive has added flag tracing by default, but I'm creating this issue to explain why this is unexpected and undesirable (in no particular order):

It is logging to the juju debug log at DEBUG level, even though it's TRACE-level logging. So it's factually incorrect, and violates the logging level which is set in juju model-config logging-config.
It's logging to the juju debug log, which means it goes into mongodb on the controller. Historically, this has caused problems on high-scale controllers where the logging volume results in I/O load problems.
It's unnecessary for simple, stable charms and thus shouldn't be on by default.
It includes multi-line logs, making them less machine-parseable.

Possible solutions (non-exhaustive):

(Most desirable) Log at juju trace log level. This would mitigate reasons 1-3 above.
(Strongly preferred) Disable tracing by default and require charms to make the conscious decision to enable it. This would mitigate reasons 2-3 above.
Log to a different log file (or trace state in a dedicated database or the like) which is local to the juju unit. This would mitigate reasons 2 & 4 above, and somewhat lessen the importance of 3.

stub42 commented 5 years ago

I think the best way forward is to improve the logging, making it less verbose or more compact, but hopefully without reducing readability. What is currently being logged is my best attempt, but it is also my first attempt and knowingly a compromise.

Although hesitant at first, I do think having this logging on by default is necessary. Opt-in does not work here, as opting in happens after a problem and after the information that needed to be captured has been lost. It would place an additional burden on developers, with yet another default not set to best practice. It also places a burden on layer and interface maintainers, where the information they need to help diagnose a problem in a charm is not available. We can make it easier to turn off, but I certainly wouldn't recommend doing it and charm authors need to be aware of the burden they are imposing on themselves; by opting out, they become responsible for diagnosing problems and issues in their charm, particularly on how they integrate with layers, interfaces and the charms.reactive library itself. When bug reports come in, the logs will be lacking information needed to see what happened, and the charm author will need to reproduce the bug, probably with the logging reenabled, before they escalate the issue to the maintainers of the dependencies. It would be unfair to make dependency maintainers reverse engineer the problem from partial logs, when they are partial due to an explicit choice made by the charm author. This is another reason why opt-in isn't going to work here; as a layer and interface author who needs this information to diagnose real issues, if charms.reactive did not enable the feature by default, I would be enabling it by default in my layers and interfaces. Any layer or interface being used by the charm could opt-in or opt-out, and a charm author who wanted to turn off the debug logs would need to opt out anyway, no matter what the charms.reactive default was.

The difference between debug logging and trace logging is subjective, and generally the differentiator is what you want enabled when in 'debug' mode and what you want enabled in 'verbose debug' mode. The goal is that this information is to be included in the debug logs, along with the rest of the stuff needed to diagnose problems (such as charms.reactive logging which handlers are invoked, or the apt layer stating that it added a signing key as specified in layer.yaml), and included in the logs by default, so that bug reports can include the information without someone needing to reverse engineer how to reproduce the problem and reproduce it with logging switched on. The naming choice of 'tracer' in the logs can certainly be changed as part of improving how the information is presented. It is a poor term to use, as it tracking high level state changes; the flags, and the handlers that may potentially be called. It is not tracking actual execution (charms.reactive already does that, logging the hook start and which reactive handlers are run, at INFO level because this is important to see even when not diagnosing bugs and watching logs at DEBUG level).

The debug logging is going to the the unit log, which is using it for its intended purpose. Charms are supposed to use juju-log to log messages, which juju takes responsibility for, handling rotation and related lifecycle/cleanup, and providing tools to monitor and filter (primarily juju debug-log). I do not think the best way of addressing Juju load issues is to not use the provided tools for their intended purposes. If the historical problems of logging volume causing high load on controllers is still an issue, then this scaling problem needs to be fixed in Juju itself; using provided features less does not make scalability problems go away, it just means they trigger at a higher scale.

A key point of this logging is that it is debug logging. It is not intended to be viewed on stable charms or in day to day operations. It is perhaps unfortunate that the only command juju provides for monitoring logs is 'debug-log', which can be adjusted to filter out the debug message. Perhaps juju needs a model-logs or unit-logs commands which, by default, monitor the unit logs at INFO level. There is only one log stream available, and insisting it is only used for one purpose (say, inspecting what a stable charm is doing) means it is unavailable for all the other purposes (say, diagnosing why a simple charm is failing to do anything useful).

I agree that the multi-line messages are problematic. Ideally we should not be machine-parsing logs that are, unfortunately by design, unparsable, and instead pulling them out of the Juju API in a structured format My main concern is it makes them less readable and harder for humans to parse, as the meaningful information keeps jumping from the middle of the screen (lines with the prefixes and timestamps) to the far left of the screen. The only reason I did the extra work required to make them multi-line is to avoid the overhead of running the juju-log tool, because it is really slow. This is not a problem specific to charms.reactive, but to all charms, and we should improve the juju-log tool (addressing https://bugs.launchpad.net/bugs/1274460 would be a good start, as we could invoke juju-log once and stream messages to it). In the time being, we can certainly revisit the decision to emit multi-line log messages if we can bare the cost (I haven't benchmarked it), and it will immediately make the debug logs more readable.

maaudet-ca commented 4 years ago

Right now, I have charms where those logs take up about 90% of the log space that could be use for actually useful logs. I would like to be able to choose if I log those TRACE level logs into my debug logs or not.

canonical / charms.reactive

Noisy trace logs suddenly appeared in newly-built charms #206