Principles for choosing logging severity

leventov commented 5 years ago

Motivation

There are no guidelines for choosing logging severity (fatal, error, warning, info, debug, and trace) for developers. Logging is inconsistent in different classes.

Proposed principles

Assertion error

Currently, Druid's Logger class has wtf methods which make error-level statements with "WTF?! :" prefix. First of all, I suggest to rename it to assertionError. They have "ASSERTION_ERROR" prefix instead.

These are for the situations when coding assumptions are broken, something that is considered impossible happens. We would otherwise throw AssertionError if Druid didn't want to be more resilient.

Note: something like catching IOException from network or disk I/O methods shouldn't log on this level, because such methods can throw IOException. This should be reserved for situations that are considered truly impossible on "Java language" level.

Note: sometimes, InterruptedExceptions should be logged on this level (rather than ERROR), when we truly don't except that anybody would interrupt our call.

Action for Druid cluster operators:

Track "ASSERTION_ERROR"
Always file an issue to Druid if "ASSERTION_ERROR" appears in logs.

Error

Failed query (due to insufficient resources, exceptions, etc.)
Failed network or disk I/O operation
Failed to load a segment on a node
Failed task
Timeouts
Usually, code like catch (Exception e) {} should log on this level.

Note: in retry code, each failed our timed out attempt should log on this level, not the warning level.

Action for Druid cluster operators:

Track "ERROR".
Consider filing an issue to Druid if "ERROR" (without "ASSERTION_ERROR" following) appears in logs. Not all such errors are problems with Druid, e. g. problem with network in the cluster may result in a spike in I/O action and query errors.

Warning

Calling to an endpoint that was renamed, but the former name is still accepted for backward compatibility (e. g. see #7361)
Druid detects something that may be a sign of misconfiguration of a query, or a node, or the whole cluster (e. g. too much time spend in balancing segments)
Druid detects that some cluster resource is about to exhaust (e. g. capacity of a tier)

Action for Druid cluster operators:

Track "WARNING".
Consider taking some action: call into new endpoint name, reconfigure the cluster, add historical nodes, etc. if "WARNING" appears in logs.

If there are no bugs in Druid code, "WARNING" should never be something Druid cluster operators file an issue about directly. This is another criterion for choosing between the error and the warning levels for any logging statement.

Info

Some "interesting" information that may be helpful for introspection of Druid node or cluster operation, but doesn't qualify to be a warning.

Action for Druid cluster operators: none, unless during an investigation of some bug or error condition.

Note: information that is too mundane to be useful even for bug or error investigation should not be logged as info (and not logged at all, as proposed below)

Debug, Trace

These levels are not used in production code. Logging statements on these levels can be added during testing to inspect something (if debugging doesn't work or help) but should be removed before committing code.

Rationale: if a logging statement doesn't deserve to be "info", it's net value by contaminating the code is negative.

Future work

Fixing all existing logging statements in the codebase is out of the scope of this proposal. It should be a gradual community effort after the principles are worked out.

After we remove all debug and trace logging from production code (maybe making some of them "info" instead), we can add static analysis checks to prohibit adding new such statements in production code.

jon-wei commented 5 years ago

I think that's a pretty reasonable breakdown for logging conventions.

Replacing "WTF" with the more professional-sounding "ASSERTION_ERROR" is good too, and I'm okay with dropping debug/trace logs.

egor-ryashin commented 5 years ago

I don't agree with dropping trace or debug logging, actually. The counterargument usually those can/should be introduced during testing only, well, sometimes there's a bug in production which is not reproduced in the testing environment. If there's no debug/trace coded at all, one has to introduce the logging and upgrade the production cluster. In my experience, if debug/trace logging doesn't help to find out where the root cause is, they at least help to identify where it is not.

About INFO, just wonder if segment load action should be logged assuming you can get the segment location using HTTP API?

leventov commented 5 years ago

@egor-ryashin this is the premise of adding debug and trace statements. However, in practice, I don't know of a single case when debug/trace logging helped to find a bug in production in Druid. So I think the benefit of keeping all these logs doesn't justify the cost. If somebody has a feeling that some logging statement should likely help to find bugs then it should be on the info level rather than on the debug level.

When dealing with a specific bug, a developer may know much better what and where exactly should be logged to help to find the specific bug, and all these places are very unlikely to be covered by pre-existing debug logging. So it's likely that they would need to add more logging, rebuild and redeploy. Solving this problem with a brute force, by adding debug or trace statements abundantly doesn't seem to be a viable solution.

About INFO, just wonder if segment load action should be logged

I think it should be logged only if it led to an exception during loading. That is, the exception should just be wrapped with another exception with a message like "a error while loading segment [my_data_source_startDate_endDate]", or something like that.

assuming you can get the segment location using HTTP API

I didn't understand this part.

gianm commented 5 years ago

Three thoughts on this that are maybe related, but also stand alone:

1) I have definitely used debug/trace logs to find bugs before, but I would say even more often is using them to find root causes of problems that are not really bugs. For example, the debug logs for memory allocations are at debug level because they are too noisy to log in normal operation, but are useful to enable when trying to figure out what is eating up all the memory. For another example, debug logs that log each HTTP request/response are useful for tracking down the root cause of some service-to-service communication problem, or for tracking down the root cause of an unexpected query result.

2) Sometimes, when writing code, debug or trace logs for me replace comments. Instead of writing a comment like 'Skip zero length chunk' we can include a log statement like log.debug("Skipping zero length chunk");. I guess this would be fine as a comment too, but making it a comment doesn't really make the code cleaner (same number of lines), and it lets people enable debug logs so they can see how control flows around. I do this sometimes when writing or debugging unit tests.

3) I generally feel that the INFO logs in Druid are too noisy today, but some of them are not things I'd want to remove, due to points 1 & 2 above. So I would support a program of demoting more INFO logs to DEBUG level, and DEBUG to TRACE, with the idea being that Druid logging by default should be a lot 'quieter', but you can enable DEBUG level logging to get what you get with INFO today, and TRACE to get things like each and every HTTP request/response and memory allocation.

leventov commented 5 years ago

@egor-ryashin @gianm ok. Could you please suggest criteria for how to choose between INFO and DEBUG?

@gianm I still disagree, however, with the suggestion about TRACE. Regarding memory allocations, it should be possible to profile them much more effectively using async-profiler in allocation mode. Tracing each and every HTTP request/response should be enabled by some Jetty flag, or if Jetty doesn't have such, by a special Druid's flag because while you want to debug each and every HTTP request and response, you probably don't want to trace, for example, each and every [insert other type of very noisy thing here], and vice versa. TRACE is so noisy that use cases should be specialized. So I think it should rather be

if (specialFlagIsOn) {
  log.info()
}

Than log.trace().

leventov commented 5 years ago

Another alternative is delegating all HTTP logging, for example, to a class HttpLogging and enable class-specific TRACE logging level.

gianm commented 5 years ago

TRACE is so noisy that use cases should be specialized.

Usually I'd enable TRACE logging for a specific package, rather than across the entire codebase, using log4j's per-package log level config features. Blanket enabling of TRACE is way too noisy and sometimes even quickly fills up the filesystem.

Regarding memory allocations, it should be possible to profile them much more effectively using async-profiler in allocation mode.

Ah, I really meant off-heap allocations (like decompression / dictionary merging buffers, etc). And sometimes I'm lazy and using logs is quicker/easier, especially when interacting with a Druid cluster that runs in an environment that I don't control and can't install extra software into.

@egor-ryashin @gianm ok. Could you please suggest criteria for how to choose between INFO and DEBUG?

I'm not sure what a specific set of criteria would look like but I'll think about it. I feel that right now, the noisiest log messages on coordinator/historical are the ones that deal with specific segments (assignment / load / drop), so maybe it's best to think about those first. On the overlord and in tasks, I feel the situation is even worse, and there is way too much chatter about locks, allocations, and so on. Those also tend to be about specific segments though (or sometimes specific time chunks that are getting written to). The Kafka supervisors are also pretty noisy and tend to log a lot about what they're thinking at any given moment, probably too much.

I think ones that would be good to keep include anything that updates the metadata store (so any sort of segment allocation or publish). But right now we tend to have multiple log lines for each event like that, and it'd be best to consolidate those down.

I think it'd also be good to keep commands from the coordinator to historicals (it asking them to load/drop stuff) but I'm not sure we need them on both the coordinator and the historicals, like we have today. Maybe keep them as INFO on the historicals but demote them to DEBUG on the coordinator? Not as sure about that one, just thinking out loud.

leventov commented 5 years ago

Ah, I really meant off-heap allocations (like decompression / dictionary merging buffers, etc)

You can profile them with async-profiler too, using -e malloc option. And it doesn't require installation on a machine, just downloading.

egor-ryashin commented 5 years ago

Could you please suggest criteria for how to choose between INFO and DEBUG?

"In general, trace level logging cannot be on for sustained periods because it degrades the performance of the application greatly, and/or creates an abundance of log data that is unsustainable due to disk/bandwidth constraints.

Debug level logging can usually be on for a longer period without making the app unusable."

leventov commented 5 years ago

My goal is to create criteria that can be more or less easily and unambiguously used during code reviews. Such as "Druid cluster operators should react to this logging statement? Yes/No" criteria for WARN/INFO pair.

Probably the criteria should be topic-based. For example, "on query nodes (historicals/brokers/routers/indexing processes on middle managers), nothing per-query, per-HTTP request/response, or per-segment should be on INFO level, only DEBUG".

leventov commented 5 years ago

Another suggested principle/checklist item: don't do log.error("...", e.getCause()), use just e instead to provide cluster operators and developers with full stack information.

apache / druid