MovingBlocks / Terasology

Terasology - open source voxel world
http://terasology.org
Apache License 2.0
3.67k stars 1.34k forks source link

Reduce log pollution with unnecessary stack traces #4966

Open jdrueckert opened 2 years ago

jdrueckert commented 2 years ago

Motivation

It's hard to find relevant information in the log output if it's polluted with stack traces of non-fatal errors or errors that are gracefully handled and mask the actual issue.

Proposal

Investigate options to reduce non-relevant information in the log, including

Consideration that may help with the decision-making for these options:

Additional notes

There's no general "this is the right way to do this". But we should investigate, brainstorm and discuss how we want to handle it. Also, once we got to an agreement, we might want to document this in the contributor guides.

keturn commented 2 years ago

My main suggestion on this is: Use the logging framework's configuration as much as possible. Then you have fewer decisions to make each time you add a log call, and one place where you can make updates on policy or add switches for different use cases.

keturn commented 2 years ago

Investing more in the logger's configuration may also involve a side discussion: Do we want to stick with slf4j? I think that part of the java ecosystem has shifted since we adopted it.

jdrueckert commented 2 years ago

My main suggestion on this is: Use the logging framework's configuration as much as possible. Then you have fewer decisions to make each time you add a log call, and one place where you can make updates on policy or add switches for different use cases.

I'm a bit concerned about configuring stack traces to not be present by default (please correct me if I read that wrong), because until we are at the point at which we improved the logging and error msg throughout the code base, we'll be likely hit with a lot of bug reports that are very hard to debug because we have neither the stack traces nor other relevant info.

Edit: what we could do is keep the stack traces configuration-wise at least when we start games from launcher. People that start from source can more easily be asked to reproduce with an additional flag, etc.

keturn commented 2 years ago

I think for this discussion we'll want to grab a few log files as case studies to center the conversation, and take notes on them as we go.

I am imagining this with a bunch of printouts and markers or crayons on a table, but I guess we will have to figure out some remote-friendly way to do it. 😆

jdrueckert commented 2 years ago

To start with some examples:

Example 1

This log was noticed when testing https://github.com/MovingBlocks/Terasology/pull/4962

20:45:54.252 [main] WARN  o.t.engine.registry.InjectionHelper - CoreCommands wanted NUIManager injected but CoreRegistry has none.
20:45:54.296 [main] ERROR o.t.engine.core.TerasologyEngine - Uncaught exception, attempting clean game shutdown
java.lang.RuntimeException: Game receive error and fall to main menu: [Missing required module or dependency]. See logs before
        at org.terasology.engine.core.modes.StateMainMenu.init(StateMainMenu.java:86)
        at org.terasology.engine.core.TerasologyEngine.switchState(TerasologyEngine.java:587)
        at org.terasology.engine.core.TerasologyEngine.processPendingState(TerasologyEngine.java:575)
        at org.terasology.engine.core.TerasologyEngine.tick(TerasologyEngine.java:479)
        at org.terasology.engine.core.TerasologyEngine.mainLoop(TerasologyEngine.java:459)
        at org.terasology.engine.core.TerasologyEngine.runMain(TerasologyEngine.java:435)
        at org.terasology.engine.core.TerasologyEngine.run(TerasologyEngine.java:401)
        at org.terasology.engine.Terasology.call(Terasology.java:180)
        at org.terasology.engine.Terasology.call(Terasology.java:69)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
        at picocli.CommandLine.access$1200(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
        at picocli.CommandLine.execute(CommandLine.java:2058)
        at org.terasology.engine.Terasology.main(Terasology.java:138)
20:45:54.297 [main] INFO  o.t.engine.core.TerasologyEngine - Shutting down Terasology...
java.lang.RuntimeException: Game receive error and fall to main menu: [Missing required module or dependency]. See logs before
        at org.terasology.engine.core.modes.StateMainMenu.init(StateMainMenu.java:86)
        at org.terasology.engine.core.TerasologyEngine.switchState(TerasologyEngine.java:587)
        at org.terasology.engine.core.TerasologyEngine.processPendingState(TerasologyEngine.java:575)
        at org.terasology.engine.core.TerasologyEngine.tick(TerasologyEngine.java:479)
        at org.terasology.engine.core.TerasologyEngine.mainLoop(TerasologyEngine.java:459)
        at org.terasology.engine.core.TerasologyEngine.runMain(TerasologyEngine.java:435)
        at org.terasology.engine.core.TerasologyEngine.run(TerasologyEngine.java:401)
        at org.terasology.engine.Terasology.call(Terasology.java:180)
        at org.terasology.engine.Terasology.call(Terasology.java:69)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
        at picocli.CommandLine.access$1200(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
        at picocli.CommandLine.execute(CommandLine.java:2058)
        at org.terasology.engine.Terasology.main(Terasology.java:138)
For more details, see the log files in /home/jdr/.local/share/terasology/logs/2021-11-27_20-45-52_5.2.0-SNAPSHOT

Issues with this log:

  1. The stack trace is printed twice
  2. The stack trace is printed at all, although the error even states that this is not the actual issue
  3. This is the last bit of logs you see, so intuitively this is where you try to find out what's going wrong (and there's people that don't read the error msg, but directly jump on the stack trace)

Example 2

This log can be reproduced currently (might be fixed in the future) by entering the "Join Game" menu.

21:34:13.531 [ServerList Downloader] ERROR o.t.e.r.n.l.m.ServerListDownloader - Error downloading online server list!
java.io.IOException: Server returned HTTP response code: 503 for URL: http://meta.terasology.org/servers/list
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1924)
        at java.base/sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1512)
        at java.base/sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1510)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/java.security.AccessController.doPrivilegedWithCombiner(AccessController.java:795)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1509)
        at java.base/java.net.URL.openStream(URL.java:1165)
        at org.terasology.engine.rendering.nui.layers.mainMenu.ServerListDownloader.download(ServerListDownloader.java:100)
        at org.terasology.engine.rendering.nui.layers.mainMenu.ServerListDownloader.download(ServerListDownloader.java:77)
        at java.base/java.lang.Thread.run(Thread.java:829)

Issues with this log:

  1. This error is non-fatal but prints a stack trace
  2. The stack trace is relatively irrelevant because the issue is not client-side (5xx http return codes are server errors)
  3. The error log should contain the exception msg
keturn commented 2 years ago

Example 1 (StateMainMenu.java:86)

Puzzling that it is printed twice. I wonder, does it appear in the log file twice? Or is one of those the log, and the other one is what is written to stderr on the terminal when the process exits due to exception?

If that is what's happening, I'm not sure which of them to suppress. 🤔

It does seem like the stack trace is less than useful, as it came though #processPendingState and it would be more useful to know what caused the state change, but you can't, in general, know what is a few levels up the stack when deciding to add a log message.

We do agree (as recently mentioned in #4960) that we want better patterns for managing these state transitions and their error handling in general.

In the short term, perhaps we should move that exception (because it's conditional based on the headless flag introduced in that PR) to when the messageOnLoad is set instead of waiting for init to be invoked?

keturn commented 2 years ago

Example 2 (ServerListDownloader)

I think I do not share your expectations about what an error being "non-fatal" has to do with the presence or absence of stack traces in the log. That is, this didn't result in a crash to desktop, but it clearly was fatal to ServerListDownloader. Something exceptional happened and it couldn't complete its job and now the system is in a state where it can't let the user do what they'd expected to do. Logging something about how it got in to that state sounds like an appropriate use of logger.error()to me.

Though in this case we can see that the author had some awareness that this might not be the best way to handle this exception: https://github.com/MovingBlocks/Terasology/blob/43e1b90bd7219ce4235988d27bc25856e27d8bdd/engine/src/main/java/org/terasology/engine/rendering/nui/layers/mainMenu/ServerListDownloader.java#L87-L92

Using Reactor's schedulers for asynchronous operations gives us an alternative to the suggested Thread.setUncaughtExceptionHandler. I'm not sure if that ends up being relevant, in the end, to how it's logged. I guess it's relevant because in the current implementation, the log was the only way of recording what happened, whereas the caller would at least have had the option of handling it differently with a better asynchronous API.

The stack trace is relatively irrelevant because the issue is not client-side (5xx http return codes are server errors)

Hmm. It takes a lot of special-case error handling logic to recognize that, as download could also have plenty of other exceptions that don't involve a 5xx response.

If cost were no issue, I agree we'd want error handling here that would give relevant guidance on the error. "Plug in the network cable" or "re-check the server hostname in the configuration" or "see status.github.com for updates, go for a walk until it comes back online"—depending on the type of exception.

Is this an error we expect to happen often enough to deserve that level of attention?