TheThingsNetwork / lorawan-stack

The Things Stack, an Open Source LoRaWAN Network Server
https://www.thethingsindustries.com/stack/
Apache License 2.0
975 stars 306 forks source link

Network Server logging #3658

Closed rvolosatovs closed 1 year ago

rvolosatovs commented 3 years ago

Summary

Reorganize Network Server logging

Why do we need this?

To simplify debugging. e.g. see https://github.com/TheThingsNetwork/lorawan-stack/issues/3637#issuecomment-758672243

What is already there? What do you see now?

Some important log statements are missing and some less relevant, low-level messages are printed

What is missing? What do you want to see?

Clear and simple log statements, which are easily understood by us and our customers

How do you propose to implement this?

Think about what we want to see from NS as a whole and go through existing statements.

How do you propose to test this?

Run the stack locally and send some traffic

Can you do this yourself and submit a Pull Request?

yes

nejraselimovic commented 3 years ago

@htdvisser I guess this is still relevant, but needs input from others too. Can you pick it up?

htdvisser commented 3 years ago

Requested input from @michalborkowski96 @i3mSuperMan @ymgupta.

michalborkowski96 commented 3 years ago

I'll copy-paste what I wrote on Slack:

When something breaks, there are typically two scenarios:

  1. Component completely dead
  2. Component mostly working, broken for a small number of users

When case 1 happens usually in the logs you see a few messages repeated endlessly. In this case the more messages the merrier, because it gives you more information about the actual problem. From my perspective it's very simple, because I already have a clear starting point.

When case 2 happens you need to catch useful information in a sea of messages related to normal operation. This is very difficult when I have only vague idea about how TTS works, so the less messages the better. This is why I was pushing for https://github.com/TheThingsIndustries/lorawan-stack/issues/2702 because then I would be able to import these messages somewhere, and filter out normal ones.

I don't really know what to say besides the fact that both types of problems occur and we just need to look for some kind of balance.

Regarding messages themselves:

  1. If a message keeps appearing all the time, and looks like normal operation, maybe it's worth it to remove it, or somehow aggregate (print once in a few seconds that an event happened that many times). In particular I think that aggregation sounds like a good idea, as it'd reduce the number of messages without much information loss.
  2. If fetching messages to my local computer takes longer than 15 minutes for the whole stack for last 24h, then we have way too many messages. If we're under this, or a similar sensible limit, we don't really need to remove messages. Otherwise we do, because this means that logs can't be searched through even by a machine.
  3. If an event is rare, it's worth logging, even if it's seemingly normal (for example completely valid, but super-rare device behaviour).
i3mSuperMan commented 3 years ago

Hi @htdvisser,

Currently, we have INFO and DEBUG level logs. For example: during a join, in INFO level logs, we see HandleJoin logs and in DEBUG level logs, we see additional info such as Join-request accepted by cluster-local Join Server and Database Operations (which doesn't seem to be helpful when we have thousands of logs).

Few suggestions:

  1. Introduce additional log levels such as FATAL and TRACE where fatal logs give info about super-rare device behavior and broken functionality. And, trace level logs for Database Operations and more info.
  2. INFO level logs to give more information to users stating the components involvement such as HandleUplink and forward to AS
  3. And add more logs to DEBUG level to provide info on the processes done by Network Server and their status.
htdvisser commented 2 years ago

Pushing this forward. No time, and I'm still hoping that eventually we can just remove the logging and use distributed tracing instead.

htdvisser commented 2 years ago

Nothing changed since my last comment. Still hoping that at some point we can just drop the logging and use distributed tracing instead. Pushing to Q3.

htdvisser commented 2 years ago

Refs https://github.com/TheThingsIndustries/company/issues/98#issuecomment-1177278973

cc: @NicolasMrad

KrishnaIyer commented 1 year ago

We've done quite some work internally on NS logging. There are no additional action items here in this issue. Users can open a new issue if something is missing.