Integrate Structured Logging

CMCDragonkai commented 2 years ago

Specification

Structured logging is to use JSON on the STDERR logs. This exposes more of the data, and allows us to create a standard for the set of fields to be used.

There's no real official standard for structured logging. But journalctl and syslog defines some common fields. The fields can be similar to our formatting that we already use, but this time in a JSON format. Additional data fields can be encoded too.

We could put a json formatting in somewhere that provides the JSON serialisation.

This can be especially useful since we are also looking into adding error chaining, which we have done in a few places. But by creating our own @matrixai/error we can implement error chaining. https://github.com/MatrixAI/js-polykey/issues/304

Additional context

https://github.com/MatrixAI/js-polykey/issues/304
https://www.structlog.org/en/stable/
https://github.com/uber-go/zap
https://github.com/MatrixAI/js-polykey/pull/283#issuecomment-986400664 - where an error came up and no context was there, but all in the data which required manual serialisation

Tasks

Research different structured logging libraries
Compare with opentracing or tracing based data formats
Implement simple structured logging capabilities into js-logger

CMCDragonkai commented 2 years ago

Need to consider a couple of things:

Common fields that should align with other structured logging mechanisms
The --format field used for output formatting can double as for logger formatting too
How does error chaining come into play here, and how should these be shown?
How does stack traces come into play here? Stack traces don't have sensitive information, they are just traces of what functions called what functions.

CMCDragonkai commented 2 years ago

So basically the problem is this:

Does --format apply to just STDOUT or also to STDERR? I believe it should apply to both. Especially as STDERR will likely be relied on upon quite a bit by pk agent start. So doing something like pk agent start --format=json --verbose will then ensure that INFO messages get logged out to STDERR by with JSON structure rather than human readable messages. This will be important for any kind of long-term debugging of the PK agent.
If --format applies to STDERR as well, this then affects the PK CLI calls which would mean any exception plus INFO messages would also become JSON. This is OK as we can expect that once --format=json is used, we would actually want this to be worked by a computer.

Therefore any usage by a computer should always use --format=json. It would affect exception output on STDERR, logging messages to STDERR, and output to STDOUT.

CMCDragonkai commented 2 years ago

One additional issue is the "level of verbosity". This is basically controlled by --verbose. But when we output an exception to STDERR, we don't currently show the stack trace atm. The root exception handler by the PK agent and in PK CLI calls don't show the stack trace. The stack trace is useful for tracing where the error comes from.

Right now verbosity only controls INFO, WARNING... etc. We could do something where one chooses to use logger.debug to show the stacktrace as well if the logger level is appropriate. But this requires a conditional check on the logger's effective level.

Something like:

if (this.logger.getEffectiveLevel() > ...) {
  this.logger.debug(e, ...); // include stack trace
} else {
  this.logger.info(e, ...); // don't include stack trace
}

A similar situation would apply to errors being thrown over the GRPC boundary as discussed here https://github.com/MatrixAI/js-polykey/issues/249.

I reckon something can be automated if we integrate the design of js-logger with js-errors library here https://github.com/MatrixAI/js-polykey/issues/304, such that when an error is being logged out, we gain additional context depending on what the log level is. Of course what errors is passed over GRPC also depends, some errors we only want to write to the agent's STDERR, other errors we want to propagate to the GRPC client, and in other cases we would want to "map" agent-local errors to GRPC API errors.

This would mean something like catch a local agent error, write that out to STDERR on the agent on the RPC handler, then create a new error to be propagated to the GRPC client. We have to separate client errors from server errors like in HTTP where there is 4xx errors vs 5xx errors. 5xx errors would be caught by the exit handler, and put to STDERR. 4xx errors are not, they would be emitted by the RPC handler. We would only log them out to STDERR if the RPC handler handled the error, and also that this is a problem for the agent itself.

CMCDragonkai commented 2 years ago

Node v16 should be reporting the error chain nicely now.

CMCDragonkai commented 2 years ago

@emmacasolin FYI this issue addresses the idea of how the exceptions (including the error chaining) will eventually be displayed on the CLI or STDERR logging.

CMCDragonkai commented 2 years ago

Note that the cause property is actually an ES2022 feature, but nodev16 does support it, but because we haven't fully updated our standards to it, we will forget that it actually exists on the standard errors.

CMCDragonkai commented 2 years ago

Any kind of structured logging will be intended for machine consumption and filtering. That's because any logging of server errors may contain a long stack string, and that information is likely to be quite noisy on basic log readers.

CMCDragonkai commented 2 years ago

@emmacasolin this would still be needed to ensure that --format json affects STDERR logging.

CMCDragonkai commented 2 years ago

Some thoughts...

We've been hitting various performance and other network buggy issues which are hard to debug because of concurrency and that asynchronous work are happening out of order.

I think we need to improve our structured logging to provide a better observability into the runtime of our program.

To do this, I'm thinking that we need to incorporate some new ideas:

use perf_hooks to get proper timing data: https://blog.logrocket.com/experimental-node-js-testing-the-new-performance-hooks-31fcdd2a747e/
use async_hooks to hook into Node's asynchronous contexts to understand when something asynchronous is running, and the architecture of them together
We should include some new formatting symbols like trace from #4 and also time stamp symbols too
We need a log viewer/trace viewer that we can use locally (possibly in the browser, possibly in the terminal) that can understand the structured logs and give us context:
Get some ideas of prior art with respect to opentelemetry and related structured logging systems:
- https://github.com/tokio-rs/tracing
- https://github.com/tersesystems/blindsight
JSON vs Structure logging:
- https://github.com/hynek/structlog (I believe human readable is structured logging)
- https://github.com/sirupsen/logrus
Right now we have a different way of reporting exceptions/errors vs our logging, that is when exceptions are reported we are printing them a certain way in our CLI. But we should also be able to make use of loggers to log out the same exceptions. Perhaps we should be unifying something between js-logger and js-errors
Log viewers - it seems all terminal log viewers only view linear logs https://github.com/tstack/lnav, but you do have browser based viewers that can provide more useful trace understanding: https://github.com/openzipkin/zipkin and https://github.com/jaegertracing/jaeger. However, there are terminal and related tools that visualise via a flamegraph, which usually uses the profiler, instead of distributed trace formats. However this is basically the samething, but nobody has built it.

Point is flamegraph was only invented by Brendan Gregg back in 2013/2014. But the same idea can be applied to view an entire trace of the entire distributed system. But also works for a local system too. But the tooling isn't really solid though and seems heavyweight for local debugging. So something inbetween might be useful.

CMCDragonkai commented 2 years ago

Some clarification on terminology.

There's a "standard" for structured logging as exemplified by https://www.structlog.org/.

In that structure, it is not using JSON, but using a human readable structure.

This issue is more about JSON as a structured logging format.

I believe the human readable structure as demonstrated by structlog is a good idea, and we can move our logging format to fit that format eventually. But JSON logging is simply more powerful for any machine analysis of logs.

However atm, I'm not sure what the deal is with information like trace #4, and how structlog would handle information like that. It's not human readable to have information like that on one single line (which is the default for most loggers), it's simply too much information. And if we want to preserve human readability, newlines/tabs/columns on the terminal must be used. The question is, is it really possible to enable easy machine parseability while still using whitespace as a formatting operator for human readability on the terminal? Maybe the goals are in conflict.

Some more info here: https://github.com/hynek/structlog/pull/407

I think we should go as far as we can to maintain structured logging for both human readability AND machine parseability.

In #4 I suggested that if we ever do need to use literal newlines or whitespace operator for formatting, it must always be done with some indentation. That way, one can say that any newline that has ahead of it, is still a continuation of the same log message.

But it's a bit more complicated than that. Because you need a "lookbehind" to say that whatever is before the ` must not also be a\n`.

So this should be fine as that terminates the message:

ABC \n

But this should not be considered an end to the message

ABC \n
  ABC\n

But this would get confused with:

ABC \n
  ABC \n

Where it is 2 messages.

So if there are spaces after \n, then that must be joined up with the first message.

Then machines should be able to parse individual messages.

CMCDragonkai commented 2 years ago

However do note that such usage of newlines are unlikely to be parseable easily, since it depends on logging analysers that know that this can occur. I reckon that it should be possible to code a custom parser for lnav that understands such a formatter though.

However it looks like it still isn't entirely a solved problem: https://github.com/tstack/lnav/issues/410 (multi-line matching).

This further shows that any usage of multiple lines for a log can only really be usable on the terminal and custom log parsers.

But at that point the JSON structure should be used instead.

CMCDragonkai commented 2 years ago

@tegefaulkes @emmacasolin follow this for more structured logging research.

CMCDragonkai commented 2 years ago

Also I noticed that structlog talks about "context", that almost sounds like opentelemetry's idea of spans.

My opinion is that "context" can be done by hooking into async_hooks, however that won't work for distributed spans, which goes across process contexts, so at some point it has to be done through decorators.

It does seem that decorators would be the most expressive way of doing this, as this is basically a sort of AOP (aspect oriented programming) cross-cutting concern. We would want to avoid having to add lots of line-noise, but still be able to drop down to imperative logging for flexible control in edge cases where a decorator just doesn't work.

CMCDragonkai commented 2 years ago

Because PK is a decentralised system, we do have calls that propagate from 1 agent to another agent. Debugging this is crucial, but not necessarily possible in production unless we control all the parties to the interaction. Right now there's a bug in the NAT testing PR where signalling involves agent 1 asking the seed node to ask agent 2 to reverse hole punch. If there's a delay anywhere here, we end up with timeout errors.

CMCDragonkai commented 2 years ago

Structured graph visualisation with contexts/spans/traces. I think we take a page from the git graph, and use things like: https://github.com/nicoespeon/gitgraph.js.

Along with the ink library: https://github.com/vadimdemedes/ink we may even extend the react components to eventually the GUI too show a trace on what is occurring in the system.

CMCDragonkai commented 2 years ago

To solve this issue, I'm just going to incorporate json as formatting style. In terms of the actual data being logged, it'd be similar to what is already available under our formatting. I might incorporate similar keywords to what is already available to align. But the contexts and spans would have to be done in a separate issue.

A JSON format would be one step towards producing traces JSON data that can then be analysed and visualised. There is a relationship between this and "state" simulacra, but somewhat independent. State simulacra is generally specific to the state being modelled, while tracing is very generic and applies to any contextful based systems.

CMCDragonkai commented 2 years ago

@emmacasolin please change up PK to start using the formatting.jsonFormatter when --format=json is set on the CLI as per https://github.com/MatrixAI/js-logger/issues/3#issuecomment-994229643.

MatrixAI / js-logger