canonical / go-dqlite

Go bindings for libdqlite
https://dqlite.io
Apache License 2.0

OpenTelemetry Driver Tracing #276

Closed SimonRichardson closed 11 months ago

SimonRichardson commented 12 months ago

To aid with debugging and development, this PR implements the ability to trace requests to the driver. Traces include the current SQL from the conn methods ExecContext and QueryContext, as well as from the stmt methods ExecContext and QueryContext. Unfortunately, we can't trace every method, as Go's database/sql/driver interfaces don't allow passing a context to Commit or Rollback.

This is the lightest-touch way of adding tracing. The tracing code does not require any external dependencies, as it retrieves a Tracer from the passed-in context. As long as the Tracer conforms to the tracing.Tracer interface, a trace can be performed.

There will always be a Tracer available: a fallback is used if one isn't located on the context.

Potentially, this could supersede the current tracing implementation and remove the need to have a tracing log.

Example

The Juju project is starting to implement OpenTelemetry to better understand requests within the system. As the dqlite project is a big part of Juju, it would be good if we could see what's happening within the go-dqlite library. I've added a work-in-progress Tempo screenshot to show what's available with this PR and what information we might start to gather.

Screenshot from 2023-09-22 22-02-24

cole-miller commented 11 months ago

Thanks! I'm not familiar with OpenTelemetry, but I will take a look at this and see what I can make of it. We really need a better tracing/observability approach for dqlite more broadly. I'm not sure whether OTel is the way to go there, but I do generally agree that the current ad-hoc logging leaves a lot to be desired, and we could think about removing it entirely once something better is in place.

SimonRichardson commented 11 months ago

So Juju is trying to integrate with the COS (Canonical Observability Stack) framework, which uses Tempo for tracing. Note: This tracing is different from DTrace/strace. They serve different objectives.

We're attempting to provide SREs with a turnkey solution to observe and maintain Juju directly. Selfishly speaking, I want to be able to use OTel to give me active traces and understand what may be causing Juju to slow down, or whether there are excessive requests in certain parts of the code. Now, I know that it might not tell me exactly, but that's fine. More investigation will probably be required, but if it can help identify issues, then we're in a better place than where Juju is currently.

In addition to this, there is Polar Signals Cloud, which we're also experimenting with and which gives you the ability to profile continuously. This is analogous to tracing. Profiling tells you what's happening, but generally doesn't explain why, which is why we're tackling it from both viewpoints.

SimonRichardson commented 11 months ago

Just FYI: it is possible to use the new tracing API (start and end) to instrument in the prior way. So we could migrate away from the old way, whilst still offering the new way. I've stuck it up as a prototype on my repo as a potential path forward. https://github.com/SimonRichardson/go-dqlite/pull/1

cole-miller commented 11 months ago

Thanks!