Open vanakema opened 2 months ago
Appreciate the issue @vanakema - we definitely need to identify how and where we support and enable tracing. In past products (BB/DUBBD) we included Tempo, which was also identified as a possible future tool for Core. We also did some initial work to implement it here, but that work was largely put on hold while we figure out whether Core is the right place for it and how to handle it.
We've commonly heard that end users don't see the value in tracing tooling, and might prefer to have it as a "support package" outside of the standard Core package. That way it could be deployed at will, depending on whether an end user decides they need it. There is probably also an education aspect to this: identifying and explaining how and where tracing can be useful in debugging.
I would say we likely need to support this, but we have options for the "how":
Some of the integration pieces needed to enable tracing (Istio config, network policies to allow traces to be sent, etc.) might lean more towards placing it in Core.
Is your feature request related to a problem? Please describe.
Debugging issues in systems with many services can be a pain. Debugging performance issues can be even worse, especially if you can't reproduce them in a non-customer environment. You could write Prometheus metrics to track the end-to-end latency of one of your services, but that requires forethought, which you don't always have, and the level of granularity (function level vs class level vs service level vs system level) is fixed once you've written them.
Describe the solution you'd like
OpenTelemetry is a CNCF incubating project that aims to standardize distributed tracing. Pretty much every distributed tracing system uses it under the hood or builds to its spec. It provides instrumentation libraries for various languages that will automatically instrument your applications and inject trace IDs into the metadata of your requests (such as HTTP and gRPC). These traces are then exported to a collector, and the data is viewed via a frontend of some sort. At a previous company, we used Jaeger, a CNCF "graduated" tool that was originally made by Uber.
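To make the "inject trace_ids into the metadata of your requests" part concrete, here is a minimal stdlib-only sketch of how a trace ID can ride along in an HTTP header using the W3C Trace Context `traceparent` format, which is the propagation format OpenTelemetry uses by default. This is not the OpenTelemetry API; the helper names (`new_trace_id`, `new_span_id`, `inject`, `extract`) are made up for illustration, and a real instrumentation library does this for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_trace_id() -> str:
    """Hypothetical helper: random 16-byte trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Hypothetical helper: random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read the caller's trace context from incoming request headers.

    Returns (trace_id, parent_span_id, sampled), or None if the header
    is missing or malformed.
    """
    match = _TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    _version, trace_id, parent_span_id, flags = match.groups()
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the "sampled" flag
    return trace_id, parent_span_id, sampled
```

The calling service would run `inject(headers, trace_id, span_id)` before sending a request, and the receiving service would run `extract(request_headers)` and reuse the same trace ID for its own spans. That shared ID is what lets a backend stitch spans from different services into a single trace.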
Jaeger + OpenTelemetry let us see latency down to the function call level, which really helped us understand where the latency issues were in our microservices. It collected timings across all of the microservices, from the point of the user's request coming in to the response going back to the user, and displayed them as a single trace. Since this lets us view the performance of the whole system, it let us make engineering decisions at the system-wide level, rather than making performance improvements to one service when putting that effort somewhere else would've netted us a better outcome.
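As a rough illustration of what "latency down to the function call level" means, here is a toy in-process tracer using only the Python standard library. It is a sketch of the idea (nested, timed spans with parent links), not how OpenTelemetry or Jaeger are implemented; the `Span` and `Tracer` classes and the `handle_request` / `query_db` functions are hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]  # name of the enclosing span, None for the root
    start: float
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Tracer:
    """Toy tracer: records nested spans in the order they finish."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, parent, time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


def query_db(tracer: Tracer) -> None:
    # Hypothetical slow dependency.
    with tracer.span("query_db"):
        time.sleep(0.02)


def handle_request(tracer: Tracer) -> None:
    # Hypothetical request handler; the root span covers the whole request.
    with tracer.span("handle_request"):
        query_db(tracer)
        with tracer.span("render"):
            time.sleep(0.005)
```

After calling `handle_request(tracer)`, iterating over `tracer.finished` and printing each span's `name`, `parent`, and `duration_ms` shows that most of the request time sits in `query_db`. A real tracing backend shows the same breakdown, but across service boundaries and without hand-written timing code.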
Describe alternatives you've considered
There really aren't alternatives to OpenTelemetry. This is one of those weird times where somehow everyone seemed to just get behind one standard. There used to be OpenTracing, which was essentially superseded by OpenTelemetry. As for which OTel-compatible backend you want to pursue, I leave that up to you. I recommend Jaeger. The UI is easy to use, and it makes it easy to discover insights or inefficiencies in your system. Other options include SigNoz (it was cool, but they were aiming to sell it, and it was not built out enough to be better than Jaeger when I reviewed it in late 2022). DataDog also has their own flavor that makes instrumentation even easier, but it requires you to use DataDog (pricey).
Here are some links
https://www.jaegertracing.io/docs/1.60/ < That introduction has some screenshots that may help better explain what distributed tracing is if you don't already know.
https://medium.com/jaegertracing/towards-jaeger-v2-moar-opentelemetry-2f8239bee48e < v2 seems imminent, and completely gets rid of the original jaeger-specific datatypes in favor of using OTel datatypes natively. (Jaeger has been around for a long time, and the first open standard they coded to was the now defunct OpenTracing)