Performance powered by OTel

Milestone: Performance Powered by OTel

Problem Statement

At the time of Sentry's initial performance product development, OpenTelemetry was in the nascent stages of its lifecycle and was not yet optimized for our requirements. Nevertheless, we maintained similarities in our data models and paradigms with OpenTelemetry. Since then, OpenTelemetry has significantly matured, passed the test of time, and has been generally available (GA) for over a year. It now boasts an extensive ecosystem of integrations spanning multiple technologies, including databases, queues, and protocols.

This maturity means that now is the time for us to rework our Python Performance Monitoring to use OTel under the hood. This way, we can leverage all the functionality from the OTel ecosystem, and overall better align with the broader ecosystem.

Key goals of this project

Leverage OpenTelemetry instrumentation for Performance
Ensure a consistent user experience - it should not feel much different to users than the current SDK experience
Do not lose functionality - in places where our instrumentation may exceed what OTel currently offers, we want to either keep some of our own instrumentation, and/or in the medium term upstream improvements to the OTel instrumentations
TLDR: Users continue to use sentry_sdk.init() and everything just works™️.

Non-goals

For the initial work, we want to strive to optimize for easy setup & usage. Exposing OTel internals & providing more hooks etc. for users to manually add OTel stuff may come at a later point. The public API of the SDK will remain the same for the prototype.

Misc

Internally, this project is known as POTel (from performance powered by OTel).

References

parent issue: https://github.com/getsentry/team-sdks/issues/4
sibling NodeJS issue: https://github.com/getsentry/sentry-javascript/issues/8534

We need to check whether the following works (repurposed from https://github.com/getsentry/opentelemetry-demo/issues/12):

Product Features

What works out of the box? What doesn't? What works, but differently?

Note: See https://www.notion.so/sentry/SDKs-Performance-powered-by-OTel-POTEL-7f1900c5c1b04870bdc0f2cc8dd4d929?pvs=4#50f3fec4bef3402087ec035688329efa (internal link) for more in-depth description.

[ ] Error Tracking
- [ ] Issues
- [x] 'POST /test/12344?test=true'
  - returns an error - should be reflected
- [x] Stack trace and code link
- [x] Breadcrumbs
- [ ] Related transactions
- [x] Trace view
- :x: Performance Issues
[ ] Performance Monitoring
- [ ] Distributed tracing
- [ ] Transactions and Spans
- Naming, descriptions, etc
- Operation: http.server
  - [ ] parameterized transactions
  - GET /?test=true
  - POST /test/12344?test=true
  - [ ] http.method
  - [ ] related issues
- Operation: queue.task.____
- Operation: integration.http
- Operation: db
- [ ] Vitals
  - [ ] User Misery
  - [ ] Backend
- [ ] TwP
[ ] Release Management
- [ ] Adoption, Crash related features
[ ] Discover & Dashboards
[ ] Ecosystem & Workflow
- consider product integration dependent upon incoming data
- alerts for example
:x: Profiling
[ ] Replays

As we're looking into continuing with the project again, there's a few questions that have come up since we last worked on this.

What do we gain and what do we lose by using OTel and how do we bridge the gap?
- The quality of the spans produced and the data on them should be on par with the current Sentry experience, e.g., we have a lot of custom stuff on DB spans that are used in specialized parts of Sentry. This is potentially a huge topic that might require upstream PRs as well as changes on the product side.
- Similarly, we'll likely need to adopt a more strict Python and library version support schedule.
- Furthermore, we'll need to contribute instrumentation upstream if we have something that is not in OTel yet.
- We need to identify the gaps and think about how to bridge them.
How does the scopes change affect the implementation?
- The new scopes implementation is easier to map onto the OTel context propagation. However, it's still two separate context propagation systems. We need to figure out to what degree and how we can keep them in sync.
- Since there are no context lifecycle hooks, we'll probably need to store our scopes on the OTel context to enrich the resulting span later in the processor.
Error monitoring and tracing are currently quite tightly coupled in the SDK. What challenges does this pose?
How does direct OTLP span ingestion affect the implementation?
- Is the span processor still needed? (Might be, though not for creating Sentry span spans, but rather for augmenting the OTel spans with more Sentry data.)
How does span streaming affect the implementation?

Couple of things that came up while trying to test the feature gap. Documenting them here for future reference.

Getting Things to Install & Run

OTel provides a package for each instrumentation, so there's e.g. opentelemetry-instrumentation-flask, opentelemetry-instrumentation-django, etc. There's also a way to install all available instrumentation packages at once by installing opentelemetry-contrib-instrumentations. However, this is broken in recent versions due to packaging issues with one of the instrumentation packages involved. So if we want to use the latest versions of everything, we need to install all the individual packages manually.

And we do in principle want to install the newest versions of everything -- for instance, the psycopg instrumentation was only added in the last release.

Instrumenting

We have two basic options how to instrument programmatically.

We call the instrument() method of each of the Instrumentors (e.g., FlaskInstrumentor, DjangoInstrumentor, etc.). This has the advantage of being able to pass in optional arguments to tweak the instrumentation (e.g., adding DB spans for Django).
We use the OTel auto-discovery mechanism to figure out what should be instrumented. Unknown if there's a way to provide additional arguments to the individual instrumentation since by default, it'll call instrument() without any args.

Note that whatever way we choose, classes imported before sentry_sdk.init will still be left uninstrumented. The solution for that is to either require folks to import sentry_sdk and init it before anything else has even been imported, which is not great; or seeking out and post-patching the imported classes.

One idea to tackle the "classes imported before sentry_sdk.init will left uninstrumented":

What opentelementry-instrumentation does is setting the PYTHONPATH env variable to '.../lib/python3.10/site-packages/opentelemetry/instrumentation/auto_instrumentation:/home/anton/my_project' In .../lib/python3.10/site-packages/opentelemetry/instrumentation/auto_instrumentation there is a file called sitecustomize.py And when the Python interpreter starts it picks up this file and in this file the instrumentation is setup. This is a feature of Python (I did not know about): https://docs.python.org/3/library/site.html#module-site

Maybe we can do the same and put our own sentry_sdk.init in the sitecostumize.py? (Then we could people tell to set SENTRY_DSN env var and start their program and everything just works(tm).)

getsentry / sentry-python