getsentry / sentry-native

Sentry SDK for C, C++ and native applications.
MIT License
395 stars 166 forks source link

Feature Request: Provide explicit timings to the Performance Monitoring API #941

Open leops opened 8 months ago

leops commented 8 months ago

Exposing a lower-level API for manually creating transactions and spans with explicit timing values (instead of automatically reading the system clock when the spans are started / finished) would be useful to capture timing informations at a higher resolution than the default millisecond (which is often not precise enough for native code so microseconds / nanoseconds may be needed), or to build up a transaction from timing data coming from an external source (eg. performance counters from the GPU). I could easily open a PR implementing this, but it does raise the question of what the naming convention and general API design for these explicit transaction management functions should be.

supervacuus commented 8 months ago

Hi @leops, as far as I know, microseconds are the current limit in the backend for performance/tracing events. Anything smaller than this would be truncated (or rounded), so opening the API to arbitrary scales doesn't make much sense (without considerable changes in the backend, and that discussion would have to happen in another issue tracker). Changing our timestamping to microseconds might be sensible, though.

I am unsure whether the tracing API is the proper interface for instrumentation at that scale level because we can gather a maximum of 1000 spans before a transaction must flush. Transaction flushing is a costly operation (the transport happens in a backend thread, but preparing the envelope for sending happens in the calling thread and includes serialization, something you might not want to have in a tight loop).

This sounds like the perfect use case for developer metrics, where you have client-side aggregation of arbitrary measurements (you can use it for profiling code at that low level) over time. This is implemented only in the backend and in the Python SDK (?) in a very early phase and only rolled out to some customers. However, a discussion regarding the needs of Native SDK users might already make sense.

CC: @kahest.

leops commented 8 months ago

Setting the resolution limit to microseconds sounds sensible, I think collecting timings at nanosecond precision is bound to have a lot of measurement noise for one-shot captures anyway. The cost of transaction flushing doesn't sound like that much of a problem though as having an explicit API would allow for the envelope to be built up asynchronously in a background thread (this would necessarily be the case for collecting GPU timings for instance, with timestamp queries being processed asynchronously on the CPU between frames), but I agree this kind of high-resolution transaction should still avoid getting anywhere near the 1000 spans limits and only keep track of operations at a macroscopic level within that scope since performing too many measurements would once again add a lot of noise / overhead. I think the metrics feature is potentially interesting, as would profiling but there's still use case where you want to have a clear hierarchical view of what's going on when a given operation is being slower than expected, and profiling isn't supported in the Native SDK (and once again even less so on GPUs)

supervacuus commented 8 months ago

Hi @leops. That makes a lot of sense. So, to quickly summarize:

Do you think this makes sense? I am asking to understand what kind of interface would be sufficient.

leops commented 7 months ago

Yes I think this matches what I had in mind