Best practices/contract to trace exceptions

lmolkova commented 6 years ago

Exception/error tracking is essential for debugging. What is the recommended way to report an exception or error with OpenCensus?

From the library instrumentation perspective, it's natural to report exceptions along with the span, i.e. not through a logging library, since that would require introducing additional dependencies.

It's likely that some exception info is set on the Status, but there might be retriable exceptions that do not affect the operation result. It's also essential to report the stack trace (and a stack trace on the span is not really supported/is being deprecated).

So, assuming reporting exceptions via OpenCensus is legit, can we come up with a best practice/contract for it, so that all vendors can rely on it to build error analysis UX, etc.?

This contract may be implemented as a guidance or a new API:

1. Represent exception with an annotation on the span with specific attributes:
   - severity: enum Critical/Error/Warning/Info
   - stackTrace: string
   - message in the description
2. Introduce a new API: list of exceptions (time events) on the Span with message and stack trace.

We could start with the 1st approach, and if we find that everyone needs and actively uses it, we can implement the 2nd.
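As a rough illustration of the 1st approach, here is a minimal Python sketch. `Span`, `Annotation`, `Severity` and `record_exception` are hypothetical stand-ins defined only for this example, not the OpenCensus API. The attribute names `severity` and `stackTrace` follow the proposal above.

```python
import time
import traceback
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    ERROR = "Error"
    WARNING = "Warning"
    INFO = "Info"


@dataclass
class Annotation:
    description: str
    attributes: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span; not the OpenCensus class."""
    name: str
    annotations: list = field(default_factory=list)

    def add_annotation(self, description, **attributes):
        self.annotations.append(Annotation(description, attributes))


def record_exception(span, exc, severity=Severity.ERROR):
    """Attach an exception to the span as an annotation (approach 1)."""
    stack = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    # Message goes in the description; severity and stack trace are attributes.
    span.add_annotation(str(exc), severity=severity.value, stackTrace=stack)


span = Span("fetch-user")
try:
    raise TimeoutError("upstream timed out")
except TimeoutError as exc:
    # A retriable failure: recorded on the span without touching its status.
    record_exception(span, exc, Severity.WARNING)
```

Because each exception is its own timestamped annotation, a span can carry several of them (for example one per retry), which a single Status cannot.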

@bogdandrutu @adriancole @ramonza @SergeyKanzhelev Thoughts?

codefromthecrypt commented 6 years ago

At least in Java, a stack trace is expensive to obtain. Also, different backends accept different data, so adding that overhead might be in vain. The best I can muster is to pass the exception to the backend. Ironically, I had a Twitter exchange on this very topic yesterday with Raphael, who is one of the smartest Java people I know: https://twitter.com/adrianfcole/status/1024900302967717889

lmolkova commented 6 years ago

@adriancole thanks for the info! I was thinking about holding string stack trace rather than a reference to the exception object. We will be serializing exceptions anyway for ocd and for the backend that has to support language-agnostic data.

codefromthecrypt commented 6 years ago

> I was thinking about holding string stack trace rather than a reference to the exception object. We will be serializing exceptions anyway for ocd and for the backend that has to support language-agnostic data.

If a hash can be computed without generating a stack trace string, all the copies of that stack trace would be redundant. I think the main thing is benchmarking, but even if the benchmarks are OK, note that not every consumer will want to store a stack trace, or be able to use the string form that we might decide on.
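To make the hashing idea concrete, here is a small Python sketch that fingerprints an exception from its raw traceback frames instead of the formatted trace. The `stack_fingerprint` helper and the choice of SHA-1 are assumptions made for illustration. The cost concern above was raised about Java, so this shows only the shape of the idea, not its performance there.

```python
import hashlib
import traceback


def stack_fingerprint(exc):
    """Hash (file, function, line) of each frame; the trace is never formatted."""
    digest = hashlib.sha1(type(exc).__name__.encode())
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        code = frame.f_code
        digest.update(f"{code.co_filename}:{code.co_name}:{lineno}".encode())
    return digest.hexdigest()


def fail():
    raise ValueError("bad input")


fingerprints = []
for _ in range(2):
    try:
        fail()
    except ValueError as exc:
        # Same call path both times, so both fingerprints should match.
        fingerprints.append(stack_fingerprint(exc))
```

A consumer could then store the full stack trace once per fingerprint and attach only the fingerprint to each span.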

codefromthecrypt commented 6 years ago

So one way to proceed could be to survey the tracing (and metrics!) consumers of exceptions: how they process and store things today, and whether any would be able to consume a format we might want. Simultaneously, benchmark the proposed form with stack traces of various lengths.

codefromthecrypt commented 6 years ago

(and logging, and whatever else Census wants to have APIs for)

rakyll commented 6 years ago

One of the disadvantages of doing this in the trace library is that trace visualization tools don't know how to properly visualize or analyze stack traces. The other, shorter-term problem is whether we could break this once OpenCensus has logging support; logging is where stack traces fit better.

lmolkova commented 6 years ago

@rakyll I agree that logging is the best place for exceptions/stack traces in the long term. I just thought there was no agreement on supporting it. So is there a plan to support logging in OpenCensus?

bogdandrutu commented 6 years ago

I was thinking about this and looked at different possibilities:

1. Record it in Span.status. This has the downside that there is only one status per span and it does not have a timestamp. I still think our Status class should have a fromException method that generates a status from the exception, because it is useful when the code decides to give up and return.
   a. This does not solve the initial request in this issue, but I think it is nice to have.
2. Have a new time event, Error/Exception (or encode it into an annotation). I think this satisfies all the initial requirements in this issue. This has the downside that errors are reported only when the Span is recorded and exported.
3. Have an error-reporting API separate from trace. We can correlate the errors with the current Span (similar to how we can do log correlation). This allows error reporting to be sampled using different strategies (e.g. always record them). For example, Stackdriver has a separate service for error reporting, see https://cloud.google.com/error-reporting. Different vendors can decide to record these into Spans, Logs, or error-reporting services (like Stackdriver).

I tried to dump all my thoughts on this subject. I kind of like option 3, but would like to hear from others about this.

bogdandrutu commented 6 years ago

/cc @mtwo

census-instrumentation / opencensus-specs

Best practices/contract to trace exceptions #154