Open semistrict opened 6 years ago
I would suggest instead always tracing, and having "sampling" determine only whether the span is reported/stored. Assuming the language/framework's span creation is lightweight, it makes sense to simply create the span and, if certain conditions are met (like it took over X ms to complete or an exception was thrown), sample/report the span.
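A minimal sketch of that idea, in Python pseudocode (the `Span` class, the `exporter` list, and the threshold constant are all illustrative, not part of any OpenCensus API): every span is created, but the export decision is deferred until the span finishes.

```python
import time

LATENCY_THRESHOLD_MS = 100  # hypothetical "X ms" threshold; tune per service


class Span:
    """Always created; whether it is reported is decided only at finish time."""

    def __init__(self, name):
        self.name = name
        self.start = time.monotonic()
        self.error = False
        self.exported = False

    def finish(self, exporter):
        elapsed_ms = (time.monotonic() - self.start) * 1000
        # Deferred sampling decision: report only if the span was slow
        # or recorded an error.
        if self.error or elapsed_ms > LATENCY_THRESHOLD_MS:
            exporter.append(self)
            self.exported = True


exported = []

ok = Span("fast-call")
ok.finish(exported)       # fast and error-free: dropped

bad = Span("failed-call")
bad.error = True
bad.finish(exported)      # errored: reported even though it was fast
```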
We plan to add this to opencensus-erlang
and it would be nice to be a part of the spec in some way.
This only works on the Span in the current process. If you really want to get the whole trace, you need some way to tell your caller to start tracing as well. For gRPC, I can see this fitting in response metadata (AFAIK these are internally HTTP trailing headers, so those might work for HTTP too, although I don't know how well that would be supported by HTTP libraries).
Yea, not sure how well that'd work, but sounds possible.
Another option would be for spans to be pulled instead of pushed: something collects spans, and if a span's trace is enabled later, the collector requests it (along with any other spans a process wants to report). If that makes sense.. basically like Prometheus, but where the pull request could also include trace IDs.
For a description of exemplars, see: https://www.youtube.com/watch?v=U72b4Nl0Ftw
Internally in the team we decided to have @g-easy and @sebright as owners of this feature. Expect a design proposal soon.
@tsloughter Tom Wilkie had a go at a Prometheus-inspired pull-based tracing. I've not tried it; I got the impression he didn't entirely convince himself that it was a good approach. https://www.weave.works/blog/distributed-tracing-loki-zipkin-prometheus-mashup/
It seems reasonable to limit this to http2/grpc. You could also, say, have thresholds on latency/response code that would guarantee collection (with some upper bound, I guess)?
re: pull-based tracing
How would the puller know the correct transitive closure of nodes to pull spans from? That closure can be wildly different per trace.
Also, the implications are that the entire network of nodes need to store all spans for some predetermined period. That predetermined period needs to be the same across the realm of nodes being traced. That seems untenable.
Any design for retroactively exporting interesting spans will I think require nodes to retain spans somewhere long enough for the sampling decision to be made.
Luckily, it's all best-effort so we can have a fixed-size buffer per node and just store as much as possible within that fixed size. Or, we could sample just at a much higher rate.
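A fixed-size, best-effort per-node buffer like that could be sketched as follows (the `SpanBuffer` class and the `(trace_id, name)` span tuples are illustrative assumptions, not an existing API): the oldest spans are evicted first, and spans are retroactively exported only if their trace ID was later marked interesting.

```python
from collections import deque


class SpanBuffer:
    """Best-effort fixed-size buffer: when full, the oldest spans are evicted."""

    def __init__(self, max_spans):
        self.spans = deque(maxlen=max_spans)

    def add(self, span):
        self.spans.append(span)

    def export_matching(self, trace_ids):
        # Retroactive export: return only spans whose trace was later
        # flagged as interesting; everything else ages out silently.
        return [s for s in self.spans if s[0] in trace_ids]


buf = SpanBuffer(max_spans=3)
for span in [("t1", "a"), ("t2", "b"), ("t3", "c"), ("t4", "d")]:
    buf.add(span)

# ("t1", "a") was evicted when the fourth span arrived, so even though
# t1 is requested, only t4 can still be exported.
hits = buf.export_matching({"t1", "t4"})
```

Because the buffer is bounded, the retention window is whatever the node's traffic rate allows, which is exactly the best-effort trade-off described above.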
I have been thinking about a way of doing this that leverages existing central storage systems. Instead of moving to a pull-based approach, store somewhere central (e.g. memcached, redis, a database) a set of bloom filters for interesting traces, one per 10-second interval for the last 120 seconds (for example).
When you want to mark a trace as "interesting" and to be exported (for example, if an error occurs) you add the trace ID to the active bloom filter.
All nodes periodically read the bloom filters and export any spans with matching trace IDs.
Old bloom filters can just be dropped. A new bloom filter should be created as the active one for each new 10s period.
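The windowed-bloom-filter scheme above could look roughly like this (a toy in-process sketch; a real deployment would keep the filters in the shared store and tune the filter size and hash count, and all names here are assumptions for illustration):

```python
import hashlib

NUM_WINDOWS = 12  # 12 x 10s windows = the last 120 seconds, per the example


class BloomFilter:
    """Tiny bloom filter; membership tests can false-positive but never
    false-negative, which is fine for best-effort trace export."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))


# One filter per 10s window, newest last; old windows are simply dropped.
windows = [BloomFilter() for _ in range(NUM_WINDOWS)]


def mark_interesting(trace_id):
    windows[-1].add(trace_id)  # add to the active (newest) filter


def is_interesting(trace_id):
    # Nodes periodically read all windows and export matching spans.
    return any(trace_id in w for w in windows)


def rotate():
    # Called every 10 seconds: drop the oldest window, start a fresh one.
    windows.pop(0)
    windows.append(BloomFilter())


mark_interesting("trace-abc")
```

Nodes only ever append to the active filter and read the full set, so the central store sees small, bounded objects rather than the spans themselves.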
Through trace sampling, we might miss important traces that don't occur very frequently, for example traces leading to error conditions or high latency.
We should provide a facility for starting tracing later during request processing when we detect an error or other interesting condition. We should rate limit this at the source to avoid cascading failure.
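One way to sketch that source-side rate limit is a token bucket in front of the "promote this request to traced" decision (the `TokenBucket` class and its parameters are illustrative assumptions, not a proposed API):

```python
import time


class TokenBucket:
    """Caps how many requests per second a node may retroactively promote
    to full tracing, so an error storm cannot become a tracing storm."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens at the configured rate, up to the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate_per_sec=1, burst=2)

# Five errors arrive in quick succession: only the first two (the burst)
# get promoted to traced; the rest are dropped at the source.
decisions = [limiter.allow() for _ in range(5)]
```

Dropping the decision at the source, before any spans are exported, is what prevents the cascading-failure scenario mentioned above.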