census-instrumentation / opencensus-go

A stats collection and distributed tracing framework
http://opencensus.io
Apache License 2.0

Show in-process child spans in zpages #782

Open mwuertinger opened 6 years ago

mwuertinger commented 6 years ago

I'm using OpenCensus in a Golang HTTP server with the trace.ProbabilitySampler. As far as I understand, the sampling decision is made before request processing starts, so it is currently impossible to influence the decision based on properties of the request outcome (e.g. latency, status code). As I was told on Gitter, this is currently by design, as a tracing decision is usually made in the outermost service and then passed on to all the downstream services.

However, it would be helpful if one could influence the tracing decision within an application even after the request has started, in order to collect at least partial information about slow requests.

Is there anything planned in that regard?

semistrict commented 6 years ago

Are you using z-pages? One of the features of the tracez page (although it's pretty rudimentary atm) is that you can see some basic info for all spans in a particular latency bucket (even those that were not sampled).

mwuertinger commented 6 years ago

Didn't know that. Will have a look. Thanks :)

semistrict commented 6 years ago

@mwuertinger what kinds of partial information would be useful?

I can think of a way we could pretty easily retain all child spans of a parent span that had high latency and/or errors. Using this, we could display the parent span representing the HTTP/gRPC request and then all the child spans representing e.g. database calls to service that request. It would only contain spans from the same process, not the full distributed trace. Is this what you had in mind?

vaijab commented 6 years ago

I just wonder how useful sampling based on worst-case scenarios is. Wouldn't you lose track of what normal looks like?

mwuertinger commented 6 years ago

@ramonza From what I understand, this is the best we can do at the moment. I think it would help in certain situations, but I also have to agree with @vaijab that this would distort the overall picture of service health and might lead to the wrong conclusions. It's probably best to leave decisions like that to the individual teams.

I don't think there is public information available, but I heard some time ago that Google's internal tracing system makes much smarter sampling decisions. Does anybody know more about that?

g-easy commented 6 years ago

Yes, doing this would distort the overall picture because the sampling would no longer be uniform. We can mitigate that by annotating traces as "sampled uniformly" versus "sampled because something interesting happened."

The advantage of being able to get traces of slow operations is being able to debug why they were slow.

semistrict commented 6 years ago

I agree that it's important to know whether something was sampled uniformly or not. This argument also applies to cases where tracing is explicitly requested from the client. Perhaps we should add an attribute that indicates the sampling policy & weight?

We do already have (in tracez) a way to see spans in each latency bucket. I think adding in-process child spans there might make that a lot more useful. They wouldn't be stored anywhere so wouldn't affect how representative the stored sample is.

mwuertinger commented 6 years ago

@ramonza Adding in-process child spans to tracez sounds like an excellent idea.

semistrict commented 6 years ago

repurposing this issue