tbg closed this issue 11 months ago.
Hi @tbg, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.
While you're here, please consider adding an A- label to help keep our repository tidy.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Hey @tbg - thanks for bringing this up. I'd be happy to spend some time expanding the existing doc to elaborate more on interpreting text-based traces instead of only using Jaeger.
Just curious - is there anything else you think the document is missing? One idea is to have the 1st case study from the doc use the Jaeger format (as it is today), and then add a 2nd case study where we use the text format. Are there specific things beyond just the format of the trace that you'd like to see, though?
Thanks Alex! The main thing that came out of this last escalation is that folks are having a hard time understanding what the duration of an event was (the timestamp next to an event is, I think, the duration since the last event). I'm sure there's more, maybe @erikgrinaker has something concrete to add.
I think it might also be worth taking a step back. Should anyone use the text-based format? What should it be used for? Should we (obs-inf or sql-obs) work towards using the text based format less? For example, the UX in https://github.com/cockroachlabs/support/issues/1585#issuecomment-1129328320 where we dump the text based process in the logs seems like pretty bad UX, I wouldn't want to look at a trace like that; why isn't this logging the link to a jaeger dump on one of the stores' directories. (This would probably fall into SQL-obs).
And for the documentation/training that we no doubt need, how do we put it in front of the folks who should see it? Maybe it is enough if KV (who usually gets asked to look at traces) in non-urgent situations gently redirects to this doc and to the obs-inf team as a resource to help folks through "the basics", but this will always leave at least a slightly sour taste. Maybe TSEs haven't picked up this doc yet though, and could be made more aware of it (i.e. link to it from their training resources). Maybe the doc doesn't show up in Glean for some reason and can be SEO'ed a bit. Maybe the doc should be linked from all traces that folks generally obtain (for example, from the README.txt in the statement bundle).
These are all just ideas; ideally obs-inf will think a bit about what should be done soon vs. later vs. never and take appropriate action (cc @nkodali). For KV, hopefully we get "out of the business" of teaching tracing on escalations. This doesn't happen too frequently, btw, but it does happen.
> Thanks Alex! The main thing that came out of this last escalation is that folks are having a hard time understanding what the duration of an event was (the timestamp next to an event is, I think, the duration since the last event). I'm sure there's more, maybe @erikgrinaker has something concrete to add.
Yeah, the text trace doesn't make it clear that it's the duration between events, not the duration for a single event. I see this tripping readers up again and again. Take this example (I've chopped off the start for brevity):
```
0.109ms 0.029ms event:sql/catalog/catalogkv/namespace.go:184 [n1,client=127.0.0.1:26342,hostnossl,user=root] looking up descriptor ID for name key /NamespaceTable/30/1/52/29/"journal"/4/1
0.115ms 0.006ms === operation:txn coordinator send _verbose:1 client:127.0.0.1:26342 hostnossl: node:1 txnID:88765b9e-d139-492b-a9f0-f471b5a8779c user:root
0.125ms 0.010ms === operation:dist sender send _verbose:1 client:127.0.0.1:26342 hostnossl: node:1 txn:88765b9e txnID:88765b9e-d139-492b-a9f0-f471b5a8779c user:root
0.145ms 0.020ms event:kv/kvclient/kvcoord/range_iter.go:175 [n1,client=127.0.0.1:26342,hostnossl,user=root,txn=88765b9e] querying next range at /NamespaceTable/30/1/52/29/"journal"/4/1
0.276ms 0.131ms event:kv/kvclient/kvcoord/dist_sender.go:1867 [n1,client=127.0.0.1:26342,hostnossl,user=root,txn=88765b9e] r26: sending batch 1 Get to (n10,s10):4
0.288ms 0.012ms event:rpc/nodedialer/nodedialer.go:132 [n1,client=127.0.0.1:26342,hostnossl,user=root,txn=88765b9e] sending request to 10.32.49.174:7102
0.295ms 0.007ms === operation:/cockroach.roachpb.Internal/Batch _verbose:1 client:127.0.0.1:26342 component:gRPC hostnossl: node:1 span.kind:client txn:88765b9e txnID:88765b9e-d139-492b-a9f0-f471b5a8779c user:root
33.579ms 33.284ms === operation:/cockroach.roachpb.Internal/Batch _unfinished:1 _verbose:1 component:gRPC node:10 span.kind:server txnid:88765b9e-d139-492b-a9f0-f471b5a8779c
33.603ms 0.024ms event:server/node.go:870 [n10] node received request: 1 Get
33.645ms 0.042ms event:kv/kvserver/store_send.go:147 [n10,s10] executing Get [/NamespaceTable/30/1/52/29/"journal"/4/1,/Min), [txn: 88765b9e]
33.661ms 0.016ms event:kv/kvserver/replica_send.go:92 [n10,s10,r26/4:/NamespaceTable/{30-Max}] read-only path
33.676ms 0.015ms event:kv/kvserver/concurrency/concurrency_manager.go:122 [n10,s10,r26/4:/NamespaceTable/{30-Max}] sequencing request
33.685ms 0.009ms event:kv/kvserver/concurrency/concurrency_manager.go:158 [n10,s10,r26/4:/NamespaceTable/{30-Max}] acquiring latches
33.699ms 0.014ms event:kv/kvserver/concurrency/concurrency_manager.go:170 [n10,s10,r26/4:/NamespaceTable/{30-Max}] scanning lock table for conflicting locks
33.716ms 0.017ms event:kv/kvserver/replica_read.go:139 [n10,s10,r26/4:/NamespaceTable/{30-Max}] executing read-only batch
33.806ms 0.090ms event:kv/kvserver/replica_read.go:123 [n10,s10,r26/4:/NamespaceTable/{30-Max}] read completed
66.621ms 66.505ms event:kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:593 [n1,client=127.0.0.1:26342,hostnossl,user=root,txn=88765b9e] recording span to refresh: /NamespaceTable/30/1/52/29/"journal"/4/1
66.655ms 66.546ms event:kv/txn.go:850 [n1,client=127.0.0.1:26342,hostnossl,user=root] client.Txn did AutoCommit. err: <nil>
475.469ms 408.814ms event:sql/plan_opt.go:339 [n1,client=127.0.0.1:26342,hostnossl,user=root] query cache hit (prepare)
```
In particular, note these events:
```
66.621ms 66.505ms event:kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:593 [n1,client=127.0.0.1:26342,hostnossl,user=root,txn=88765b9e] recording span to refresh: /NamespaceTable/30/1/52/29/"journal"/4/1
66.655ms 66.546ms event:kv/txn.go:850 [n1,client=127.0.0.1:26342,hostnossl,user=root] client.Txn did AutoCommit. err: <nil>
475.469ms 408.814ms event:sql/plan_opt.go:339 [n1,client=127.0.0.1:26342,hostnossl,user=root] query cache hit (prepare)
```
This often leaves readers wondering "why are we spending 66.505 ms to refresh spans, and why does a cache lookup take 408.814 ms?". But that's not what this trace means at all.
`recording span to refresh` is a fairly trivial event which just records the key spans of the response in memory, so it takes essentially no time at all. However, it's typically emitted last during `TxnCoordSender` processing, and the duration next to it records the time from the previous event at that level (`operation:txn coordinator send`). So 66.505ms is the time it took to send the requests and wait for a response, not the time it took to record the span to refresh.
Similarly, the query cache lookup did not take 408.814 ms. It took 408.814 ms from the AutoCommit event to the query cache event, and the system could have been doing all sorts of things in the meanwhile -- basically all of the code between `kv/txn.go:850` and `sql/plan_opt.go:339`, which isn't instrumented to emit trace events.
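To make the relationship between the two columns concrete, here is a minimal sketch of how the delta column is derived. This is an illustrative reconstruction, not CockroachDB's actual tracing code; the `Event` type and the numbers in `main` are assumptions for the example.

```go
package main

import "fmt"

// Event is a simplified stand-in for one line of a text trace.
// The real CockroachDB tracing structures are more involved; this
// type exists only to illustrate how the columns relate.
type Event struct {
	ElapsedMs float64 // first column: time since the trace started
	Msg       string
}

// deltas reproduces the second column: for each event, the gap
// since the previous event in the same span -- NOT the cost of
// the event on that line.
func deltas(events []Event) []float64 {
	out := make([]float64, len(events))
	prev := 0.0 // the span is assumed to start at 0ms here
	for i, ev := range events {
		out[i] = ev.ElapsedMs - prev
		prev = ev.ElapsedMs
	}
	return out
}

func main() {
	// A cheap event logged right after an expensive wait inherits
	// the whole wait as its "duration".
	evs := []Event{
		{0.116, "sending request"},
		{66.621, "recording span to refresh"}, // trivial event, large delta
	}
	ds := deltas(evs)
	for i, ev := range evs {
		fmt.Printf("%8.3fms %8.3fms %s\n", ev.ElapsedMs, ds[i], ev.Msg)
	}
}
```

The point of the sketch is that the delta attached to an event is purely a subtraction of timestamps; nothing in the format ties it to the work the event describes.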
I don't know how to make this more clear in the text format, but I think we need to do something about it because it keeps coming up again and again.
To be clear, I think the primary gain here would come from making the text trace more understandable (in particular the meaning of the durations). Alternatively, we should slap a big fat comment at the start of the trace.txt file in statement bundles that explains this.
would something like this help?
```
66.621ms ⤷ 66.505ms event:kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:593 [n1,client=127.0.0.1:26342,hostnossl,user=root,txn=88765b9e] recording span to refresh: /NamespaceTable/30/1/52/29/"journal"/4/1
```
But I think there is also an issue that this duration is relative to the previous event in the same trace span (?!) which may not be the preceding line. (Or something like that, I just remember it being more complicated yet).
> would something like this help?
Yeah, we could likely do something simple like that. We'd have to try it out on a few people who have never seen a trace before and ask them what it means.
> But I think there is also an issue that this duration is relative to the previous event in the same trace span (?!) which may not be the preceding line. (Or something like that, I just remember it being more complicated yet).
Yes, it's the previous event in the same span (so in the text version, the previous event at the same indentation level). Notice how the 33.284ms batch response is included in the duration shown for the `recording span to refresh` event, because it was indented below the previous event.
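That rule can be sketched as follows: track the last timestamp seen at each span, so the gap between two sibling events silently absorbs any child spans that ran in between. Approximating "span" by indentation depth, and assuming each span starts at 0ms, are simplifications for illustration; this is not the real implementation.

```go
package main

import "fmt"

// traceLine is an illustrative stand-in for one parsed trace line:
// depth approximates which span the event belongs to (deeper
// indentation = child span); elapsedMs is the first trace column.
type traceLine struct {
	depth     int
	elapsedMs float64
	msg       string
}

// spanDeltas computes the second trace column: the gap since the
// previous line at the SAME depth, not the immediately preceding
// line. The first event at a depth is measured from the span
// start, assumed here to be 0ms for simplicity.
func spanDeltas(lines []traceLine) []float64 {
	lastAt := map[int]float64{}
	out := make([]float64, len(lines))
	for i, l := range lines {
		out[i] = l.elapsedMs - lastAt[l.depth]
		lastAt[l.depth] = l.elapsedMs
	}
	return out
}

func main() {
	lines := []traceLine{
		{0, 0.295, "gRPC client send"},
		{1, 33.579, "gRPC server span (child)"},
		{0, 33.584, "response received"},
	}
	ds := spanDeltas(lines)
	for i, l := range lines {
		fmt.Printf("%8.3fms %8.3fms %s\n", l.elapsedMs, ds[i], l.msg)
	}
	// The delta on "response received" covers everything since the
	// client send, including the child span's entire runtime, even
	// though the event itself was instantaneous.
}
```

This is exactly why the preceding line is the wrong reference point: the child span's events sit between two siblings, but only the sibling-to-sibling subtraction is printed.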
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Is your feature request related to a problem? Please describe.
Traces, and in particular their textual representation, are used during support escalations. However, many members of our support org and eng teams are not trained at interpreting traces. In such cases, our default fallback is to loop in the KV team.
This is not ideal - we should teach everyone who needs to look at traces the basics of interpreting them. (This may still mean that the KV team will be asked to reason about something that happened in their domain; just that they will interface with someone who is reasonably educated on our general tracing infrastructure, and thus the composition of traces.)
Describe the solution you'd like
We already have this (internal) draft by @abarganier, which is a great resource to get started with.
It would likely need some more work to explain the text trace format (though I am of the opinion that we should always be using Jaeger), and we need to figure out how to get people to actually read that document, and how obs-inf will know when it needs to be extended.
Describe alternatives you've considered
Additional context
Jira issue: CRDB-16059