
sql: collect additional runtime query statistics #19476

Closed · dianasaur323 closed this 6 years ago

dianasaur323 commented 7 years ago

This is a stepping stone towards more reporting and query analysis tooling, but for now it would also provide a lot of value in helping us debug the performance of queries. We need to set the stage for better debugging tools going into 2.1.

The information below will be added to SHOW TRACE.
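For context, here is a minimal sketch of how a client can already pull trace messages today, assuming a local insecure cluster and the lib/pq driver; the new statistics would surface as additional messages in this output:

```go
// Sketch: trace one statement and read the trace back through SQL.
// The connection string and the SELECT-from-[SHOW ...] form are
// illustrative assumptions about the local setup.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Record a trace for one statement, then turn tracing back off.
	for _, stmt := range []string{
		`SET tracing = on`,
		`SELECT count(*) FROM information_schema.tables`,
		`SET tracing = off`,
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}

	// Read the recorded trace messages; new runtime stats would show
	// up as additional rows here.
	rows, err := db.Query(`SELECT message FROM [SHOW TRACE FOR SESSION]`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var msg string
		if err := rows.Scan(&msg); err != nil {
			log.Fatal(err)
		}
		fmt.Println(msg)
	}
}
```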

cc @RaduBerinde @andreimatei @knz @vivekmenezes

dianasaur323 commented 7 years ago

@knz things like the number of rows, total processing time, memory used, and disk space are possible. If any of these stats are hard to collect, let's punt.

If at all possible, I would also love to capture time spent waiting for a given query.

cc @asubiotto

vivekmenezes commented 6 years ago

I believe we want to return these statistics in SHOW TRACE.

RaduBerinde commented 6 years ago

Returning the statistics in SHOW TRACE is a good start. We could also consider adding a SHOW QUERY STATISTICS, a version of SHOW TRACE that prints out just the stats.

In terms of prioritization, I'd say the most important is the number of rows (and/or bytes) moved to external storage, followed by memory used and the amount of data processed, and then the total and stall time.

andreimatei commented 6 years ago

I think this list of information to be collected is good. I would add:

I'd clarify what we want out of "total time spent on each processor". We already have the time when a processor started running and when it finished - it corresponds to the processor's span. But that's not very interesting; most processors live as long as the whole query. What would be very interesting, I think, is some measure of how much CPU each one used. That may be hard to collect, though; I suspect we'd need some sampling of stacks. Similarly with "stall time" - I'm not sure how we'd measure it cheaply. Perhaps for stalls we should start by instrumenting the lowest level of SQL - i.e. the kvFetcher - and explicitly measure its ScanRequests; we could then aggregate those stats just for the TableReader processor, which seems more tractable to me (see the sketch below).
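To make the kvFetcher idea concrete, here is a minimal sketch of that kind of instrumentation; the names (scanStats, kvFetcher, issueScan) are hypothetical stand-ins, not CockroachDB's actual internals:

```go
// Sketch: wrap the call that issues the ScanRequest and accumulate
// explicit stall time, batch counts, and row counts per fetcher.
package fetcher

import (
	"context"
	"time"
)

// scanStats accumulates per-fetcher measurements that a TableReader
// could later aggregate and emit.
type scanStats struct {
	batches   int           // number of ScanRequest batches issued
	rows      int64         // rows returned across all batches
	stallTime time.Duration // wall time spent blocked on the KV layer
}

type kvFetcher struct {
	stats scanStats
	// ... underlying KV client, spans to scan, etc.
}

// nextBatch issues one ScanRequest and measures how long we were
// blocked waiting on KV.
func (f *kvFetcher) nextBatch(ctx context.Context) (rows int64, err error) {
	start := time.Now()
	rows, err = f.issueScan(ctx) // hypothetical: performs the actual ScanRequest
	f.stats.stallTime += time.Since(start)
	f.stats.batches++
	f.stats.rows += rows
	return rows, err
}

// issueScan is a placeholder for the real KV round trip.
func (f *kvFetcher) issueScan(ctx context.Context) (int64, error) {
	return 0, nil
}
```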

As for presentation, I agree that SHOW TRACE is a good start; namely, I think we should log these stats as regular log messages, and they'll make it into the trace that way. Separately, we should also output them as distsql metadata, even if initially we don't do anything with that metadata.
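A rough sketch of those two output paths, using stand-in types: producerMetadata and its Stats field are assumptions for illustration, and eventf stands in for a span-aware logger (like CockroachDB's log.VEventf), whose messages would end up in the trace and hence in SHOW TRACE:

```go
package distsqlsketch

import (
	"context"
	"fmt"
	"time"
)

// scanStats mirrors the per-fetcher measurements sketched earlier.
type scanStats struct {
	Batches   int
	Rows      int64
	StallTime time.Duration
}

// producerMetadata stands in for distsql's metadata record; the Stats
// field is an assumption, not an existing field.
type producerMetadata struct {
	Stats *scanStats
}

// eventf stands in for a span-aware logger; in real code the message
// would be recorded on the current tracing span.
func eventf(ctx context.Context, format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

// emitStats shows the two output paths: log the stats into the trace,
// and push them downstream as flow metadata even if no consumer reads
// them yet.
func emitStats(ctx context.Context, s scanStats, push func(producerMetadata)) {
	eventf(ctx, "reader stats: batches=%d rows=%d stall=%s",
		s.Batches, s.Rows, s.StallTime)
	push(producerMetadata{Stats: &s})
}
```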

knz commented 6 years ago

@asubiotto, to answer your offline question about how to structure this information and eventually present it:

The place we eventually need to get to is attaching each of these metrics to the specific stage in the query plan that it measures. At a minimum we'll want to present an EXPLAIN(STATISTICS) that looks and feels like pg's, and we'll want to annotate the distsql plan with the collected execution statistics.

Now the question is really how to get there.

What I have in mind is the following:

The heart of the "presentation" stage here is an algorithm that scans/parses/analyzes a trace and constructs a dictionary from plan node ID to stats. Initially (in scope for you) we'll showcase stats collection with EXPLAIN, but soon after that we'll also collect this information in memory so as to show it in the admin UI, etc. (That's out of scope for you for now.)
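A minimal sketch of that trace-scanning algorithm, assuming a made-up message format ("stats: node=<id> rows=<n> stall=<nanos>") for the stat lines that processors would log:

```go
package traceanalysis

import (
	"regexp"
	"strconv"
)

// nodeStats holds the per-plan-node numbers recovered from the trace.
type nodeStats struct {
	Rows       int64
	StallNanos int64
}

// statLine matches the assumed message format, e.g.
// "stats: node=3 rows=1024 stall=51200".
var statLine = regexp.MustCompile(`stats: node=(\d+) rows=(\d+) stall=(\d+)`)

// collectStats scans trace messages and builds the plan-node-ID-to-stats
// dictionary that EXPLAIN (or, later, the admin UI) would render.
func collectStats(messages []string) map[int]nodeStats {
	byNode := make(map[int]nodeStats)
	for _, msg := range messages {
		m := statLine.FindStringSubmatch(msg)
		if m == nil {
			continue // not a stat line; skip regular trace messages
		}
		id, _ := strconv.Atoi(m[1])
		rows, _ := strconv.ParseInt(m[2], 10, 64)
		stall, _ := strconv.ParseInt(m[3], 10, 64)
		// Aggregate, since a node may log stats more than once.
		s := byNode[id]
		s.Rows += rows
		s.StallNanos += stall
		byNode[id] = s
	}
	return byNode
}
```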

asubiotto commented 6 years ago

Thanks everyone!

asubiotto commented 6 years ago

Here is how I think I'm going to proceed with this work: