@knz things like the number of rows, total processing time, memory used, and disk space used are possible. If any of these stats are hard to collect, let's punt.
If at all possible, I would also love to capture time spent waiting for a given query.
cc @asubiotto
I believe we want to return these statistics in `SHOW TRACING`.
Returning the statistics in `SHOW TRACE` is a good start. We can also consider adding a `SHOW QUERY STATISTICS` that is a version of `SHOW TRACE` that prints out just the stats.
In terms of prioritization, I'd say the most important one is the number of rows (and/or bytes) moved to external storage, followed by memory used and the amount of data processed, and then the total and stall times.
I think this list of information to be collected is good. I would add:
I'd clarify what we want out of "Total time spent on each processor". The time when a processor started running and when it finished we already have - it corresponds to the processor's span. But that's not very interesting; most of them will live as long as the whole query. What would be very interesting, I think, is some measure of how much CPU it used. But that may be hard to collect; I guess we need some sampling of stacks.
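For concreteness, here is a minimal Go sketch of the cheap-but-imperfect alternative to stack sampling: wrapping a processor's `Next()` calls with a timer. `RowSource`, `timedSource`, and `counter` are hypothetical stand-ins, not the real distsql interfaces, and the comments note why this over-counts relative to true CPU time.

```go
package main

import (
	"fmt"
	"time"
)

// RowSource is a simplified stand-in for a processor's row-producing
// interface (hypothetical; the real distsql interface also returns
// metadata).
type RowSource interface {
	Next() (row int, ok bool)
}

// timedSource accumulates the elapsed time spent inside Next(). Note
// that this measures wall-clock time, not CPU time: a processor that is
// blocked on its input still accrues time here, which is exactly the
// over-counting problem described above and why stack sampling may be
// needed for a faithful CPU measure.
type timedSource struct {
	src     RowSource
	elapsed time.Duration
}

func (t *timedSource) Next() (int, bool) {
	start := time.Now()
	row, ok := t.src.Next()
	t.elapsed += time.Since(start)
	return row, ok
}

// counter is a toy source producing n rows, just to drive the demo.
type counter struct{ n, i int }

func (c *counter) Next() (int, bool) {
	if c.i >= c.n {
		return 0, false
	}
	c.i++
	return c.i, true
}

func main() {
	ts := &timedSource{src: &counter{n: 1000}}
	for {
		if _, ok := ts.Next(); !ok {
			break
		}
	}
	fmt.Printf("time spent in Next(): %s\n", ts.elapsed)
}
```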
Similarly with "stall time" - I'm not sure how we'd measure this cheaply. Perhaps for stalls we should start by instrumenting the lowest level of SQL - i.e. `kvFetcher` - and measure explicitly its `ScanRequest`s; we could aggregate these stats just for the `TableReader` processor, which seems more tractable to me.
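To make that concrete, here's a minimal sketch of `kvFetcher`-level instrumentation under the assumption that each fetch is a discrete round-trip to KV; `kvBatch`, `fetchStats`, and `instrumentedFetch` are illustrative names, not the real API. The fetcher is idle while a `ScanRequest` is in flight, so the whole wait is attributed to stall time, and the volumes are accumulated for the `TableReader` to report.

```go
package main

import (
	"fmt"
	"time"
)

// kvBatch is a hypothetical stand-in for the result of one ScanRequest
// round-trip to the KV layer.
type kvBatch struct {
	keys  int
	bytes int
}

// fetchStats accumulates the stall time and data volume observed at the
// lowest level of SQL. A TableReader could keep one of these per fetcher
// and emit it with the trace.
type fetchStats struct {
	stallTime    time.Duration
	batches      int
	keysFetched  int
	bytesFetched int
}

// instrumentedFetch wraps a single KV round-trip, attributing the entire
// wait to stall time.
func (s *fetchStats) instrumentedFetch(fetch func() kvBatch) kvBatch {
	start := time.Now()
	b := fetch()
	s.stallTime += time.Since(start)
	s.batches++
	s.keysFetched += b.keys
	s.bytesFetched += b.bytes
	return b
}

func main() {
	var s fetchStats
	// Simulate three ScanRequest round-trips.
	for i := 0; i < 3; i++ {
		s.instrumentedFetch(func() kvBatch {
			time.Sleep(5 * time.Millisecond) // stand-in for network/disk wait
			return kvBatch{keys: 100, bytes: 4096}
		})
	}
	fmt.Printf("%d batches, %d keys, %d bytes, stall %s\n",
		s.batches, s.keysFetched, s.bytesFetched, s.stallTime)
}
```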
As for presentation, I agree that `SHOW TRACE` is a good start; namely, I think we should log these stats as regular log messages, and they'll make it to the trace that way.
Separately, we should also output them as distsql metadata, even if initially we don't do anything with the metadata.
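To illustrate the two outputs (a log message folded into the trace, plus out-of-band distsql metadata), here's a stdlib-only Go sketch; `processorStats` and `trailingMeta` are hypothetical stand-ins for whatever protobuf message distsql would actually carry.

```go
package main

import "fmt"

// processorStats is a hypothetical stats payload; in distsql this would
// presumably be a protobuf message carried alongside the rows.
type processorStats struct {
	ProcessorID int
	RowsOut     int64
	BytesOut    int64
	MemUsed     int64
}

// logLine renders the stats as a regular log message; if processors log
// this while tracing is on, it shows up in SHOW TRACE output with no
// extra plumbing.
func (s processorStats) logLine() string {
	return fmt.Sprintf("processor %d: rows=%d bytes=%d mem=%d",
		s.ProcessorID, s.RowsOut, s.BytesOut, s.MemUsed)
}

// trailingMeta is a stand-in for the out-of-band distsql metadata
// stream: the same stats travel next to the rows, even if nothing
// consumes them initially.
func trailingMeta(s processorStats) []processorStats {
	return []processorStats{s}
}

func main() {
	s := processorStats{ProcessorID: 3, RowsOut: 1024, BytesOut: 65536, MemUsed: 1 << 20}
	fmt.Println(s.logLine())     // destined for the trace
	fmt.Println(trailingMeta(s)) // destined for the metadata stream
}
```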
@asubiotto, to answer your offline question about how to structure this information and eventually present it:
Where we eventually need to get is attaching each of these metrics to the specific stage of the query plan it measures. Eventually we'll want at least to present an `EXPLAIN (STATISTICS)` that looks and feels like pg's, and to annotate the distsql plan with collected execution statistics.
Now the question is really how to get there.
What I have in mind is the following:
The heart of the "presentation" stage here is an algorithm that scans/parses/analyzes a trace and constructs a dictionary from plan node ID to stats. Initially (in scope for you) we'll showcase stats collection with EXPLAIN, but soon after that we'll also collect this information in memory so as to show it in the admin UI, etc. (That's out of scope for you now.)
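A sketch of that trace-scanning algorithm, assuming the hypothetical log-line format from the sketch above; the real parser would key off whatever format we standardize on when logging to the trace.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// nodeStats is what the presentation layer wants per plan node.
type nodeStats struct {
	rows  int64
	bytes int64
}

// statsRE matches the hypothetical log-line format emitted by the
// processors (see the earlier sketch).
var statsRE = regexp.MustCompile(`processor (\d+): rows=(\d+) bytes=(\d+)`)

// collectStats scans trace messages and builds the plan-node-ID -> stats
// dictionary, summing across instances of the same processor (one plan
// node can run on several hosts in a distsql flow).
func collectStats(traceMessages []string) map[int]nodeStats {
	out := make(map[int]nodeStats)
	for _, msg := range traceMessages {
		m := statsRE.FindStringSubmatch(msg)
		if m == nil {
			continue
		}
		id, _ := strconv.Atoi(m[1])
		rows, _ := strconv.ParseInt(m[2], 10, 64)
		bytes, _ := strconv.ParseInt(m[3], 10, 64)
		s := out[id]
		s.rows += rows
		s.bytes += bytes
		out[id] = s
	}
	return out
}

func main() {
	trace := []string{
		"processor 3: rows=1024 bytes=65536",
		"processor 3: rows=512 bytes=32768", // same processor, another host
		"processor 5: rows=10 bytes=800",
		"flow started", // unrelated trace message, ignored
	}
	fmt.Println(collectStats(trace))
}
```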
Thanks everyone!
Here is how I think I'm going to proceed with this work: start by collecting stats in the `tableReader`, outputting them to the trace, and then surface them in `EXPLAIN (DISTSQL)`.
This is a stepping stone towards more reporting and query analysis tooling, but for now, it would also provide a lot of value when it comes to helping us debug the performance of queries. We do need to set the stage to build better debugging tools going into 2.1.
The information below will be added to `SHOW TRACE`.
cc @RaduBerinde @andreimatei @knz @vivekmenezes