[Open] graysonzeng opened this issue 2 weeks ago
@graysonzeng Frankly speaking, it depends.
In summary: batching, pipeline parallelism, intra-operator parallelism, async I/O, data/control dependencies, etc. make the subtasks of a query plan execute in an extremely intricate scheduling order, so overall progress is hard to estimate. However, if your query is an ETL workload that runs for quite a long time, the number of finished scan ranges may be a usable proxy.
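If we go the finished-scan-ranges route, the math is just a clamped ratio. Below is a minimal sketch; the class and method names are hypothetical, not existing StarRocks internals:

```java
// Hypothetical sketch: progress estimated from finished scan ranges.
// The caveats above still apply: with pipelining and intra-operator
// parallelism, scan progress is only a rough indicator for long-running
// ETL-style queries, not an accurate ETA.
public final class ScanProgressEstimator {
    private final long totalScanRanges; // known when the plan is built
    private final java.util.concurrent.atomic.AtomicLong finishedScanRanges =
            new java.util.concurrent.atomic.AtomicLong();

    public ScanProgressEstimator(long totalScanRanges) {
        this.totalScanRanges = totalScanRanges;
    }

    // Called by a scan task each time it drains a scan range.
    public void onScanRangeFinished() {
        finishedScanRanges.incrementAndGet();
    }

    // Progress in [0.0, 1.0]; clamped in case ranges are added dynamically
    // after the initial total was recorded.
    public double progress() {
        if (totalScanRanges <= 0) {
            return 0.0;
        }
        return Math.min(1.0, finishedScanRanges.get() / (double) totalScanRanges);
    }
}
```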
After some investigation, I found that Trino provides a good reference for query progress reporting:
```
Query 16, RUNNING, 1 node, 855 splits
http://my.server:8080/v1/query/16?pretty
Splits: 646 queued, 34 running, 175 done
CPU Time: 33.7s total, 191K rows/s, 16.6MB/s, 22% active
Per Node: 2.5 parallelism, 473K rows/s, 41.1MB/s
Parallelism: 2.5
Peak Memory: 1.97GB
Spilled: 20GB
0:13 [6.45M rows, 560MB] [473K rows/s, 41.1MB/s] [=========>>           ] 20%

     STAGES   ROWS  ROWS/s  BYTES  BYTES/s   PEND    RUN   DONE
0.........R  13.8M   336K   1.99G    49.5M      0      1    706
  1.......R   666K  41.5K   82.1M    5.12M    563     65     79
    2.....R  4.58M   234K    620M    31.6M    406     65    236
```
As the Trino example above shows, overall split progress and per-stage progress are displayed separately. Maybe we can borrow these two parts. What do you think? @satanson
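For the display side, rendering a Trino-style progress line from split counts is straightforward. A minimal sketch follows; the 22-cell bar width and the rounding rules are arbitrary choices for illustration, not Trino's actual rendering code:

```java
// Sketch: render a Trino-style progress line from split counts.
// '=' marks finished splits, '>' marks in-flight splits.
public final class ProgressLine {
    public static String render(long queued, long running, long done) {
        final int width = 22; // arbitrary bar width for this sketch
        long total = queued + running + done;
        if (total == 0) {
            return "[                      ]  0%";
        }
        int doneCells = (int) (width * done / total);
        long runShare = width * running / total;
        // Show at least one '>' whenever anything is running.
        int runningCells = (int) Math.min(width - doneCells,
                Math.max(running > 0 ? 1 : 0, runShare));
        StringBuilder bar = new StringBuilder("[");
        for (int i = 0; i < width; i++) {
            bar.append(i < doneCells ? '='
                     : i < doneCells + runningCells ? '>' : ' ');
        }
        long percent = 100 * done / total;
        return bar.append("] ").append(percent).append('%').toString();
    }

    public static void main(String[] args) {
        // Counts from the example above: 646 queued, 34 running, 175 done.
        System.out.println(render(646, 34, 175)); // prints roughly "[====>  ...] 20%"
    }
}
```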
Labels: Enhancement
Description
For long-running SQL queries, it would be helpful if progress could be queried by query ID. One suggestion is to use FinishedFragmentNums/TotalFragmentNums to represent query progress (see the sketch after the status list below).
Currently we can only see coarse query states such as:
RUNNING
FINISHED
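To make the suggestion concrete, here is a minimal sketch of fragment-based progress tracking keyed by query ID. All class and method names are hypothetical illustrations, not existing StarRocks APIs, and a real implementation would likely count fragment *instances*, since one fragment fans out to many instances across backends:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the FinishedFragmentNums/TotalFragmentNums idea:
// the coordinator registers a query's fragment total at plan time, bumps a
// counter as fragments report completion, and serves the ratio by query ID
// (e.g. via SHOW PROC or an HTTP endpoint).
public final class QueryProgressRegistry {
    private record Counters(long total, long finished) {}

    private final Map<String, Counters> byQueryId = new ConcurrentHashMap<>();

    public void register(String queryId, long totalFragments) {
        byQueryId.put(queryId, new Counters(totalFragments, 0));
    }

    public void onFragmentFinished(String queryId) {
        byQueryId.computeIfPresent(queryId,
                (id, c) -> new Counters(c.total(), c.finished() + 1));
    }

    public String progress(String queryId) {
        Counters c = byQueryId.get(queryId);
        if (c == null || c.total() == 0) {
            return "UNKNOWN";
        }
        return String.format("%d/%d (%.0f%%)",
                c.finished(), c.total(), 100.0 * c.finished() / c.total());
    }
}
```

One caveat with this design: fragments finish very unevenly (a scan fragment may complete long before a final aggregation fragment starts producing output), so the ratio is coarse; combining it with the scan-range signal discussed above would give a smoother picture.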