[Feature] Add execution tracing with performance for MQE engine and BanyanDB storage

wu-sheng commented 4 months ago

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Description

By adopting BanyanDB as first-class storage, we are going to keep adding observability to ourselves, as we now have more capability to optimize performance. So, after a discussion with @hanahmily, I want to propose a new debugging tool for MQE queries.

Use case

MQE has been the most important query engine for several versions. Since v10, MQE + BanyanDB storage has been the recommended combination. To help end users, we want to make performance diagnosis easier by collecting context and end-to-end performance costs.

To implement that, we need to add the following:

[ ] 1. BanyanDB 0.7 implemented structured execution plans with an on-demand flag. This requires an extension to the query. When the client runs a query with that flag, the server should return the data together with the execution plans.
[ ] 2. A new optional GraphQL flag and response extension for MQE, which could drive the engine to build an execution process tree (like an internal self-trace), including the called methods (such as metric query and sort query). When the storage has server-side plans, such as <1>, we could wrap them inside the trace segments.
[ ] 3. As the newly added flag from <2> is optional, the UI should have the capability to visualize this trace on demand. We need to discuss how to visualize this. Maybe through a pop-up debug box beside the widget, showing the MQE debugging trees one by one. Also, considering query refreshing, the UI should be able to hold the 5 to 10 most recent query logs.
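
As a rough illustration of the "hold 5 to 10 recent query logs" idea in item 3, here is a minimal Python sketch of a bounded history. The class name `RecentQueryTraces`, its methods, and the shape of the stored entries are hypothetical and only for illustration. The real UI would be implemented in its own stack and may look quite different.

```python
from collections import deque


class RecentQueryTraces:
    """Keeps only the most recent debugging traces for one widget (hypothetical helper)."""

    def __init__(self, capacity: int = 10) -> None:
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._traces = deque(maxlen=capacity)

    def record(self, expression: str, trace: dict) -> None:
        # Store the expression together with whatever trace the server returned.
        self._traces.append({"expression": expression, "trace": trace})

    def latest_first(self) -> list:
        # Newest trace first, which is the order a debug box would likely show.
        return list(reversed(self._traces))


# Simulate 7 widget refreshes while keeping only the last 5 traces.
history = RecentQueryTraces(capacity=5)
for i in range(7):
    history.record(f"query-{i}", {"duration_ns": i})
```

With a fixed capacity, a widget that refreshes periodically never grows its debug history without bound.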

Related issues

No response

Are you willing to submit a pull request to implement this on your own?

[ ] Yes I am willing to submit a pull request on my own!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

wu-sheng commented 4 months ago

As most queries are very fast, we are going to use ns (1e-9 of a second) as the time unit. Anything less than 1 ns can be ignored.
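
To make the unit concrete, here is a small Python sketch of measuring a step in integer nanoseconds. The helper name `timed_ns` is hypothetical, and the OAP server itself would use its own language's monotonic clock. With integer nanoseconds, anything below 1 ns simply rounds away.

```python
import time


def timed_ns(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in whole nanoseconds)."""
    start = time.perf_counter_ns()  # monotonic clock, integer nanoseconds
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter_ns() - start
    return result, elapsed


# Example: time a trivial in-memory calculation.
result, elapsed_ns = timed_ns(sum, range(1000))
```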

wu-sheng commented 4 months ago

For example, when we run the MQE expression service_percentile{p='50,75'} - avg(service_percentile{p='50,75'}), the execution tracing should look like

- MQE expression, service_percentile{p='50,75'} - avg(service_percentile{p='50,75'})
  - duration: 100 ns
  - queries
     - MQE syntax analysis
        - duration: 10ns
        - error:
     - readMetricsValues(service_percentile)
        - duration:
        - server-side traces....
     - /* Multiple metrics queries if needed. */
     - /* We need to consider flagging concurrent queries if MQE supports running in that mode. */
     - In-memory calculation
        - error:
        - duration:
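
The outline above could be modeled as a simple span tree. The following Python sketch only illustrates the data shape, not the OAP implementation: `Span`, `timed`, `child`, and `render` are hypothetical names, and the server-side trace is a placeholder rather than real BanyanDB output.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@contextmanager
def timed(span):
    """Record the duration (ns) and any error of the wrapped block on the span."""
    start = time.perf_counter_ns()
    try:
        yield span
    except Exception as exc:
        span.error = str(exc)
        raise
    finally:
        span.duration_ns = time.perf_counter_ns() - start


@dataclass
class Span:
    name: str
    duration_ns: int = 0
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str):
        # Attach a child span and return a context manager that times it.
        span = Span(name)
        self.children.append(span)
        return timed(span)

    def render(self, indent: int = 0) -> str:
        pad = "  " * indent
        lines = [f"{pad}- {self.name}", f"{pad}  - duration: {self.duration_ns} ns"]
        if self.error is not None:
            lines.append(f"{pad}  - error: {self.error}")
        lines.extend(c.render(indent + 1) for c in self.children)
        return "\n".join(lines)


root = Span("MQE expression, service_percentile{p='50,75'} - avg(service_percentile{p='50,75'})")
with timed(root):
    with root.child("MQE syntax analysis"):
        pass  # parsing would happen here
    with root.child("readMetricsValues(service_percentile)") as query:
        # Placeholder for execution plans returned by the storage, if any.
        query.children.append(Span("server-side trace (placeholder)"))
    with root.child("In-memory calculation"):
        pass  # the subtraction and avg() would happen here

print(root.render())
```

Because child spans run sequentially inside the root block, the root duration is always at least the sum of its direct children. A concurrent mode would need an explicit flag on the span, as noted in the outline.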

wu-sheng commented 4 months ago

Besides the server-side (OAP) response, the UI should trace the query cost from the browser's perspective, which provides extra information about whether the query is slow because it is pending on the network or in the HTTP server queue.
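
As a worked illustration of that comparison, the time not explained by the server-side trace can be estimated by subtracting the server-reported duration from the browser-observed duration. The function name and the sample numbers below are hypothetical.

```python
def client_side_overhead_ns(browser_elapsed_ns: int, server_duration_ns: int) -> int:
    """Time not accounted for by the OAP-side trace (network, HTTP queuing, etc.)."""
    # Clamp at zero in case the two clocks disagree slightly.
    return max(browser_elapsed_ns - server_duration_ns, 0)


# Hypothetical numbers: the browser saw 5 ms end to end, the OAP trace reports 1.2 ms.
overhead = client_side_overhead_ns(5_000_000, 1_200_000)
```

A large overhead relative to the server duration would point at the network or the HTTP server queue rather than at MQE or the storage.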

hanahmily commented 4 months ago

BanyanDB server-side is related to #10561.

wu-sheng commented 3 months ago

> BanyanDB server-side is related to #10561.

I have approved that. It should be easy enough to integrate. The only thing is, @hanahmily, we need the new client version to make the code work.
