cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

ui: improve sql stats fingerprints pages response time #97875

Closed xinhaoz closed 1 year ago

xinhaoz commented 1 year ago

On our stmt and txn fingerprints pages, we have seen increasingly long loading times when requesting persisted sql stats for clusters with large amounts of data. This can lead to the page being unusable, as the minimum time range we allow to be requested is the past 1 hour.

We need to provide a fallback so that users can see at least some of their live workload.

Proposed experience:

cc @kevin-v-ngo , @dongniwang

EDIT - NEW APPROACH

We will read only from the system table (see discussion below), since the in-memory stats are actually the source of many of our response time issues. Limit and sort options will be added as part of this plan.
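As a rough sketch of what "limit and sort" could look like when querying only the persisted table (illustrative only, not CockroachDB's actual API): build the SQL server-side from a whitelist of sortable JSON stats, so the UI never pulls the full table and user-chosen sort keys are never interpolated directly. The JSON paths mirror the queries quoted later in this thread; the `::FLOAT` cast is an assumption, needed because `->>` extracts text.

```python
# Hypothetical helper: build a limited, sorted query over the persisted
# statement stats table. Names and paths are illustrative.

SORTABLE_STATS = {
    "count": "statistics->'statistics'->>'cnt'",
    "contention": "statistics->'execution_statistics'->'contentionTime'->>'mean'",
    "cpu": "statistics->'execution_statistics'->'cpuSQLNanos'->>'mean'",
}

def build_stmt_stats_query(sort_key: str, limit: int) -> str:
    """Return a SELECT with server-side ORDER BY and LIMIT. The
    whitelist keeps user input out of the SQL text."""
    if sort_key not in SORTABLE_STATS:
        raise ValueError(f"unknown sort key: {sort_key}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    expr = SORTABLE_STATS[sort_key]
    return (
        "SELECT * FROM system.statement_statistics "
        f"ORDER BY ({expr})::FLOAT DESC LIMIT {limit}"
    )
```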

Jira issue: CRDB-25071

xinhaoz commented 1 year ago

I'm going to include/address the following issues as a part of this:

j82w commented 1 year ago
  1. Why not make the default to load the in-memory stats first, so the page loads the fastest, and then pull persisted stats in the background? That way it's always the same workflow.
  2. Won't the exact time cause confusion, because different nodes will flush to the persisted table at different intervals, and some nodes might even get restarted? Could the date-time picker have an in-memory option?
xinhaoz commented 1 year ago

@j82w Great questions --

> Why not make the default to load the in-memory stats first, so the page loads the fastest, and then pull persisted stats in the background? That way it's always the same workflow.

In a lot of our default time-picker cases we'll already be fetching in-memory stats along with the flushed data. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is more about offering a reliable alternative while we stabilize sql stats.

> Won't the exact time cause confusion, because different nodes will flush to the persisted table at different intervals, and some nodes might even get restarted? Could the date-time picker have an in-memory option?

I think labeling it just 'in memory' might be more confusing, since users won't know what the date range is. The flush interval is a cluster setting shared by all nodes, so my thinking was to surface that interval when we show the in-memory option (e.g. 'Last 10 Minutes').

WDYT?

j82w commented 1 year ago

> In a lot of our default time-picker cases we'll already be fetching in-memory stats along with the flushed data. I think it's best not to always incur an additional request if the sql stats system is performing alright. This feature is more about offering a reliable alternative while we stabilize sql stats.

I was hoping this feature would also reduce the initial page load times for everyone and not just the customers with large datasets. If we want to focus on just getting the one scenario to work that is fine.

> I think labeling it just 'in memory' might be more confusing, since users won't know what the date range is. The flush interval is a cluster setting shared by all nodes, so my thinking was to surface that interval when we show the in-memory option (e.g. 'Last 10 Minutes').

I'm trying to think of a way to convey that it's not all data from the last 10 minutes, because different nodes will flush the data at different times. If it just says 'Last 10 Minutes' we will likely get customer issues asking why they don't see a query that ran 2 minutes ago because the flush occurred 1 minute ago. @kevin-v-ngo and/or @dongniwang do you have any suggestions?
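To make the flush skew concrete, a tiny illustrative sketch (timestamps and flush times invented): each node's in-memory stats only cover executions since that node's last flush, so a query that ran 2 minutes ago may already be on disk on one node while a similar one is still in memory on another.

```python
# Illustrative only: per-node flush times mean "Last 10 Minutes" is not
# a uniform in-memory window across the cluster. Times are in seconds.

def stats_location(exec_ts: float, node_last_flush: float) -> str:
    """Where an execution's stats live right now on its gateway node."""
    return "in-memory" if exec_ts > node_last_flush else "persisted"

now = 1000.0
flush_a = now - 60    # node A flushed 1 minute ago
flush_b = now - 540   # node B flushed 9 minutes ago
query_ts = now - 120  # a query that ran 2 minutes ago

assert stats_location(query_ts, flush_a) == "persisted"
assert stats_location(query_ts, flush_b) == "in-memory"
```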

j82w commented 1 year ago

Data point for system.statement_statistics table with 118k rows:

  1. Over 3 minutes to return all the results
  2. About 8 seconds without an index:

    select * from system.statement_statistics
    order by statistics->'execution_statistics'->>'cnt' limit 40;
  3. This union of the top rows for the most commonly sorted columns returns in 3 seconds for some reason. The data might be skewed by how the queries were created.
    (select * from system.statement_statistics
    order by statistics->'execution_statistics'->>'cnt' limit 40)
    union
    (select * from system.statement_statistics
    order by statistics->'execution_statistics'->'contentionTime'->>'mean' limit 40)
    union (
    select * from system.statement_statistics
    order by statistics->'execution_statistics'->'cpuSQLNanos'->>'mean' limit 40)
    union (
    select * from system.statement_statistics
    order by statistics->'execution_statistics'->'rowsRead'->>'mean' limit 40)
    union (
    select * from system.statement_statistics
    order by statistics->'execution_statistics'->'runLat'->>'mean' limit 40)
    union (
    select * from system.statement_statistics
    order by statistics->'execution_statistics'->'svcLat'->>'mean' limit 40)
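One caveat worth flagging on the queries above: `->>` extracts the JSON value as text, so the `ORDER BY` compares strings rather than numbers, and without `DESC` it returns the smallest values, not the top. A toy demonstration of the string-ordering pitfall:

```python
# "->>'cnt'" yields text, and text sorts lexicographically: "9" > "10".
counts = ["10", "9", "100", "2"]

assert sorted(counts) == ["10", "100", "2", "9"]  # string order, not numeric
assert sorted(counts, key=float, reverse=True) == ["100", "10", "9", "2"]
```

A cast such as `(statistics->'execution_statistics'->>'cnt')::FLOAT` with `DESC` would be needed to truly rank the top statements.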

Data point for crdb_internal.cluster_statement_statistics table with 100k rows:

  1. About 3 seconds to just generate the view. Probably longer because it has to fan out to all the nodes.
  2. About 4 seconds to return all the results

Taking the above numbers into consideration would a better solution be to:

  1. Add indexes to the system table and only return the top 40 rows (2 pages), so the result always comes back in under 7 seconds. Every sort or search would require a query to the db to get the new results. If the extra round trips become an issue, we could use the union above to prefetch the top rows for the most commonly sorted columns.
  2. Only show the in-memory stats if the time frame is limited to under 30 minutes or the persisted results are less than an hour old.

This solution would allow users to select time frames and avoids the overhead of sending all the data back to the ui to do the computation. The trade-off is that each sort/filter will take longer.
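The union idea in option 1 can be sketched in a few lines (field names are illustrative, not the real schema): rank rows by each commonly sorted metric, keep the top K of each, and union the fingerprint ids so any likely sort target is already covered.

```python
import heapq

def top_union(rows, metrics, k):
    """Union of the ids of the top-k rows for each metric."""
    keep = set()
    for m in metrics:
        top = heapq.nlargest(k, rows, key=lambda r: r[m])
        keep.update(r["id"] for r in top)
    return keep

rows = [
    {"id": 1, "cnt": 500, "cpu": 1.0},
    {"id": 2, "cnt": 10, "cpu": 9.0},
    {"id": 3, "cnt": 50, "cpu": 0.5},
]

# Top-1 by cnt is row 1 and top-1 by cpu is row 2; the union covers both.
assert top_union(rows, ["cnt", "cpu"], 1) == {1, 2}
```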

WDYT?

xinhaoz commented 1 year ago

Adding onto the above -- posting from slack for tracking.

I spoke with @j82w on Friday, who initially had some concerns about the in-memory fallback. After doing some testing of my own with lots of in-memory data (~40k unique fingerprints), I now feel that providing the in-memory option may not be the reliable solution we want. Instead, we should actually do the opposite: default to not reading from in-memory stats and query the system table only (with the new limit and order-by features to get interesting statements). When we have a lot of in-memory data (which I assume customers running into the loading issue often do, since the default range is the past 1 hour), it seems to cause all sorts of problems --

Given that virtual table generation and merging with in-memory stats seem to be the bottleneck in the problematic workloads (ones with many unique fingerprints), how does product feel about not including in-memory stats by default? Our flush interval is already set to 10 minutes, which is pretty frequent. One would then need to wait up to 10 minutes to see any data appear (after resetting). If it's important to surface data right after resetting / starting the cluster, one idea is to query from in-memory stats in the interim when the system table returns nothing (we'd still need to improve our in-memory response time to service that reasonably). With these changes I think we could reliably show data as long as some workload data has been flushed to disk, which is easier to guarantee.
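The fallback control flow described above is simple to sketch (the two fetchers below stand in for the persisted and in-memory endpoints; this is an assumption about the shape, not the real API):

```python
def fetch_stats(fetch_persisted, fetch_in_memory):
    """Serve persisted stats by default; fall back to the in-memory
    fan-out only when nothing has been flushed yet (e.g. right after a
    stats reset)."""
    rows = fetch_persisted()
    if rows:
        return rows, "persisted"
    return fetch_in_memory(), "in-memory"

rows, source = fetch_stats(lambda: [], lambda: [{"fingerprint": "SELECT _"}])
assert source == "in-memory"  # nothing flushed yet

rows, source = fetch_stats(lambda: [{"fingerprint": "SELECT _"}], lambda: [])
assert source == "persisted"
```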

j82w commented 1 year ago

The in-memory results have the following issues.

  1. Latency increases as the number of nodes increases, since the request has to fan out to all of them. The persisted query can target just the KV nodes with the data.
  2. The filters/limits are not passed over the wire to the remote nodes. The number of results will likely cause memory issues for large numbers of statements.
  3. If there is an unhealthy node, the request will fail.

Here is a rough diagram explaining the process. Let me know if you have any suggestions or notice any errors.

```mermaid
sequenceDiagram
    title crdb_internal fanout logic
    participant user
    participant sql
    participant crdb_internal
    participant SQLStatusServer
    participant Local
    user->>sql: Request crdb_internal.cluster_statement_statistics
    sql->>crdb_internal: Generate the view
    crdb_internal->>SQLStatusServer: Requests all stats. No filters or limits are passed in.
    SQLStatusServer->>Local: No filters
    activate Local
    Local-->>SQLStatusServer: Return up to 100k
    deactivate Local
    loop All nodes in cluster. Up to 100 parallel
    SQLStatusServer->>Node2: No filters
    activate Node2
    SQLStatusServer->>Node3: No filters
    activate Node3
    SQLStatusServer->>Node4: No filters
    activate Node4
    Node2-->>SQLStatusServer: Return up to 100k
    deactivate Node2
    Node4-->>SQLStatusServer: Return up to 100k
    deactivate Node4
    Node3-->>SQLStatusServer: Return up to 100k
    deactivate Node3
    end
    SQLStatusServer-->>crdb_internal: Return up to 100k
    crdb_internal->>crdb_internal: Generate the view with 100k rows
    crdb_internal->>sql: Return 100k row view
    sql->>sql: Filter/sort rows
    sql->>user: Results
```
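A back-of-the-envelope sketch of why issue 2 above matters (numbers invented): without a limit in the RPC, the coordinator materializes every node's full result set before sorting, while per-node top-K pushdown keeps memory bounded.

```python
import heapq

def fanout_no_pushdown(per_node_rows):
    """Rows the coordinator must hold when remote nodes apply no limit."""
    return len([r for rows in per_node_rows for r in rows])

def fanout_with_pushdown(per_node_rows, k):
    """Rows held when each node returns only its local top-k."""
    return len([r for rows in per_node_rows for r in heapq.nlargest(k, rows)])

nodes = [list(range(100_000)) for _ in range(4)]  # 4 nodes, 100k rows each

assert fanout_no_pushdown(nodes) == 400_000
assert fanout_with_pushdown(nodes, 40) == 160  # 40 rows per node
```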