EspressoSystems / hotshot-query-service

Generic query service for HotShot applications
https://espressosystems.github.io/hotshot-query-service/
GNU General Public License v3.0

Speed up aggregate queries #421

Open jbearer opened 7 months ago

jbearer commented 7 months ago

We have a number of queries that currently compute some kind of aggregate over large tables:

One way to speed these up would be to cache their results in a dedicated, one-row table, and use insert/update triggers to keep them up to date. However, this would not reflect cases where the database spontaneously lost data (e.g. a bad update). While this should be extremely rare, at least for sync_status part of the point of the query is to detect and alert on these rare error cases.
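For concreteness, here is a minimal sketch of the trigger-maintained one-row table, using SQLite syntax and a hypothetical `leaf(height, payload)` schema (the real schema and table names will differ). Note the caveat above still applies: the triggers only fire on normal inserts/deletes, so a spontaneous data loss would leave the cached count stale.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE leaf (height INTEGER PRIMARY KEY, payload BLOB);

    -- Dedicated one-row table caching the aggregate.
    CREATE TABLE leaf_stats (
        id INTEGER PRIMARY KEY CHECK (id = 0),
        total_leaves INTEGER NOT NULL
    );
    INSERT INTO leaf_stats VALUES (0, 0);

    -- Insert/delete triggers keep the cached count up to date.
    CREATE TRIGGER leaf_insert AFTER INSERT ON leaf BEGIN
        UPDATE leaf_stats SET total_leaves = total_leaves + 1 WHERE id = 0;
    END;
    CREATE TRIGGER leaf_delete AFTER DELETE ON leaf BEGIN
        UPDATE leaf_stats SET total_leaves = total_leaves - 1 WHERE id = 0;
    END;
""")
db.executemany("INSERT INTO leaf VALUES (?, ?)", [(h, b"") for h in range(100)])
db.execute("DELETE FROM leaf WHERE height < 10")
# O(1) read instead of a COUNT(*) over the whole table.
print(db.execute("SELECT total_leaves FROM leaf_stats").fetchone()[0])  # 90
```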

Another approach would be to keep the queries the same but run them incrementally (i.e. add from_height, to_height parameters). Since height is indexed on all tables, narrowing the query to a certain height range can make it run in constant (or logarithmic) time. We can divide the table into fixed-size ranges and keep the summary statistics in memory for each range. A background task, similar to the proactive scanner, would then iterate over each range, keeping the in-memory statistics in sync with the database. Answering a query just requires aggregating the in-memory range statistics. We can even update the overall aggregates efficiently, e.g. updating a count or a sum just involves computing the difference in the range that we just synced.
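A sketch of the per-range bookkeeping, under assumed names (`payload(height, data)`, a range size of 1000): each range is re-synced with an indexed height-bounded query, and the overall total is updated by the delta rather than recomputed.

```python
import sqlite3

RANGE_SIZE = 1000  # fixed-size height ranges; tune for the real workload

class RangeStats:
    """In-memory per-range statistics, kept in sync by a background task."""

    def __init__(self, db):
        self.db = db
        self.ranges = {}  # range index -> count of missing payloads
        self.total = 0

    def sync_range(self, i):
        lo, hi = i * RANGE_SIZE, (i + 1) * RANGE_SIZE
        # Bounded scan, fast thanks to the index on height.
        (count,) = self.db.execute(
            "SELECT count(*) FROM payload"
            " WHERE height >= ? AND height < ? AND data IS NULL",
            (lo, hi),
        ).fetchone()
        # Update the overall aggregate by the difference, no full recompute.
        self.total += count - self.ranges.get(i, 0)
        self.ranges[i] = count

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payload (height INTEGER PRIMARY KEY, data BLOB)")
db.executemany("INSERT INTO payload VALUES (?, ?)",
               [(h, None if h % 10 == 0 else b"x") for h in range(3000)])
stats = RangeStats(db)
for i in range(3):  # background task: iterate over each range
    stats.sync_range(i)
print(stats.total)  # 300 heights with a missing payload
```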

With this strategy, spontaneous loss of data would be reflected in the statistics, although there may be some delay before we re-index the range of the database that lost data. This is probably fine; the thing that matters is that we would still get the alert if data was lost.

jbearer commented 7 months ago

A simpler thing along similar lines would be to just cache the overall result in memory, and run the query over the entire database on a schedule, updating the cached result. This is much simpler and uses less memory than described above, and since the query is not on the critical path and cannot be triggered by clients, it's not as big a deal if it is expensive to run. The main downside is higher latency between updates.
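A minimal sketch of this simpler variant, again with hypothetical names: the expensive full-table query runs only in a scheduled refresh, and clients read the cached value instantly.

```python
import sqlite3
import time

class CachedSyncStatus:
    """Cache the full-table aggregate in memory; refresh on a schedule."""

    def __init__(self, db, interval=60.0):
        self.db = db
        self.interval = interval
        self.cached = None
        self.refresh()

    def refresh(self):
        # The expensive full-table scan runs here, off the critical path.
        (self.cached,) = self.db.execute(
            "SELECT count(*) FROM payload WHERE data IS NULL").fetchone()

    def run(self):
        # Background task: re-run the query periodically. The downside is
        # latency: the cached value can lag by up to `interval` seconds.
        while True:
            time.sleep(self.interval)
            self.refresh()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payload (height INTEGER PRIMARY KEY, data BLOB)")
db.executemany("INSERT INTO payload VALUES (?, ?)",
               [(h, None if h < 5 else b"x") for h in range(50)])
status = CachedSyncStatus(db)
print(status.cached)  # 5 missing payloads, served from memory
```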

imabdulbasit commented 6 months ago

How about materialized views that have only one row (e.g. total_leaves, null_payloads), refreshed every time we prune data? Each view would also have a height column. Queries such as sync_status() would then aggregate values from the view together with results from the table where height > the height stored in the view (a union query). This would avoid a full table scan. Additionally, adding a partial index so that the IS NULL filter uses an index will make queries faster.
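Here is one way this could look, simulating the materialized view with a plain table since SQLite lacks them (Postgres would use CREATE MATERIALIZED VIEW / REFRESH MATERIALIZED VIEW); table and column names are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payload (height INTEGER PRIMARY KEY, data BLOB);
    -- Partial index: the IS NULL filter becomes an index scan.
    CREATE INDEX null_payloads_idx ON payload (height) WHERE data IS NULL;

    -- Stand-in for the one-row materialized view: an aggregate plus the
    -- height it covers, refreshed on each pruner run.
    CREATE TABLE payload_summary (null_payloads INTEGER, height INTEGER);
""")
db.executemany("INSERT INTO payload VALUES (?, ?)",
               [(h, None if h % 5 == 0 else b"x") for h in range(2000)])
# "Refresh" at prune time: summarize everything up to height 1500.
db.execute("""
    INSERT INTO payload_summary
    SELECT count(*), 1500 FROM payload WHERE height <= 1500 AND data IS NULL
""")
# More blocks arrive after the refresh.
db.executemany("INSERT INTO payload VALUES (?, ?)",
               [(h, None if h % 5 == 0 else b"x") for h in range(2000, 2500)])
# sync_status-style query: view total plus an indexed scan of the tail only.
(nulls,) = db.execute("""
    SELECT (SELECT null_payloads FROM payload_summary)
         + (SELECT count(*) FROM payload
            WHERE height > (SELECT height FROM payload_summary)
              AND data IS NULL)
""").fetchone()
print(nulls)  # 500 missing payloads, without a full table scan
```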

jbearer commented 6 months ago

I like it

jbearer commented 6 months ago

Hmm actually...this means that if we managed to recover data in the portion of the table covered by the materialized view, sync_status wouldn't reflect it until the view was refreshed. This makes it much less useful for monitoring, and especially for the block explorer, where we would want to take down the warning that data may be out of date as soon as possible.

What if, instead of storing counts in the materialized view, we stored the actual list of heights that were missing (excluding pruned data)? Then the sync status query would outer join the view with the real table to get the number of entries currently missing from that range, add the number of pruned rows, and add the number of missing entries from scanning height > the materialized view height. This still only requires scanning the number of rows which were missing when the view was last refreshed, plus the number of rows added since then. The number of missing rows at any given time is almost always going to be very small.
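A sketch of this variant (hypothetical names, pruned rows omitted for brevity): the "view" stores the heights that were missing at refresh time, so data recovered afterwards is picked up immediately by the outer join, with no view refresh needed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payload (height INTEGER PRIMARY KEY, data BLOB);
    -- Stand-in for the materialized view: heights missing at refresh time,
    -- plus the height the view covers.
    CREATE TABLE missing_at_refresh (height INTEGER PRIMARY KEY);
    CREATE TABLE view_meta (height INTEGER);
""")
# At refresh time, heights 0..99 are present except 10, 20, 30.
db.executemany("INSERT INTO payload VALUES (?, ?)",
               [(h, b"x") for h in range(100) if h not in (10, 20, 30)])
db.executemany("INSERT INTO missing_at_refresh VALUES (?)",
               [(10,), (20,), (30,)])
db.execute("INSERT INTO view_meta VALUES (99)")
# Later we recover height 20: the outer join below notices without a refresh.
db.execute("INSERT INTO payload VALUES (20, ?)", (b"x",))
# Heights 100..109 arrive, with 105 missing.
db.executemany("INSERT INTO payload VALUES (?, ?)",
               [(h, b"x") for h in range(100, 110) if h != 105])

(missing,) = db.execute("""
    SELECT
      -- heights still missing within the range the view covers
      (SELECT count(*) FROM missing_at_refresh m
       LEFT OUTER JOIN payload p ON p.height = m.height
       WHERE p.height IS NULL)
      -- plus gaps above the view height, from scanning only the tail
      + (SELECT max(height) - (SELECT height FROM view_meta) - count(*)
         FROM payload WHERE height > (SELECT height FROM view_meta))
""").fetchone()
print(missing)  # 3: heights 10, 30, and 105
```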

imabdulbasit commented 6 months ago

For a new service, wouldn't all the data be missing?

jbearer commented 6 months ago

Yes. So the sync status query would be slow at first and gradually speed up as the service caught up. IMO that is acceptable.

jbearer commented 6 months ago

We could potentially optimize it even further with a regime change. I.e. in the beginning the number of missing rows is very large but the number of present rows is very small, so it's actually quite fast to just do a full table scan. If we have some heuristic for switching between these two regimes, we can have very fast, lightweight sync_status queries all the time.
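One possible heuristic, purely illustrative (the function name and the 50% threshold are assumptions, not anything decided in this thread): compare the number of present rows to the current block height and pick whichever regime scans fewer rows.

```python
def choose_strategy(present_rows: int, block_height: int,
                    threshold: float = 0.5) -> str:
    """Pick the cheaper sync_status strategy based on how synced we are.

    While most rows are still missing, scanning the (small) set of present
    rows is cheapest; once the table is mostly full, the missing-heights
    view join scans fewer rows.
    """
    if block_height == 0 or present_rows / block_height < threshold:
        return "full-scan"        # few rows present: scan them all
    return "missing-heights"      # few rows missing: join against the view

print(choose_strategy(100, 1_000_000))      # full-scan (freshly started node)
print(choose_strategy(999_900, 1_000_000))  # missing-heights (almost synced)
```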