Open kderme opened 2 years ago
I did some benchmarks and it seems these queries until epoch 329 are execute 82171515 times (equal to the number of inputs at this moment) and they cause an average delay of 0.000298774193s. This is a total delay of 24550s which is a bit less than 7 hours. For comparison the total sync time to this point is 89340s.
So these queries takes more than 25% of sync time.
This should improve syncing time, at the cost of more disk space, and may also slow some queries. We should perhaps establish baselines for each query. There are 45 tables, and users can construct their own queries against the tables. So the bast course of action would be to ask users about the queries they are running. We could ask DevOps team and Rhys (Cardano GraphQL) as well. We could do a survey of external users to find out the queries they use, such as SPOs that run DBSync (Vitor may know). CExplorer.io uses DBSync. Jide Fashola @ Cardano Foundation may know what queries exchanges do with DBSync.
The lookup for tx hash (as against tx ID) could have quite large footprint in memory for query planners - especially when using CTE, which would make querying balances for transactions much slower.
Additionally, for koios (and it's clients):
Since these have to iterate balance for every account frequently, this will likely take a higher toll on resources. This is just to call out that this change will have expensive impact to query layers (while sync occurs once and is managed via snapshots).
Seems this might be of questionable benefit, without being able to more accurately predict performance impact across various use cases.
Here's a question though: would there perhaps be a benefit to leaving the ID and also adding the hash?
The tx_in
table s already one of the largest tables (in epoch 372
):
select count (*) from tx_in ;
count
-----------
135099657
(1 row)
A hash is 32 bytes, so this is at a minimum, right now, is a 4Gig increase in the table size, and this is for a table that will continue to be one of the fastest growing tables.
I would welcome this change for another reason - this field has caused me much confusion.
When writing a join that I now know from experience is meant to look like so...
JOIN tx_in ON tx_in.tx_out_id = tx_out.tx_id AND tx_in.tx_out_index = tx_out.index
I always ended up erroneously writing tx_in.tx_out_id = tx_out.id
, because I expected the matching partner of tx_out_id to be tx_out's id (when really it's tx_out's tx_id).
If it's changed to use tx_out_hash
then it would be instantly clearer what the counterpart is.
@WubbleWobble - You no longer need tx_in join against tx_out, there is now already an additional field in tx_out called consumed_by_tx_in_id
.
@rdlrt Brilliant - thanks for that! Glad I commented now :)
Summary. This is expected to:
Currently for each
tx_in
:tx
table, to find the id by the tx hashtx_out
table to find the ada amount. This is used to calculate thetx.deposits
tx_out
table to retrive the address, which is the script hashEven if these queries are backed by indexes, they still cause big delays, since all these tables
tx_in
,tx_out
,tx
have tens or hundreds of millions of entries.Alternatively we could simply insert the hash, as it is found in the input. This by itself wouldn't make the first query unnecessary, as its result is also needed by the second query.
To avoid the seconds query, we could use the ledger event which gives us the deposits. This ledger event will have to change in the ledger, as it currently doesn't include the
tx_hash
(BLOCKED: investigate this with ledger team). Whendisable-ledger
is used, this event is not created, so we would have to either do the queries and pay the performance cost or make deposits nullable and add it to https://github.com/input-output-hk/cardano-db-sync/blob/master/doc/configuration.md#--disable-ledger (BLOCKED: decide with product team/users).The third query rarely happens, as only a small percentage of inputs have a redeemer attached, so we can keep it.