Replace `tx_in.tx_out_id` by `tx_in.tx_out_hash`

kderme commented 2 years ago

Summary. This is expected to:

reduce sync time
cause a schema change which will significantly affect users downstream (BLOCKED: decide with product team/users)
needs a migration which doesn't require resync
increase disk size a bit

Currently for each tx_in:

a query is done to the tx table, to find the id by the tx hash
the result is used to do a query to the tx_out table to find the ada amount. This is used to calculate the tx.deposits
if there is a redeemer on this input, a query may be done again to the tx_out table to retrive the address, which is the script hash

Even if these queries are backed by indexes, they still cause big delays, since all these tables tx_in, tx_out, tx have tens or hundreds of millions of entries.

Alternatively we could simply insert the hash, as it is found in the input. This by itself wouldn't make the first query unnecessary, as its result is also needed by the second query.

To avoid the seconds query, we could use the ledger event which gives us the deposits. This ledger event will have to change in the ledger, as it currently doesn't include the tx_hash (BLOCKED: investigate this with ledger team). When disable-ledger is used, this event is not created, so we would have to either do the queries and pay the performance cost or make deposits nullable and add it to https://github.com/input-output-hk/cardano-db-sync/blob/master/doc/configuration.md#--disable-ledger (BLOCKED: decide with product team/users).

The third query rarely happens, as only a small percentage of inputs have a redeemer attached, so we can keep it.

kderme commented 2 years ago

I did some benchmarks and it seems these queries until epoch 329 are execute 82171515 times (equal to the number of inputs at this moment) and they cause an average delay of 0.000298774193s. This is a total delay of 24550s which is a bit less than 7 hours. For comparison the total sync time to this point is 89340s.

So these queries takes more than 25% of sync time.

marshada commented 2 years ago

This should improve syncing time, at the cost of more disk space, and may also slow some queries. We should perhaps establish baselines for each query. There are 45 tables, and users can construct their own queries against the tables. So the bast course of action would be to ask users about the queries they are running. We could ask DevOps team and Rhys (Cardano GraphQL) as well. We could do a survey of external users to find out the queries they use, such as SPOs that run DBSync (Vitor may know). CExplorer.io uses DBSync. Jide Fashola @ Cardano Foundation may know what queries exchanges do with DBSync.

rdlrt commented 2 years ago

The lookup for tx hash (as against tx ID) could have quite large footprint in memory for query planners - especially when using CTE, which would make querying balances for transactions much slower.

Additionally, for koios (and it's clients):

we also had to create stake snapshots (set, mark and go against each account) , a job that has to iterate through every account balance in past 2 epochs
live stake distribution (refreshed every 30 mins) built as cache

Since these have to iterate balance for every account frequently, this will likely take a higher toll on resources. This is just to call out that this change will have expensive impact to query layers (while sync occurs once and is managed via snapshots).

marshada commented 2 years ago

Seems this might be of questionable benefit, without being able to more accurately predict performance impact across various use cases.

Here's a question though: would there perhaps be a benefit to leaving the ID and also adding the hash?

erikd commented 2 years ago

The tx_in table s already one of the largest tables (in epoch 372):

select count (*) from tx_in ; 
   count   
-----------
 135099657
(1 row)

A hash is 32 bytes, so this is at a minimum, right now, is a 4Gig increase in the table size, and this is for a table that will continue to be one of the fastest growing tables.

WubbleWobble commented 11 months ago

I would welcome this change for another reason - this field has caused me much confusion.

When writing a join that I now know from experience is meant to look like so... JOIN tx_in ON tx_in.tx_out_id = tx_out.tx_id AND tx_in.tx_out_index = tx_out.index

I always ended up erroneously writing tx_in.tx_out_id = tx_out.id, because I expected the matching partner of tx_out_id to be tx_out's id (when really it's tx_out's tx_id).

If it's changed to use tx_out_hash then it would be instantly clearer what the counterpart is.

rdlrt commented 11 months ago

@WubbleWobble - You no longer need tx_in join against tx_out, there is now already an additional field in tx_out called consumed_by_tx_in_id.

WubbleWobble commented 11 months ago

@rdlrt Brilliant - thanks for that! Glad I commented now :)

IntersectMBO / cardano-db-sync

Replace `tx_in.tx_out_id` by `tx_in.tx_out_hash` #1289