Blockfrost RYO specific improvements

1000101 commented 4 months ago

Hi, I would like to propose additional improvements/additions to the db sync project, which would help to improve speed of RYO in the future.

A separate --blockfrost switch (or more, to allow for granularity) could be introduced, as some of the data might generally be not needed and would take considerably large amount of disk space, since some of the improvements would require additional de-normalized tables. Other improvements include pre-calculation of historic data, which is not live, but takes considerable amount of resources to calculate on the fly.

Running balances

A starting point is to introduce address and payment_cred tables containing a single address just once (as opposed to tx_out), similar to stake_address table, which could then be used in several ways.

e.g.:

CREATE TABLE address (
  id BIGSERIAL,
  address TEXT NOT NULL
);

CREATE TABLE payment_cred (
  id BIGSERIAL,
  payment_cred TEXT NOT NULL
);

Introduce "running" balance (i.e. the current state) of an address/payment_cred and stake_address (stake_address table would need additional balance field). This would greatly improve the overall speed when fetching balance. There are issues with multi assets (for address/payment_cred) though, as those could bloat some addresses, especially those related to exchanges etc. so this would require further thought, but calculating ADA balance is doable right off the bat. Having utxos available sure helps, but I believe it would be even better to also keep the balances.
Ideally, have balance_in and balance_out fields as well, or use those instead of a single balance field

Live data

In addition, I would like to opt for:

Introduce new fields to pool_hash table, or create a new table with live_stake, active_stake, live_delegators, and historical blocks_minted information which is expensive to calculate on the fly.
Introduce assets table, which would contain a given asset with id (appended/ordered by MIN(ma_tx_mint.id) to allow for pagination), and calculated quantity and mint_or_burn_count.
Introduce stats table, containing the network supply: total, circulating, locked, treasury, reserves and total stake stats: live and active - this info is needed for example to calculate pool saturation, live size etc.

Auxiliary data

Next, data which is not live, but still expensive to calculate on the fly:

Introduce stake_amount (SUM) into epoch table (helpful in calculation of historical live_size of a pool).

Introduce pool_history table, which would contain data grouped by epoch, e.g.:

CREATE TABLE pools_history (
epoch_no BIGINT NOT NULL,
pool_id BIGINT NOT NULL,
blocks INTEGER NOT NULL,
delegators INTEGER NOT NULL,
active_size NUMERIC NOT NULL,
active_stake PUBLIC.LOVELACE,
rewards PUBLIC.LOVELACE NOT NULL,
fees PUBLIC.LOVELACE NOT NULL,
);

Denormalized transactions

Due to how pagination works in RYO, transaction history of address, payment_cred and asset will require denormalized tables in order to speed up lookup, e.g.:

CREATE TABLE transactions_address (
  tx_id BIGINT NOT NULL,
  address_id/payment_cred_id/asset_id BIGINT NOT NULL,
  hash TEXT NOT NULL,
  block_height WORD31TYPE,
  block_id BIGINT NOT NULL,
  tx_index WORD31TYPE NOT NULL,
  block_time INTEGER NOT NULL
);

with a requirement of tx_id being consecutive i.e. without any gaps. Otherwise the lookups won't be optimal. This means not relying on pg's serial, but maintaining the ids internally (or using row_number () OVER ... and always selecting the previous MAX).

Practically all aforementioned improvements have been tested one way or the other (in forms or triggers or matviews), so in case further SQL implementation/examples are needed, I am more than happy to share them with the team.

Note: this is a first attempt to address RYO improvements directly in db-sync, so it will definitely get amended and/or edited.

THANKS!

kderme commented 4 months ago

Some initial thoughts on this:

address and payment_cred tables

We've made some work on this. It slowed down syncing speed, so we decided to park it. It could become an optional feauture though, or default if we neutralise its effect with separate performance improvements

Introduce "running" balance Live data

A challenge with these cases is that each tx triggers an update instead of an inertion. When there are updates, handling rollbacks is never easy.

Auxiliary data

DBSync should be able to extract this from the epoch boundary ledger state.

Denormalized transactions

I don't really understand this. Would this have an entry for each tx and address?

rdlrt commented 4 months ago

separate --blockfrost switch

Wut? 😄 Hope if this is prioritized, the nomen clature continues to be based on utility, not companies. There are multiple API providers as well as projects that have requested parts of this over months (besides other featuresets)🙂 . Sorry for knit picking this - but just want to point it out , now that it is part of IO umbrella.

It slowed down syncing speed, so we decided to park it

@kderme - Given the recent config additions, could maybe make that a flag - as discussed in #1333 , the advantage from query pov (+ resource, particularly storage and IOPs consumption) is abundantly higher than initial 1-time sync

Cmdv commented 4 months ago

Wut? 😄 Hope if this is prioritized, the nomen clature continues to be based on utility, not companies. There are multiple API providers as well as projects that have requested parts of this over months (besides other featuresets)🙂 . Sorry for knit picking this - but just want to point it out , now that it is part of IO umbrella.

Fear not our goal is to improve usability/functionality/performance for the whole community who use cardano-db-sync, no favouritism 😅

In all honesty we're really wanting to hear back from the community to know about their real life user cases so we can better steer what is and isn't needed. Also for us as a team to better explain the restrictions we always have to balance when trying to provide new features/fix existing issues and keeping up with new standards etc.

We have started a public Matrix channel (which we'll publicly advertise asap) we hope it will be a great place for informal discussions/questions to happen and open the communication channels up more. Then if things of importance arises on matrix we can move them to github issues with agreed specs and such. 😄

1000101 commented 3 months ago

Some initial thoughts on this:

address and payment_cred tables

We've made some work on this. It slowed down syncing speed, so we decided to park it. It could become an optional feauture though, or default if we neutralise its effect with separate performance improvements

Optional sound reasonable.

Introduce "running" balance Live data

A challenge with these cases is that each tx triggers an update instead of an inertion. When there are updates, handling rollbacks is never easy.

Yes, I understand that. If it helps, I believe I have a working and battle-tested trigger solution (inserts/upserts/rollbacks,...), although just for ADA balances.

Auxiliary data

DBSync should be able to extract this from the epoch boundary ledger state.

Great to hear that!

Denormalized transactions

I don't really understand this. Would this have an entry for each tx and address?

Basically yes. The way we are currently retrieving txs are per address. Since there are some addresses which can be very large and we are using offset pagination instead of cursor (multiple reasons why), data can be very sparse. We need them dense and ideally know which page lands exactly where so we can retrieve the data with constant complexity.

kderme commented 2 months ago

FYI DBSync now has this table https://github.com/IntersectMBO/cardano-db-sync/blob/master/doc/schema.md#pool_stat

It's not populated by default only when "pool_stat": enabled is used in the db-sync insert config.

IntersectMBO / cardano-db-sync

Blockfrost RYO specific improvements #1686