IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL

Investigation into DB queries happening during Syncing #1721

Open · Cmdv opened this issue 4 months ago

Cmdv commented 4 months ago

Description

This is an investigation into the database queries being run during syncing, which is the slowest process in db-sync. Currently, syncing from scratch to the tip of the chain, with no data omitted via the configuration, can take days! (An exact duration hasn't been ascertained, as I haven't been able to keep it running in a single go.)

The first place of investigation was to use a tool called pgBadger, which parses the PostgreSQL logs and produces a report.
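For anyone reproducing this: pgBadger needs statement durations in the PostgreSQL logs. A minimal postgresql.conf logging setup, sketched along the lines of pgBadger's usual recommendations (tune for your environment, and note that full statement logging is itself expensive), might look like:

```
# Log every statement together with its duration (0 = no minimum threshold)
log_min_duration_statement = 0
# A log_line_prefix pgBadger can parse when logging to stderr
log_line_prefix = '%t [%p]: user=%u,db=%d '
# Extra events that make the report more complete
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
```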

pgBadger Report Overview

The time frame for this report is about 40 minutes of syncing time, which was during epoch 352 on mainnet.

[Screenshot: pgBadger report overview]

The first surprising result was that the highest volume of queries was actually SELECT, at a whopping 84%.

Looking closer at those SELECT queries, here is a list of the most prominent offenders:

| Times executed | id | Query |
| ---: | --- | --- |
| 1,427,906 | 1 | `SELECT "stake_address"."id" FROM "stake_address" WHERE "stake_address"."hash_raw" = ?;` |
| 774,307 | 2 | `SELECT "multi_asset"."id" FROM "multi_asset" WHERE ("multi_asset"."policy" = ?) AND ("multi_asset"."name" = ?);` |
| 696,813 | 3 | `SELECT "tx"."id", "tx"."hash", "tx"."block_id", "tx"."block_index", "tx"."out_sum", "tx"."fee", "tx"."deposit", "tx"."size", "tx"."invalid_before", "tx"."invalid_hereafter", "tx"."valid_contract", "tx"."script_size" FROM "tx" WHERE "tx"."hash" = ?;` |
| 118,598 | 4 | `SELECT "script"."id" FROM "script" WHERE "script"."hash" = ?;` |
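For illustration only (db-sync issues the equivalent through persistent/esqueleto rather than raw SQL), query 1 amounts to the following hash-to-id lookup, sketched here with postgresql-simple:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.Int (Int64)
import Database.PostgreSQL.Simple (Binary (..), Connection, Only (..), query)

-- | Resolve a stake address's autoincremental id from its raw hash bytes.
-- This round trip happens once per uncached credential, hence the 1.4M calls.
queryStakeAddressId :: Connection -> ByteString -> IO (Maybe Int64)
queryStakeAddressId conn hashRaw = do
  rows <- query conn
    "SELECT id FROM stake_address WHERE hash_raw = ?"
    (Only (Binary hashRaw))
  pure $ case rows of
    [Only i] -> Just i
    _        -> Nothing
```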

Queries in codebase

I'd imagine each query will require its own PR, and I will add more as we proceed. For now, here is a look at query 1:

```
queryStakeAddress
             |--> queryStakeAddrWithCacheRetBs
             |                              |--> queryStakeAddrWithCache
             |                              |                      |--> adjustEpochRewards
             |                              |                      |                   |--> insertBlockLedgerEvents
             |                              |                      |--> deleteReward
             |                              |
             |                              |--> queryRewardAccountWithCacheRetBs
             |                                                      |--> queryOrInsertRewardAccount
             |                                                                           |--> insertWithdrawals
             |                                                                           |--> queryOrInsertStakeAddress
             |
             |--> queryStakeAddrAux
                              |--> queryStakeAddrWithCacheRetBs
                                                          |--> queryStakeAddrWithCache
                                                                                  |--> adjustEpochRewards
                                                                                  |--> deleteReward
```
kderme commented 4 months ago

That's a great analysis. Some explanations for the high percentage of queries:

- In most cases we decided to use autoincremental ids for referenced hashes. For example, in reward.addr_id and epoch_stake.addr_id we use the id instead of the hash directly. Resolving this id requires a query; however, replicating the hashes would significantly increase disk space.

That being said, it doesn't mean that there is no room for improvement.

> I'd imagine each query will require its own PR, and I will add more as we proceed.

Do you have some specific solution in mind?

Cmdv commented 4 months ago

> In most cases we decided to use autoincremental ids for referenced hashes. For example, in reward.addr_id and epoch_stake.addr_id we use the id instead of the hash directly. Resolving this id requires a query; however, replicating the hashes would significantly increase disk space.

My thought was to create a cache which is populated after a new stake_address is inserted. We'd populate the cache by having the insert of the stake address return the id it was assigned along with its hash. This could then be added to a local Map, e.g.:

```haskell
type StakeCache = Map StakeAddressHashRaw StakeAddressId

data SyncEnv = SyncEnv
  { ...
  , envStakeCache :: StrictTVar IO StakeCache
  }
```

Then, prior to creating a Reward or EpochStake, we could do a lookup against envStakeCache to get the StakeAddressId from the cache.
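A minimal sketch of that lookup-before-query flow, with placeholder types and a plain TVar for self-containedness (db-sync uses StrictTVar, and `queryDb` here stands in for the real queryStakeAddress):

```haskell
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TVar (TVar, modifyTVar', readTVarIO)
import Data.ByteString (ByteString)
import Data.Int (Int64)
import qualified Data.Map.Strict as Map

type StakeAddressHashRaw = ByteString  -- raw serialised stake address
type StakeAddressId = Int64            -- the autoincremental DB id
type StakeCache = Map.Map StakeAddressHashRaw StakeAddressId

-- | Consult the cache first; fall back to the DB and remember the result.
lookupOrQueryStakeAddrId ::
  (StakeAddressHashRaw -> IO StakeAddressId) ->  -- the real DB query, abstracted
  TVar StakeCache ->
  StakeAddressHashRaw ->
  IO StakeAddressId
lookupOrQueryStakeAddrId queryDb cacheVar hashRaw = do
  cache <- readTVarIO cacheVar
  case Map.lookup hashRaw cache of
    Just addrId -> pure addrId  -- cache hit: no DB round trip
    Nothing -> do
      addrId <- queryDb hashRaw  -- cache miss: one SELECT against stake_address
      atomically $ modifyTVar' cacheVar (Map.insert hashRaw addrId)
      pure addrId
```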

The part I'm unsure about is the deletion of values from the cache. Is there a way for us to know how many times the same StakeAddressHashRaw is likely to be looked up, or is it a case of only one lookup per raw hash?

Cmdv commented 4 months ago

Take that back, I can see we have `type StakeAddrCache = Map StakeCred DB.StakeAddressId`, which should be doing what I'm thinking, but I can't spot where it gets populated in the first place.

I'm just trying to add logging around here to see exactly what is happening:

```haskell
queryStakeAddrAux ::
  MonadIO m =>
  CacheNew ->
  StakeAddrCache ->
  StrictTVar IO CacheStatistics ->
  Network ->
  StakeCred ->
  ReaderT SqlBackend m (Either (DB.LookupFail, ByteString) DB.StakeAddressId, StakeAddrCache)
queryStakeAddrAux cacheNew mp sts nw cred =
  case Map.lookup cred mp of
    -- Cache hit: record the hit; under EvictAndReturn also drop the entry.
    Just addrId -> do
      liftIO $ hitCreds sts
      case cacheNew of
        EvictAndReturn -> pure (Right addrId, Map.delete cred mp)
        _ -> pure (Right addrId, mp)
    -- Cache miss: record it and fall back to a DB query on the serialised account.
    Nothing -> do
      liftIO $ missCreds sts
      let !bs = Ledger.serialiseRewardAccount (Ledger.RewardAccount nw cred)
      mAddrId <- mapLeft (,bs) <$> queryStakeAddress bs
      case (mAddrId, cacheNew) of
        -- Only the CacheNew policy inserts the freshly resolved id into the cache.
        (Right addrId, CacheNew) -> pure (Right addrId, Map.insert cred addrId mp)
        (Right addrId, _) -> pure (Right addrId, mp)
        (err, _) -> pure (err, mp)
```
kderme commented 4 months ago

You can look at how queryOrInsertRewardAccount and queryOrInsertStakeAddress are used; each is passed one of three values:

```haskell
data CacheNew
  = CacheNew
  | DontCacheNew
  | EvictAndReturn
```

EvictAndReturn is used when the stake address is deregistered. DontCacheNew is used on simple outputs that use the stake credential, and on reward adjustments (that's relatively new). CacheNew is used in all other cases, i.e. on registration, receipt of rewards, withdrawals, etc., since db-sync assumes these will be reused.

DontCacheNew was used on simple outputs because an address that is simply used in an output, without ever being registered, is not expected to be reused many times; this keeps the cache small. It seems, though, that there are addresses that are repeatedly used yet never registered. These trick the cache, which never caches them.

Ideally we would have some metrics about unregistered but used addresses and their usage distribution. Caching everything would solve the queries issue, but it may have a negative impact on memory.
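As a hedged sketch of how such metrics could be collected (all names here are illustrative, not db-sync code): bump an in-memory counter on every uncached stake-credential lookup, then histogram the counts.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

-- Per-credential usage counter, keyed by whatever identifies the address.
type UsageCounts k = IORef (Map.Map k Int)

newUsageCounts :: IO (UsageCounts k)
newUsageCounts = newIORef Map.empty

-- | Bump the counter for a credential each time it is looked up uncached.
recordUse :: Ord k => UsageCounts k -> k -> IO ()
recordUse ref k = atomicModifyIORef' ref (\m -> (Map.insertWith (+) k 1 m, ()))

-- | Histogram: how many credentials were seen exactly n times.
usageHistogram :: UsageCounts k -> IO (Map.Map Int Int)
usageHistogram ref = do
  m <- readIORef ref
  pure $ Map.fromListWith (+) [(n, 1) | n <- Map.elems m]
```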

Cmdv commented 4 months ago

That's a good point. I've been trying to do some memory profiling to see what is consuming the most memory, as right now on my Mac:

| Metric | Size |
| --- | --- |
| Real Memory Size | 14.48 GB |
| Virtual Memory Size | 649.18 GB |
| Shared Memory Size | 5.5 MB |
| Private Memory Size | 14.41 GB |

which is quite excessive and needs to be lowered if we can.

Cmdv commented 2 months ago

Some stats about the different implementations:

**StrictSeq + StrictMap**, memory 16.3 GB

- Epoch 432 [15:27 -> 15:58] ~29 mins
  - Stake Addresses: cache sizes: 1287547 and 31415, hit rate: 57%, hits: 1821628, misses: 1321280
  - TxId: cache size: 299608, cache capacity: 300000, hit rate: 81%, hits: 717021, misses: 158255
- Epoch 433 [15:58:01 -> 16:30] ~32 mins
  - Stake Addresses: cache sizes: 1286934 and 48715, hit rate: 79%, hits: 5124040, misses: 1338764
  - TxId: cache size: 287078, cache capacity: 300000, hit rate: 83%, hits: 1672377, misses: 334003

**Master (3 x Maps)**, memory 22.4 GB

- Epoch 432 [09:19 -> 09:52] ~33 mins
  - Stake Addresses: cache sizes: 1287547 and 31415, hit rate: 57%, hits: 1821628, misses: 1321280
  - TxId: cache size: 291335, cache capacity: 300000, hit rate: 81%, hits: 715294, misses: 159982
- Epoch 433 [09:52 -> 10:23] ~35 mins
  - Stake Addresses: cache sizes: 1286934 and 48715, hit rate: 79%, hits: 5124040, misses: 1338764
  - TxId: cache size: 267288, cache capacity: 300000, hit rate: 82%, hits: 1662167, misses: 344213
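For reference, a minimal sketch of what a "StrictSeq + StrictMap" bounded cache could look like: the sequence tracks insertion order for FIFO eviction at capacity, while the map serves lookups. This uses the lazy Data.Sequence for self-containedness; the actual implementation presumably uses Data.Sequence.Strict and may differ in detail.

```haskell
import qualified Data.Map.Strict as Map
import Data.Sequence (Seq, ViewL (..), viewl, (|>))
import qualified Data.Sequence as Seq

data BoundedCache k v = BoundedCache
  { bcCapacity :: !Int
  , bcOrder    :: !(Seq k)       -- oldest key at the left end
  , bcEntries  :: !(Map.Map k v)
  }

emptyCache :: Int -> BoundedCache k v
emptyCache cap = BoundedCache cap Seq.empty Map.empty

lookupCache :: Ord k => k -> BoundedCache k v -> Maybe v
lookupCache k = Map.lookup k . bcEntries

-- | Insert a new entry, evicting the oldest key once capacity is reached.
-- (A real implementation would also handle re-inserting an existing key.)
insertCache :: Ord k => k -> v -> BoundedCache k v -> BoundedCache k v
insertCache k v (BoundedCache cap order entries)
  | Map.size entries < cap = BoundedCache cap (order |> k) (Map.insert k v entries)
  | otherwise = case viewl order of
      oldest :< rest ->
        BoundedCache cap (rest |> k) (Map.insert k v (Map.delete oldest entries))
      EmptyL -> BoundedCache cap (Seq.singleton k) (Map.insert k v entries)
```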