Cmdv opened this issue 4 months ago
That's a great analysis. Some explanations for the high percentage of queries: in most cases we decided to use auto-incremental ids for referenced hashes. For example, in reward.addr_id and epoch_stake.addr_id we use the id instead of the hash directly. Resolving this id requires a query. However, replicating the hashes would significantly increase disk space. That being said, it doesn't mean that there is no room for improvement.

Do you have some specific solution in mind?
My thought was to create a cache which is populated after a new stake_address is inserted. We'd populate the cache by having the insert of the stake address return the id it was assigned along with its hash. Then this could be added to a local Map, e.g.:
type StakeCache = Map StakeAddressHashRaw StakeAddressId

data SyncEnv = SyncEnv
  { ...
  , envStakeCache :: StrictTVar IO StakeCache
  }
Then, prior to creating a Reward or EpochStake, we could do a lookup against envStakeCache to get the StakeAddressId from the cache.
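A minimal, self-contained sketch of that idea, with placeholder definitions for StakeAddressHashRaw and StakeAddressId (both assumptions, not db-sync's actual types) and a plain TVar from stm standing in for db-sync's StrictTVar:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)
import Data.ByteString (ByteString)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Placeholder stand-ins for the real db-sync types.
type StakeAddressHashRaw = ByteString
newtype StakeAddressId = StakeAddressId Int deriving (Eq, Show)

type StakeCache = Map StakeAddressHashRaw StakeAddressId

-- Record the id returned by the stake_address insert against its raw hash.
cacheStakeAddress :: StakeAddressHashRaw -> StakeAddressId -> StakeCache -> StakeCache
cacheStakeAddress = Map.insert

-- Look up the id by raw hash before creating a Reward/EpochStake row.
lookupStakeAddress :: StakeAddressHashRaw -> StakeCache -> Maybe StakeAddressId
lookupStakeAddress = Map.lookup

main :: IO ()
main = do
  cacheVar <- newTVarIO (Map.empty :: StakeCache)
  -- After inserting a new stake_address row:
  atomically $ modifyTVar' cacheVar (cacheStakeAddress "\x01\x02" (StakeAddressId 42))
  -- Prior to creating a Reward or EpochStake:
  cache <- readTVarIO cacheVar
  print (lookupStakeAddress "\x01\x02" cache)  -- prints Just (StakeAddressId 42)
```

This is only the happy path; the interesting part, as noted below, is deciding when entries may be evicted.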
The part I'm unsure about is the deletion of values in the cache. Is there a way for us to know how many times the same StakeAddressHashRaw is likely to be looked up, or is it a case that we'd only do one lookup per raw hash?
Take that back: I can see we have type StakeAddrCache = Map StakeCred DB.StakeAddressId, which should be doing what I'm thinking, but I can't spot where it gets populated in the first place.
I'm just trying to add logging around here to see exactly what is happening:
queryStakeAddrAux ::
  MonadIO m =>
  CacheNew ->
  StakeAddrCache ->
  StrictTVar IO CacheStatistics ->
  Network ->
  StakeCred ->
  ReaderT SqlBackend m (Either (DB.LookupFail, ByteString) DB.StakeAddressId, StakeAddrCache)
queryStakeAddrAux cacheNew mp sts nw cred =
  case Map.lookup cred mp of
    Just addrId -> do
      liftIO $ hitCreds sts
      case cacheNew of
        EvictAndReturn -> pure (Right addrId, Map.delete cred mp)
        _ -> pure (Right addrId, mp)
    Nothing -> do
      liftIO $ missCreds sts
      let !bs = Ledger.serialiseRewardAccount (Ledger.RewardAccount nw cred)
      mAddrId <- mapLeft (,bs) <$> queryStakeAddress bs
      case (mAddrId, cacheNew) of
        (Right addrId, CacheNew) -> pure (Right addrId, Map.insert cred addrId mp)
        (Right addrId, _) -> pure (Right addrId, mp)
        (err, _) -> pure (err, mp)
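To see what that branch logic does in isolation, here is a pure mirror of the cache handling. The CacheNew constructors match db-sync's, but the Stats record and the dbLookup function (standing in for queryStakeAddress against the DB) are my own simplifications, not db-sync code:

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Same three policies as db-sync's CacheNew.
data CacheNew = CacheNew | DontCacheNew | EvictAndReturn

data Stats = Stats { hits :: Int, misses :: Int } deriving (Eq, Show)

-- On a hit, EvictAndReturn deletes the entry; on a miss, the "DB query"
-- (modelled here by a pure lookup function) runs and only CacheNew
-- stores its result back into the cache.
lookupWithPolicy
  :: Ord k
  => CacheNew
  -> (k -> Maybe v)   -- stand-in for the real database lookup
  -> k
  -> Map k v
  -> Stats
  -> (Maybe v, Map k v, Stats)
lookupWithPolicy policy dbLookup k mp st =
  case Map.lookup k mp of
    Just v ->
      let mp' = case policy of
                  EvictAndReturn -> Map.delete k mp
                  _ -> mp
       in (Just v, mp', st { hits = hits st + 1 })
    Nothing ->
      let st' = st { misses = misses st + 1 }
       in case (dbLookup k, policy) of
            (Just v, CacheNew) -> (Just v, Map.insert k v mp, st')
            (found, _) -> (found, mp, st')

main :: IO ()
main = do
  let (v, mp, st) = lookupWithPolicy CacheNew (const (Just (7 :: Int)))
                      ("addr" :: String) Map.empty (Stats 0 0)
  print (v, Map.toList mp, st)
```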
You can look at how queryOrInsertRewardAccount and queryOrInsertStakeAddress are used; they are passed one of the three values of

data CacheNew
  = CacheNew
  | DontCacheNew
  | EvictAndReturn
EvictAndReturn is used when the stake is deregistered. DontCacheNew is used on simple outputs that use the stake credential and on reward adjustments (that's relatively new). CacheNew is used at all other times, i.e. on registration, receiving of rewards, withdrawals etc., since db-sync assumes these addresses will be reused.
DontCacheNew was used on simple outputs, since an address that was simply used in an output without being registered is not expected to be reused many times, which keeps the cache smaller. It seems, though, that there are addresses which are repeatedly used but never registered. These trick the cache, which never caches them.

Ideally we would have some metrics about unregistered, used addresses and their usage distribution. Caching everything would solve the queries issue, but it may have a negative impact on memory.
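As a rough sketch of the kind of metric that could help here, a usage histogram can be folded out of a stream of observed addresses; addrUsage and usageDistribution are hypothetical helpers, not db-sync code:

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Count how many times each address appears in a stream of uses.
addrUsage :: Ord a => [a] -> Map a Int
addrUsage = Map.fromListWith (+) . map (\a -> (a, 1))

-- Distribution: how many addresses were used exactly n times.
usageDistribution :: Ord a => [a] -> Map Int Int
usageDistribution = Map.fromListWith (+) . map (\n -> (n, 1)) . Map.elems . addrUsage

main :: IO ()
main = print (Map.toList (addrUsage ["a", "b", "a" :: String]))  -- [("a",2),("b",1)]
```

Run over unregistered addresses only, usageDistribution would show whether caching them is worthwhile or whether most really are one-shot.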
That's a good point. I've been trying to look at memory profiling to see if I can work out what is consuming the most memory, as right now on my mac:

Real Memory Size:    14.48 GB
Virtual Memory Size: 649.18 GB
Shared Memory Size:  5.5 MB
Private Memory Size: 14.41 GB

which is quite excessive and needs to be lowered if we can.
Some stats about different implementations:

StrictSeq + StrictMap, memory 16.3 GB
- Epoch 432 [15:27 -> 15:58] ~29 mins. Stake Addresses: cache sizes 1287547 and 31415, hit rate 57%, hits 1821628, misses 1321280. TxId: cache size 299608, cache capacity 300000, hit rate 81%, hits 717021, misses 158255.
- Epoch 433 [15:58:01 -> 16:30] ~32 mins. Stake Addresses: cache sizes 1286934 and 48715, hit rate 79%, hits 5124040, misses 1338764. TxId: cache size 287078, cache capacity 300000, hit rate 83%, hits 1672377, misses 334003.

Master (3 x Maps), memory 22.4 GB
- Epoch 432 [09:19 -> 09:52] ~33 mins. Stake Addresses: cache sizes 1287547 and 31415, hit rate 57%, hits 1821628, misses 1321280. TxId: cache size 291335, cache capacity 300000, hit rate 81%, hits 715294, misses 159982.
- Epoch 433 [09:52 -> 10:23] ~35 mins. Stake Addresses: cache sizes 1286934 and 48715, hit rate 79%, hits 5124040, misses 1338764. TxId: cache size 267288, cache capacity 300000, hit rate 82%, hits 1662167, misses 344213.
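For reference, the hit rates in these stats follow from hits / (hits + misses), truncated to a whole percent; a quick sanity check against the stake-address numbers above:

```haskell
-- Hit rate as a truncated whole-number percentage.
hitRatePct :: Int -> Int -> Int
hitRatePct h m = floor ((100 * fromIntegral h / fromIntegral (h + m)) :: Double)

main :: IO ()
main = do
  print (hitRatePct 1821628 1321280)  -- epoch 432 stake addresses: 57
  print (hitRatePct 5124040 1338764)  -- epoch 433 stake addresses: 79
```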
Description
This is an investigation into the database queries being run during syncing, which is the slowest process in db-sync. Currently, syncing from scratch to the tip of the chain with no data omitted from the configuration can take days! (An exact duration hasn't been ascertained, as I haven't been able to keep it running in a single go.)
The first place of investigation was a tool called DbBadger, which looks at the output of the PostgreSQL logs and produces a report.
DbBadger Report Overview
The time frame for this report is about 40 minutes of syncing time, during epoch 352 on mainnet.
The first surprising result was that the highest volume of queries were actually SELECTs, at a whopping 84%. Here is a list of the most prominent SELECT offenders, by number of times executed:

SELECT "stake_address"."id" FROM "stake_address" WHERE "stake_address"."hash_raw" = ?;
SELECT "multi_asset"."id" FROM "multi_asset" WHERE ("multi_asset"."policy" = ?) AND ("multi_asset"."name" = ?);
SELECT "tx"."id", "tx"."hash", "tx"."block_id", "tx"."block_index", "tx"."out_sum", "tx"."fee", "tx"."deposit", "tx"."size", "tx"."invalid_before", "tx"."invalid_hereafter", "tx"."valid_contract", "tx"."script_size" FROM "tx" WHERE "tx"."hash" = ?;
SELECT "script"."id" FROM "script" WHERE "script"."hash" = ?;
Queries in codebase
I'd imagine each query will require its own PR, and I will add more as we proceed. For now here is a look at 1: