use short txids in coinsdb (chainstate) to speed up IBD

Proposal

A large fraction of the coins memory cache is taken up by each entry's 32-byte txid (recall that the unordered map key is a COutPoint, which is the txid and the output index) ... so what if this unordered map was replaced by two unordered maps, same data but one with short txids (say, 4 bytes) and the other the same as today (32-byte txids).

At a high level, when we're verifying a tx input today, we look up its outpoint in the map. If present, then this is a valid outpoint (refers to a known transaction output) and not a double-spend (since this map contains only unspent outputs).

If each map entry requires less memory, then more map entries can fit within a given amount of memory, increasing the cache-hit ratio. Alternatively, the amount of memory for a given hit rate can be reduced.

In the new way, this becomes a two-step process: We first look in the long-key map. If present, then it's the same as today (valid UTXO). If not, we check the short-key map. If it's there, then it's the same as today, the input is valid. It's true that two lookups are required instead of one, but it turns out that nearly all entries will be in the short-key map, so that first lookup should be fast.

When we add an entry, we first check to see if there's an entry with this short txid, if not, we add it to the short-key map. If there is already an entry in the short-key map (there is a collision), then we add it to the long-key map. In other words, the long-key map is only for collisions.

We would also have to write out these short txids to disk, because when it comes time to flush, we no longer know the long txids. But I think this is okay, and actually a feature. There's been some mention of the chainstate now requiring 5 GB of disk and growing, which sounds like very little but on some systems, even that's a concern. This proposal will reduce that size too.

Existing data structures (for review, simplified)

During IBD, the UTXO set changes constantly -- the transactions in each new downloaded block consume UTXOs (their inputs) and create new UTXOs (their outputs). So we want to cache as many UTXOs in memory as possible. (The ones that don't fit are pushed out to LevelDB, and of course also when the node shuts down.)

CCoinsViewCache::cacheCoins is an unordered map of this type:

typedef std::unordered_map<COutPoint, CCoinsCacheEntry, SaltedOutpointHasher> CCoinsMap;

The map key is defined as (not showing methods):

class COutPoint { // primitives/transaction.h
public:
    uint256 hash;
    uint32_t n;
};

The unordered map data:


struct CCoinsCacheEntry { // coins.h
    Coin coin; // The actual cached data.
    unsigned char flags; // DIRTY, FRESH
};

class CTxOut { // primitives/transaction.h
    CAmount nValue;         // int64_t
    CScript scriptPubKey;   // prevector<28, unsigned char>
};

class Coin { // coins.h
    CTxOut out;
    unsigned int fCoinBase : 1;
    uint32_t nHeight : 31;
};

LarryRuane / bitcoin

use short txids in coinsdb (chainstate) to speed up IBD #9

Proposal

Existing data structures (for review, simplified)