0xB10C / bademeister-go

Implementation of Bademeister, a mempool watcher and recorder, in Golang.

Transaction pruning #9

Open OttoAllmendinger opened 5 years ago

OttoAllmendinger commented 5 years ago

The naive way of storing tx metadata keyed by txid requires 32 bytes of overhead (the txid) per transaction. Using 400k txs/day as a high estimate (source), this requires about 12 MB of incompressible storage per day.

Since we require a bitcoind instance anyway, we can use a simple compression scheme to reduce the storage requirement. Once a block has enough confirmations (144), we can store the tx metadata for transactions in that block as a list [m_1, m_2, ... m_n] in the same order as the transactions in the block.

This makes reconstructing the mempool a bit trickier: instead of querying all txs with firstSeen < t < confirmationTime, we must resolve the pruned block data backwards.

Ideally we should keep orphaned blocks around as well and consider reorgs in the mempool reconstruction.

0xB10C commented 5 years ago

I had thought about a slightly different approach to transaction compression, but I don't think this would work with ~orphan~ stale blocks. However:

  1. I'm not sure if we need stale-block tracking for mempool recordings...
  2. I guess the 144 confirmations are probably incompatible with the stale blocks. My rationale for choosing 144 confirmations was to avoid any problems with reorgs and stale blocks.

Since we'll use SQL-based storage, I'd allow the txid field in the Transaction SQL table to be NULL and add a foreign key blockHeight and an integer field indexInBlock indicating the position in the block, both of which can be NULL as well. The process would be the following:

  1. An INSERT would put the transaction in the table with the txid set. blockHeight and indexInBlock would both be NULL. There should be a database index on txid.
  2. The compression after 144 confirmations (i.e. bitcoin-cli getblockhash {current_block_height - 144}, then bitcoin-cli getblock {hash} 0, then deserialize) would insert a new block into the Block SQL table. Each transaction in that block would then be updated (remove txid, set blockHeight and indexInBlock). (TODO: needs discussion on how to handle transactions not found in our mempool recordings.)
  3. The uncompression of the data (i.e. when txids are needed) could be done by accumulating all needed block hashes for the query and then requesting those blocks from Bitcoin Core. The indexInBlock can then be used to look up the txid. (As an alternative, it would be possible to include, or later add, a table of blockHeight, indexInBlock, txid for users without database space constraints. But that's not needed for an MVP, I think.)

I think the compression process can be triggered every time a new block arrives over the ZMQ topic rawblock. With BIP34 (since block 227,836), the blockHeight is included in the deserialized block's coinbase. See code example https://github.com/0xB10C/memo/blob/591a968802b63983fd399ebbb0488d0103173a33/memod/processor/zmq_handler.go#L54-L76

OttoAllmendinger commented 5 years ago

Your approach requires a few more bytes per transaction. However, that data is highly compressible and keeping the transaction table simplifies queries like firstSeen < t < confirmationTime, so we should go with it.

> The uncompression of the data (i.e. when txids are needed) could be done by accumulating all needed block hashes for the query and then requesting those blocks from Bitcoin Core. The indexInBlock can then be used to look up the txid.

I wonder how often txids are actually needed. Use cases where only the feeRate distributions and confirmation times are needed (fee estimation) do not require the txid. Omitting the txid by default from the Mempool interface might allow significant speedups.