Open Problem: Improve looking up transactions by their hash

Currently our best plan for retrieval of transactions (and receipts) by their hash is to implement a specific network to house this mapping of transaction_hash -> (canonical_block_hash, transaction_index). Some "implications" of this are that:

Seeding data into the network in a self-verifying manner means sending proof data to anchor the transaction hash itself into a specific block through the header.transactions_trie field, and then also potentially providing a proof that the header itself is canonical. This incurs a large-ish degree of proof overhead.

We end up storing many hundreds of millions of these entries, which makes the data set very difficult to audit. Is the network's index complete? How do we know what is missing?

The data isn't necessarily organized in the manner that is most useful to users. It might be more useful to be able to look up all of the transactions sent by an individual address and enumerate them in nonce order. Also, given that there are currently hundreds of millions of transactions in the network, it might be more valuable to group this information in some way that reduces the total number of data items stored in the network.

We are interested in clever ideas on how we might better organize and store this data. The rough requirements for any solution we accept would be:

A node that is tasked with storing data would need to be able to verify the legitimacy of the data they are storing.
A node retrieving data from this network should be able to validate the returned data. Self-validating data is nice but not 100% necessary.
Given a transaction hash, we must be able to look up that transaction object, either directly by retrieving the transaction object or indirectly by providing information on what block the transaction can be found in.
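
To make the indirect form of the lookup requirement concrete, here is a minimal sketch in Python. It assumes the index is a plain in-memory mapping and that block bodies are lists of raw transaction bytes keyed by block hash. The names `TxLocation`, `lookup_transaction` and `block_bodies` are hypothetical and do not come from any portal network spec.

```python
from typing import Dict, List, NamedTuple, Optional


class TxLocation(NamedTuple):
    """Where a transaction lives: the value side of the mapping."""
    canonical_block_hash: bytes
    transaction_index: int


def lookup_transaction(
    tx_hash: bytes,
    index: Dict[bytes, TxLocation],
    block_bodies: Dict[bytes, List[bytes]],
) -> Optional[bytes]:
    """Resolve tx_hash -> (block hash, index) -> transaction bytes.

    Returns None when the hash is not indexed or the body is unavailable.
    """
    location = index.get(tx_hash)
    if location is None:
        return None
    body = block_bodies.get(location.canonical_block_hash)
    if body is None or not 0 <= location.transaction_index < len(body):
        return None
    return body[location.transaction_index]
```

In a real network both lookups would be requests to peers, and each response would need the proofs discussed above. The sketch only shows the two-step shape of the lookup.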

I have been spending a lot of time thinking about this issue. Still working through different aspects but I will jot down my thought-path so far. Whilst this does not solve the problems of a hard-to-audit history-network that has burdensome proof requirements, it does provide a perspective on how to think about the network being most useful to users.

I think that retaining the transaction_hash -> (canonical_block_hash, transaction_index) capability in the history-network is a good idea, and that adding an address-index-network achieves a good balance of user empowerment and proof burden.

Goal

It would be nice to support a small portal network user who has the following characteristics:

Low hard disk availability (~1GB space).
An address they have used sporadically (~10 times).
An interest in observing their historical activity (interacted with a variety of protocols).
An interest in using historical activity to inform a decision (want to exit a protocol or sell something).
A desire not to explicitly broadcast their address.

Basic index

The Unchained Index is likely to be at the heart of any solution. A tracing archive node (Erigon) runs TrueBlocks software which looks for the appearance of any address in the course of a transaction execution (even being the callee in a nested contract call). Those are recorded into a chunk, and when chunks are about 25MB they are sealed and uploaded to IPFS. There is also a smart contract that holds the latest IPFS hash.

This is perfect for a full node to quickly access the block number and transaction index for any transaction in which an address appeared. Each chunk is also paired with a bloom filter, which allows a node to acquire only a subset of the index. The index is about 80GB and the bloom filters are about 3GB.

So to get started with the unchained index directly, a small portal node user would have to first get the bloom filters (3GB) and then decide which chunks are relevant for them (10 chunks is 250MB). Then they know which transactions to acquire from peers.
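
The chunk selection step can be sketched as follows. This is a toy bloom filter written for illustration only. It is not the TrueBlocks bloom format, and `ToyBloom`, `candidate_chunks` and the chunk identifiers are hypothetical names. The point is that a node only downloads the chunks whose filter reports a possible match, accepting occasional false positives.

```python
import hashlib
from typing import Dict, Iterator, List


class ToyBloom:
    """A tiny bloom filter; the sizes are arbitrary illustration values."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python int

    def _positions(self, address: str) -> Iterator[int]:
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address.lower()}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, address: str) -> None:
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def might_contain(self, address: str) -> bool:
        # True can be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(address))


def candidate_chunks(address: str, blooms: Dict[str, ToyBloom]) -> List[str]:
    """Return the ids of chunks worth downloading for this address."""
    return [cid for cid, bloom in blooms.items() if bloom.might_contain(address)]
```

With the figures above, the download budget is the 3GB of filters plus 25MB per candidate chunk, so ten candidate chunks add 250MB.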

If we decide that portal nodes should just accept they will have 10GB directories total, then this is perhaps the best path.

Derivative index

If the unchained index is re-organised by address, then nodes could obtain a subset that is relevant for them. You produce regular unchained index chunks and periodically (e.g., every two weeks) publish them to peers in a different format.

A good way to do this is to split the index by the first two hex characters of the address, so data for EOA 0x3cab... and contract 0x3c5a... are in the same group. The whole index is thus broken into 256 pieces that are about 350MB each. I think this is a reasonable balance of data, bandwidth and privacy, but three common characters is also reasonable.
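
The grouping rule and the size arithmetic can be sketched like this. The function names are hypothetical, and the size estimate assumes the ~80GB index splits evenly across groups, which real address distributions will only approximate.

```python
def address_group(address: str, prefix_chars: int = 2) -> str:
    """Return the group label for an address, e.g. '0x3c' for 0x3cab..."""
    hex_part = address.lower()
    if hex_part.startswith("0x"):
        hex_part = hex_part[2:]
    return "0x" + hex_part[:prefix_chars]


def group_count(prefix_chars: int = 2) -> int:
    """Each hex character has 16 values, so there are 16**n groups."""
    return 16 ** prefix_chars


def approx_group_size_mb(total_index_gb: float, prefix_chars: int = 2) -> float:
    """Even-split estimate of one group's size in MB (1GB taken as 1000MB)."""
    return total_index_gb * 1000 / group_count(prefix_chars)
```

Under the even-split assumption, two characters give 256 groups of roughly 312MB, which is in the same range as the ~350MB estimate above. Three characters give 4096 groups of roughly 20MB, at the cost of narrowing the set of addresses a request could be about.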

Index data that is newer can perhaps be shared amongst all peers in a rolling ~4 week window. Data in the 2-4 week old range is obtainable in either the 0x3c group or the newer bucket with all addresses.

Operation

A single portal bridge node runs Erigon with tracing (~2TB) and TrueBlocks. The node runs chifra scrape, which is an existing command in TrueBlocks that combs the latest uncombed part of the chain. It does this every few hours and peers all collect that data. Then every two weeks (100,000 execution blocks) it runs a process that organises the Unchained Index data into the address-first structure. Peers then listen for the address subset relevant to them.

Specific requirements

Just some thoughts that came up for your requirements, recognising I didn't propose a replacement network.

1. Legitimacy

The validity of a given piece of the index is solved in the Unchained Index by the use of permanent content hashes. A similar system for recording and sharing the agreed hash of every 100,000 block group could be used to mitigate an attacker sharing bad index data.
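
A storing node's check against an agreed hash could look like the sketch below. It uses plain SHA-256 as a stand-in for whatever content hash is actually agreed (IPFS content identifiers are encoded differently), and `agreed_hashes`, `group_id` and `verify_index_piece` are hypothetical names. How the agreed hashes are published and trusted is left open here, as it is in the text.

```python
import hashlib
from typing import Dict

BLOCKS_PER_GROUP = 100_000  # the two-week grouping suggested above


def group_id(block_number: int) -> int:
    """Number of the 100,000-block group that contains this block."""
    return block_number // BLOCKS_PER_GROUP


def verify_index_piece(
    piece: bytes, first_block: int, agreed_hashes: Dict[int, str]
) -> bool:
    """Accept a piece only if its digest matches the agreed one for its group."""
    expected = agreed_hashes.get(group_id(first_block))
    if expected is None:
        return False  # no agreed hash yet, so the piece cannot be verified
    return hashlib.sha256(piece).hexdigest() == expected
```

A node would run this check before storing or gossiping a piece, so bad index data is dropped at the first honest peer.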

2. Validation

If a transaction is valid but does not contain an appearance of the given address, that is a discrepancy between the tracing function of the portal node and the TrueBlocks/Erigon system (indicating a transaction execution bug somewhere).

3. Lookup

The portal user uses the index to obtain knowledge of which transaction to obtain. The transaction is most likely obtained from the history-network via the block_body by block_header_hash. The validity of such a transaction is therefore the domain of that network.

Summary

A small portal user who knows which transactions to request will only burden the network with a small number of proof requests. A heavy user can run a larger portal node (30% of the network, thus requiring fewer proofs) or just run a full node.

The Unchained Index is extremely thorough and picks up more address appearances than other systems designed for a similar purpose. It could be integrated either directly (with some storage overhead) or as the derivative index described above, which is better suited to lighter nodes.

A user starts their journey by providing their portal node with their wallet address 0x3cab...1234. The node requests ~350MB of address data for addresses starting with 0x3c since genesis. The node then searches that local data for 0x3cab...1234 and discovers which blocks they should request from the history-network. Replaying the relevant transactions locally, they can construct a meaningful history of personal activity and perform basic accounting.
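
The local search step in that journey can be sketched as follows, assuming the downloaded group data has been decoded into (address, block_number, transaction_index) tuples. That record layout and the function names are assumptions made for illustration, not the Unchained Index file format.

```python
from typing import Iterable, List, Tuple

Appearance = Tuple[int, int]  # (block_number, transaction_index)


def appearances_for(
    address: str, group_data: Iterable[Tuple[str, int, int]]
) -> List[Appearance]:
    """Filter one address out of a group's records, in chain order."""
    wanted = address.lower()
    return sorted(
        {(block, tx_index)
         for addr, block, tx_index in group_data
         if addr.lower() == wanted}
    )


def blocks_to_request(appearances: Iterable[Appearance]) -> List[int]:
    """Distinct block numbers to fetch from the history-network."""
    return sorted({block for block, _ in appearances})
```

Because the filtering happens locally, peers only learn that the node is interested in the 0x3c group, not which address within it.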
