filecoin-project / ref-fvm

Reference implementation of the Filecoin Virtual Machine
https://fvm.filecoin.io/

Technical design: Logs and events #728

Closed raulk closed 1 year ago

raulk commented 2 years ago

Context

The Ethereum blockchain has the concept of logs, which are events emitted from smart contracts during execution. Logs contain arbitrary data and are annotated with zero to four 32-byte topics depending on the opcode used (LOG0..LOG4). The fields from logs (topics, data, emitting address) are added to a 2048-bit bloom filter, which is then incorporated into the block header.

The bloom filter is important because it is used by:

  1. light clients and wallets to quickly evaluate if a block is of interest depending on what they are looking for.
  2. full nodes to service log-related JSON-RPC queries (eth_getLogs, eth_getFilterLogs, eth_getFilterChanges); either in a streaming or polling fashion. Filter support implies tracking state at the node level.
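As a sketch of the mechanism both consumers rely on, here is a minimal bloom filter in Rust. The hashing is a stand-in (std's DefaultHasher with per-position seeds), not Ethereum's actual scheme, which derives 3 bit positions from a keccak256 hash of each topic/address:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BLOOM_BITS: usize = 2048; // Ethereum's logs bloom is 2048 bits

struct Bloom([u8; BLOOM_BITS / 8]);

impl Bloom {
    fn new() -> Self {
        Bloom([0u8; BLOOM_BITS / 8])
    }

    // Derive 3 bit positions per item, as Ethereum does; the seeded
    // std hasher here is a stand-in for keccak256-derived positions.
    fn bit_positions(item: &[u8]) -> [usize; 3] {
        let mut positions = [0usize; 3];
        for (seed, pos) in positions.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            (seed as u64).hash(&mut h);
            item.hash(&mut h);
            *pos = (h.finish() as usize) % BLOOM_BITS;
        }
        positions
    }

    fn insert(&mut self, item: &[u8]) {
        for b in Self::bit_positions(item) {
            self.0[b / 8] |= 1 << (b % 8);
        }
    }

    // May return false positives, never false negatives: a light
    // client can safely skip a block when this returns false.
    fn maybe_contains(&self, item: &[u8]) -> bool {
        Self::bit_positions(item)
            .iter()
            .all(|&b| self.0[b / 8] & (1 << (b % 8)) != 0)
    }
}

fn main() {
    let mut bloom = Bloom::new();
    bloom.insert(b"Transfer(address,address,uint256)");
    assert!(bloom.maybe_contains(b"Transfer(address,address,uint256)"));
}
```

The one-sided error is what makes this cheap check useful: a negative answer lets a client discard the block without fetching anything else.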

AFAIK logs in Ethereum are not part of the world state, i.e. they are not stored in the state tree (we need to double-check this). They are just emitted during execution, and consensus is reached through the bloom filter, gas used, and other outputs.

Requirements

The EVM compatibility in Filecoin will need to support Ethereum logs at the protocol level and at the JSON-RPC level. We should avoid overfitting to Ethereum's needs -- this feature should be available to native actors too, and should be generally usable and accessible.

Possible design direction

At this stage, we do not plan on introducing modifications to the chain data structures, so populating an aggregation of logs in block headers is a no-go. That leaves us with three options:

Light client operation

In Ethereum, light clients monitor block headers containing event bloom filters to determine whether they want to act on a block. Since Filecoin does not include the logs bloom in a chain structure, Filecoin light clients would operate by fetching the current bloom from the system actor, accompanied by a Merkle inclusion proof.

Stebalien commented 2 years ago

~https://github.com/filecoin-project/ref-fvm/issues/784~

edit: hm. Wrong issue.

Stebalien commented 2 years ago

On bloom filters, we should revisit that decision from first principles.

From that, I'd say we should:

  1. Consider storing events (or at least the keys) in a HAMT (reset every epoch). Clients can download only the parts of the HAMT that they need.
  2. If we still need a bloom filter (likely easier for quick light-client checks), we should probably make the size dynamic depending on the number of events. This isn't something we can reasonably do if we put it into the block header itself, but it's something we can do if we put it in the state-tree.
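For point 2, the usual sizing formulas give a feel for how a dynamic filter would scale with event count; a sketch (the 1% false-positive target below is an assumed tuning knob, nothing decided here):

```rust
// Standard bloom-filter sizing: m = -n·ln(p) / (ln 2)^2 bits and
// k = (m/n)·ln 2 hash functions for n items at false-positive rate p.
fn bloom_params(n_events: usize, fp_rate: f64) -> (usize, u32) {
    let n = n_events.max(1) as f64;
    let ln2 = std::f64::consts::LN_2;
    let m_bits = (-(n * fp_rate.ln()) / (ln2 * ln2)).ceil() as usize;
    let k_hashes = ((m_bits as f64 / n) * ln2).round().max(1.0) as u32;
    (m_bits, k_hashes)
}

fn main() {
    // An epoch with 100 events sized for a 1% false-positive rate.
    let (m, k) = bloom_params(100, 0.01);
    assert_eq!((m, k), (959, 7));
}
```

Because m grows linearly with n, a busy epoch gets a proportionally larger filter instead of the fixed 2048 bits Ethereum commits to in the header.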

Where to put them...

I'd prefer to hang them off the block (treat them like receipts); we should talk with the core implementers to see how difficult this would be. For example, we could change BlockHeader.ParentMessageReceipts to actually be BlockHeader.ParentArtifacts (or something like that), including receipts, events, and anything else we need to stash in the block header. This should be quite doable (even simple), given that few components interact with the receipts.

If not, storing them in an actor isn't the end of the world. However:

  1. I'd just clear the list on every epoch.
  2. I wouldn't make the events available to other actors, that's not really what these are for.
Stebalien commented 2 years ago

Resolution from the discussion today:

Specifically, something like:

type BlockHeader struct {
    Miner address.Address // 0 unique per block/miner
    // ...
    ParentArtifacts cid.Cid
}

type ExecutionArtifacts struct { // name TBD
    // A variable-sized bloom filter to quickly tell what events may exist.
    EventBloomFilter []byte

    // An AMT of all events.
    Events cid.Cid

    // A HAMT indexing events mapping index keys to indices in the Events AMT.
    EventIndex cid.Cid
}

Design rationale:

Drawbacks:

Open Questions:

Stebalien commented 2 years ago

@raulk we should probably discuss the open questions in standup before continuing here.

Stebalien commented 2 years ago

Next step: Write up a series of use-cases to better understand the problem.

raulk commented 2 years ago

Use cases include:

raulk commented 2 years ago

We will need to associate the logs with the concrete messages that emitted them. Ethereum does this by embedding the logs in the receipt (including a bloom filter; I don't know whether it's scoped to the logs in that message or is the cumulative bloom filter up to that point -- I'd imagine the former). One idea is to have a top-level vector structure collecting all logs from the tipset, with receipts containing bitfields that address the emitted logs via their index into the vector. However, this makes producing inclusion proofs harder (I think), and it makes the message receipts less useful by themselves.
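The bitfield-into-vector idea could look like this sketch (a plain bool slice stands in for Filecoin's real RLE+ bitfields):

```rust
// The tipset carries one flat vector of all logs, and each receipt
// carries a bitfield selecting the entries its message emitted.
fn logs_for_receipt<'a, T>(all_logs: &'a [T], bitfield: &[bool]) -> Vec<&'a T> {
    all_logs
        .iter()
        .zip(bitfield)
        .filter_map(|(log, &set)| if set { Some(log) } else { None })
        .collect()
}

fn main() {
    let tipset_logs = ["transfer", "approval", "mint"];
    // This message emitted logs 0 and 2 of the tipset-wide vector.
    let bitfield = [true, false, true];
    let mine = logs_for_receipt(&tipset_logs, &bitfield);
    assert_eq!(mine, vec![&"transfer", &"mint"]);
}
```

The indirection is also where the proof difficulty comes from: proving one message's logs means proving both its receipt's bitfield and the selected entries of the shared vector.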

anorth commented 2 years ago

@Stebalien what is "index keys" that are the keys of the HAMT?

I agree that logs/events need to be referenced from the message receipts in order to be most useful to light clients, UIs etc. If we put such structure in the message receipts, then do we need the events and index in the block at all? They're committed via the receipts root CID.

Stebalien commented 2 years ago

what is "index keys" that are the keys of the HAMT?

TBD. We want to make it possible for a light client to get a succinct (and cheap) proof that some event did or did not happen in any given block.

Likely:

But I'm a bit concerned that the HAMT could grow large.

I agree that logs/events need to be referenced from the message receipts in order to be most useful to light clients, UIs etc. If we put such structure in the message receipts, then do we need the events and index in the block at all? They're committed via the receipts root CID.

Unfortunately, light clients would have to download all messages and receipts (including top-level return values) for that to work. We'd like light clients to be able to download just:

Then, if their event is in the bloom filter:

Stebalien commented 2 years ago

Concrete proposal:

  1. Introduce a new log syscall that takes a set of log topics and a block ID.
  2. Do NOT index anything (yet). Indexing will be handled in a followup FIP.
fn log(count: u32, topics: *const u8, value: BlockId)

Where:

Define an event object of the type:

struct Event {
    actor: ActorID,
    topics: Vec<u8>,
    value: Cid,
}

When an event is logged:

  1. Make a CID of the value block where:
    1. The CID is "inline" if the length of the value is <= 32 bytes.
    2. Otherwise, we hash with blake2b.
  2. Record an event object with the caller's ActorID, the specified topics, and the value CID.

When creating a message receipt, pack all events into an AMT in-order and include the AMT root in the receipt.
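The value-CID rule in step 1 might look like the sketch below. `ValueCid` is a simplified stand-in for the real cid crate type, and the hash function is injected so the sketch stays dependency-free; the actual implementation would produce proper multihash/CID encodings:

```rust
// The value CID is "inline" (an identity multihash carrying the bytes
// themselves) when the value fits in 32 bytes; otherwise the bytes are
// hashed with blake2b-256.
enum ValueCid {
    Inline(Vec<u8>),  // identity multihash: the value itself
    Hashed([u8; 32]), // blake2b-256 digest of the value
}

fn value_cid(value: &[u8], blake2b_256: impl Fn(&[u8]) -> [u8; 32]) -> ValueCid {
    if value.len() <= 32 {
        ValueCid::Inline(value.to_vec())
    } else {
        ValueCid::Hashed(blake2b_256(value))
    }
}

fn main() {
    // Stand-in hash for the demo; the real rule uses blake2b-256.
    let fake_hash = |bytes: &[u8]| -> [u8; 32] {
        let mut digest = [0u8; 32];
        digest[0] = bytes.len() as u8;
        digest
    };
    match value_cid(b"short value", fake_hash) {
        ValueCid::Inline(v) => assert_eq!(v, b"short value"),
        ValueCid::Hashed(_) => unreachable!("values <= 32 bytes are inline"),
    }
    assert!(matches!(value_cid(&[0u8; 64], fake_hash), ValueCid::Hashed(_)));
}
```

Inlining small values keeps the common case (short event payloads) free of an extra blockstore indirection.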

Decisions

raulk commented 2 years ago

Notes from sync design meeting + concrete proposals

Descoping indices

We are moving the indexes out of the scope of this solution. Right now we want to focus on the simplest, extensible solution that: (a) is not overengineered for what we need now, (b) does not back us into a design corner now without sufficient information, (c) is easily extensible in the future.

Storing raw events

For now, we will be storing the raw events only, allowing clients to experiment and generate indexes client side entirely. The schema of an event is as follows:

(see @Stebalien's comment above)

During execution, the Call Manager adds emitted events to the blockstore and populates an AMT tracking the CIDs of those event objects.

Commitment on chain

We extend the Receipt chain data structure with a new field:

pub struct Receipt {
    // existing fields
    exit_code: ExitCode,
    return_data: RawBytes,
    gas_used: i64,
    // new field
    events: Cid,
}

When the message is finalized, we return the Receipt with the events field populated.

Patterns of access

While the protocol does not mandate this, clients may wish to cache events in a local database for efficient access. With the structure above, it's possible to access events for a given message or all events for a tipset by returning events from all receipts.
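A sketch of that access pattern, with simplified stand-in types (a real client would resolve each receipt's `events` CID to an AMT in the blockstore rather than hold a Vec):

```rust
// With events hung off each receipt, "all events for a tipset" is just
// the concatenation of per-receipt events in execution order.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    actor: u64,     // emitting ActorID
    topics: Vec<u8>,
    value: Vec<u8>, // decoded value block
}

struct Receipt {
    events: Vec<Event>, // stand-in for the events CID in the receipt
}

// All events in a tipset, ordered by message execution order.
fn tipset_events(receipts: &[Receipt]) -> Vec<Event> {
    receipts
        .iter()
        .flat_map(|r| r.events.iter().cloned())
        .collect()
}

fn main() {
    let receipts = vec![
        Receipt { events: vec![Event { actor: 100, topics: vec![1], value: vec![] }] },
        Receipt { events: vec![] }, // a message that emitted nothing
        Receipt { events: vec![Event { actor: 101, topics: vec![2], value: vec![] }] },
    ];
    let all = tipset_events(&receipts);
    assert_eq!(all.len(), 2);
    assert_eq!(all[1].actor, 101);
}
```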

Ethereum JSON-RPC compatibility

At this stage, we do not track logs blooms, and we definitely do not track Ethereum-formatted blooms (fixed-size, keccak256-based hashing). The Ethereum JSON-RPC API will need to recreate the bloom filters on demand (or implementations could choose to do something different if they wish to optimise for faster bloom queries).

raulk commented 1 year ago

Draft FIP at https://github.com/filecoin-project/FIPs/pull/483.

raulk commented 1 year ago

We can consider the technical design phase to have finished, culminating with the FIP draft at https://github.com/filecoin-project/FIPs/pull/483. Closing this issue.