Refactoring event Indexing and simplifying the ETH events RPC flow (and fix a bunch of known issues with it)

aarshkshah1992 commented 5 months ago

Checklist

[X] This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the lotus forum and select the category as Ideas.
[X] I have a specific, actionable, and well motivated feature request to propose.

Lotus component

[ ] lotus daemon - chain sync
[ ] lotus fvm/fevm - Lotus FVM and FEVM interactions
[ ] lotus miner/worker - sealing
[ ] lotus miner - proving(WindowPoSt/WinningPoSt)
[X] lotus JSON-RPC API
[ ] lotus message management (mpool)
[X] Other

What is the motivation behind this feature request? Is your feature request related to a problem? Please describe.

The current Chain Notify <> Event Index <> Event Filter Management <> ETH RPC Events API is racy, causes missed events, is hard to reason about and has known problems such as lack of automated backfilling, returning empty events for tipsets on the cannonical chain even though the tipset has events, not using the event Index as the source of truth etc. This issue aims to propose a new architecture/flow for event indexing and filtering to fix all of the above.

Describe the solution you'd like

After thinking deeply about this and going through the code -> I think our Chain Notify <-> Events Indexing <> Event Filter management <> ETH RPC Events API flow is racy/needs some refactoring to really fix the "events availability" problem for clients among some of the other issues we've run into.
Really, if the event index is enabled , then the event index should be the source of truth for all events that are served to clients via the Event RPC APIs
Given that the event index is needed for the correct functioning of the EthGetLogs , EthGetFilterChanges and EthGetFilterLogs anyways, I'd posit that we can really simplify and streamline things if these along with the EthSubscribe API use the event Index as the source of truth
But that is not the case currently. We do a dual write from the head changes to both the event index and the filter buffers: See https://github.com/filecoin-project/lotus/blob/6f821c3285e801a76d76d39444919c61acfbf627/chain/events/filter/event.go#L333 and then https://github.com/filecoin-project/lotus/blob/6f821c3285e801a76d76d39444919c61acfbf627/chain/events/filter/event.go#L340
The event filter buffers are used to serve the EthGetFilterLogs and EthGetFilterChanges APIs that are supposed to return the events for a given filter since the last time it was polled. The EthSubscribe and EthGetLogs APIs don't even need these buffers
For the EthGetFilterLogs and EthGetFilterChanges APIs, the way it works is that when the filter is first created, we "prefill" it with matching events from the Index DB and then rely on the buffer getting updated with events from tipset updates sent by ChainNotify. Every time the client polls these APIs, we return what we have in the buffer -> empty out the buffer -> start again (again relying solely on tipset updates -> the event index is no longer in the picture)
But, this is racy. See one example at https://github.com/filecoin-project/lotus/issues/12111 (there's no synchronization b/w prefilling from the DB and updating the buffer from the tipset updates -> this can cause lost event updates for the client)
In addition to this, we also have these known problems:
- We don't know if a tipset simply does not have events or if the index DB hasn't seen that tipset -> what should the API return to the client ?
- No automated backfilling of the Index DB (creates "holes" in the Index DB with events that should exist but don't) -> bad UX with node operators having to figure out backfilling on their own
- Async event processing causes clients to see no events for a tipset even though the tipset is already a part of the cannonical chain

I'd suggest the following refactor to fix all of the above and make this code easy to reason about

The Event Index DB becomes the source of truth (effectively addressing https://github.com/filecoin-project/lotus/issues/11830 )
Tipset updates (applies and reverts) coming from ChainNotify get written to a channel that the Event Index consumes from and applies event updates to the Index DB linearly one at a time(to not race between applies and reverts) -> there is no dual write to the filter buffers
Every update applied to the Event Index DB has a monotonically increasing ID
For a tipset with no events, the Event Index is updated with an empty event entry field ("I've seen this tipset but it has no events")
The Event Index allows subscription to an event stream containing the updates it makes to the DB
On subscribing to this stream("index stream"), a client gets the latest update made to the Index DB immediately and from there on, every subsequent Index update is published to the subscriber
Now here's how all the ETH RPC APIs work:
- EthSubscribe -> subscribes to the "index stream" -> forwards these updates to the RPC client's channel
- EthGetLogs -> subscribes to the "index stream" -> keeps consuming till it sees an update containing the maxheight requested by the user -> queries the DB for events matching the client's filter -> sends out the results
- EthGetFilterLogs and EthGetFilterChanges
  - For each filter, the only state we maintain is the "last sent Index update ID" and "last seen Index update ID" -> filter subscribes to the "index stream" -> keeps updating it's "last seen Index update ID"
  - When client polls the API -> send everything between "last sent update" <> "last seen update" and reset this state
When the Index boots up, it looks at it's own "highest non reverted tipset" -> looks at the current head of the chain as per ChanNotify -> fills in the all the missing updates in between ("automated backfilling") and only then processes tipset updates and "index stream" subscriptions.

aarshkshah1992 commented 5 months ago

cc @Stebalien @rvagg @raulk

rvagg commented 5 months ago

Some initial thoughts:

https://github.com/filecoin-project/lotus/issues/11830 has a slightly different perspective on "source of truth", but I think it can be reconciled with this proposal; you're basically saying that we should entirely ignore the blockstore for anything other than the initial ingest.
- The content-addresser in me is quite uncomfortable with this all-in on the db to be honest! I'm already very uncomfortable with our events db and its consistency with the actual events and would like to see any more investment in this coupled with some ongoing consistency-checking. Backfilling isn't enough IMO, we need to be checking historical to make sure we have it right. Mostly I just want to see log errors when we find that we're inconsistent. If we never see errors then maybe there's nothing to worry about, but right now I'm worried.
- On the other hand, doing it only one way would probably be better than the split way we do it now! So I'm in agreement that the current situation of db + chain read isn't ideal.
You need to address what we do with the current situation where historical events can be disabled but you can still get access to real-time subscription APIs. See all of the EventIndex == nil branches in the code. Right now, you can run without the events db but still subscribe to both the eth filter APIs and SubscribeActorEventsRaw as long as you don't ask for anything historical in your filter. I don't know if anyone uses this, I did see @ribasushi was trying it out a couple of months ago. But it is theoretically a nice option to not have more filesystem cruft if you just want latest info. Do we rule that out as an option? Or re-architect this flow such that it still does its thing through one place but has the option to not do it through an sqlite db?

rvagg commented 5 months ago

Oh, and further to that last point, we really do need GC on this thing if we're going to make it a necessary add-on. Maybe then it becomes much less an issue for people to turn on. If it GCs in time with splitstore and you only ever have less than a week's worth of events then there's less to be concerned about. I have a 33G events.db that's been collecting since nv22. Those who have been running it since FEVM must have much larger databases.

aarshkshah1992 commented 5 months ago

@rvagg

Agree on the GC part -> we should be GC'ing events when we GC state
Wrt handling the EventIndex == nil part, I'd be surprised if there's a valid use case for anything other than the Subscribe API in that case. Again, can talk to users to figure this out but we can support that easily by giving the filters a Publisher and that Publisher can be the Index DB if enabled or the ChainNotify stream if disabled.

aarshkshah1992 commented 5 months ago

@rvagg Any thoughts on how to implement those periodic consistency checks ?. In my mind, can be as simple as

"My head is now epoch E -> so E - 900 is now final -> fetch the messages for it from the state store -> match it with what we have in the event Index -> raise alarm if mismatch". This can be a go-routine in the indexer itself.

Stebalien commented 5 months ago

Having the event subscription only query the database makes total sense to me. I.e., process chain events once, put them in the database, then trigger queries over the database to send events back to the user.
I want this to work with the native APIs with minimal hackiness, so I'm not a fan of "interposing" the events sub-system between subscription to, e.g., chain notify events and the client. IMO, the events subsystem still needs some way to "block" on a GetLogs call.

aarshkshah1992 commented 5 months ago

@Stebalien

"I want this to work with the native APIs with minimal hackiness, so I'm not a fan of "interposing" the events sub-system between subscription to, e.g., chain notify events and the client"

Wdym here ? Please can you elaborate a bit ? Which hackiness are you referring to ? I am saying that the native ETH RPC Event APIs should subscribe to Index DB stream to listen in for updates and forward them to the client (querying the DB if needed).

IMO, the events subsystem still needs some way to "block" on a GetLogs call.

With this design, this is just a matter of seeing an update event from the Index DB whose height is greater than the maxHeight requested by the GetLogs call.

rvagg commented 5 months ago

Checking at head-900 is what I was thinking, make big noise in the logs if there's a consistency problem. It's the kind of thing we could even remove in the future if it helps us build confidence.

AFDudley commented 5 months ago

But it is theoretically a nice option to not have more filesystem cruft if you just want latest info. Do we rule that out as an option? Or re-architect this flow such that it still does its thing through one place but has the option to not do it through an sqlite db?

We would use both options in production if they were available. We often times use two completely different code bases to track the head of a chain vs return historical results. We then use reverse proxies and some chain aware logic to route the requests. Similarly, supporting offline historical block processing is super useful.

rvagg commented 4 months ago

Dropping in some links of relevant docs so I have somewhere to record this:

https://docs.google.com/document/d/1GkXnmTx0wu7iNubjMc-r5b1MOIha-yRNf0z4GUnjftE - "Filecoin Lotus FEVM Node Setup Notes" for FEVM RPC providers with scale
https://lotus.filecoin.io/tutorials/lotus/backfill-messages/

Mostly interesting for the references to backfilling because I think the way we currently do it is a bit broken. lotus-shed backfills directly into the sqlite db file that is being used read/write by the node which lotus-shed has to get message data from to do the backfills. We need some way of ensuring a single writer to the db, either by building in backfilling into lotus itself; or having some switch you can flick to make this work; or documenting a process (turn indexing off while you do it, turn it on when you're done [which isn't even a great answer if you think thorough the racy nature of that]).

rvagg commented 3 months ago

Thoughts on what to do on startup to auto-backfill missing events:

Add a configuration option MaxBackfillEpochsCheck defaulting to 900 but could be any number if you want to make it go right back.
When lotus starts and the events index initialises, block writes to the events db until we've done initial reconciliation (would a mutex lock work for that? just hold up the async Apply? i don't think there's a reason to block reads for this is there?)
Walk from the last processed tipset to the current known chain head (likely using ChainGetPath to figure out the tipsets between) and process all of those tipsets, reverting as you walk back to the current canonical chain and applying as you move forward to the head. a. One complication here is that if the distance between the two tipsets is really large it could block for a long time, is that OK? b. Do we even need a write-lock? New writes from the chain should be idempotent I think, unless there are new reorgs that happen while we are trying to reconcile. c. Perhaps we move the chain observer until after this is all done, so we don't even need a write lock but only start the observer when we are ready for it. e. What do we need in the way of read locks? IsHeightPast, IsTipsetProcessed, GetMaxHeightInIndex are all a concern here and maybe we need them to return something different during initial reconciliation? Could just say no for both of the Is* calls and GetMaxHeightInIndex could return the height-1 of the oldest tipset in ChainGetPath?. f. We could punt on this and just accept correctness concerns are part of lotus startup and then fix this at a later date so as not to complicate initial work.
At this point, the head of the events db is reconciled and we can unlock writes for new tipsets
Work backward up to MaxBackfillEpochsCheck, filling in any holes we find checking whether we processed that tipset or not (in the events_seen table).

Some judicious logging would be good for this too, "took 15 minutes to check 100,00 (MaxBackfillEpochsCheck) epochs for events auto-backfill, backfilled 0 of them; consider adjusting MaxBackfillEpochsCheck to a smaller number." (e.g. if the user set it large and we are doing useless checking on each startup because they got it done the first time).

An alternative approach to achieve "when you first do this, go back a loooong way, but on subsequent startups only go back to finality" might be to have both a MaxEpochsCheck (small, default to finality) and MaxInitialBackfillEpochs (defaults to finality too but could be very large). Then when we first open the db, if it doesn't exist, we use MaxInitialBackfillEpochs to walk back. That way, you have a simple workflow: rm sqlite/events.db if you want to start from scratch and use MaxInitialBackfillEpochs to determine how far.

Also to watch out for: using splitstore you're going to run into the problem of not having state at some epoch, we have to handle the case where the Max* you asked for isn't possible to fill. This should probably just result in a warning log and not be fatal.

filecoin-project / lotus