filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Refactoring event indexing and simplifying the ETH events RPC flow (and fixing a bunch of known issues with it) #12116

Open · aarshkshah1992 opened 2 weeks ago

aarshkshah1992 commented 2 weeks ago

What is the motivation behind this feature request? Is your feature request related to a problem? Please describe.

The current Chain Notify <> Event Index <> Event Filter Management <> ETH RPC Events API flow is racy, causes missed events, and is hard to reason about. It also has known problems such as the lack of automated backfilling, returning empty events for tipsets on the canonical chain even though those tipsets have events, and not using the event index as the source of truth. This issue proposes a new architecture/flow for event indexing and filtering to fix all of the above.

Describe the solution you'd like

I'd suggest the following refactor to fix all of the above and make this code easier to reason about.

aarshkshah1992 commented 2 weeks ago

cc @Stebalien @rvagg @raulk

rvagg commented 2 weeks ago

Some initial thoughts:

rvagg commented 2 weeks ago

Oh, and further to that last point, we really do need GC on this thing if we're going to make it a necessary add-on. Maybe then it becomes much less of an issue for people to turn on. If it GCs in step with splitstore and you only ever have less than a week's worth of events, there's less to be concerned about. I have a 33G events.db that's been collecting since nv22; those who have been running it since FEVM must have much larger databases.
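
For what it's worth, a minimal sketch of what retention-window GC over events.db could look like. This assumes a simplified schema (an `event` table with a `height` column and an `event_entry` table referencing it via `event_id`); the tick interval and retention window are illustrative, not Lotus's actual config:

```go
package eventindex

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// GCLoop periodically deletes indexed events that have fallen out of the
// retention window. head returns the current chain head epoch; retainEpochs
// is the window to keep (e.g. roughly a week, to line up with splitstore).
func GCLoop(ctx context.Context, db *sql.DB, head func() int64, retainEpochs int64) {
	ticker := time.NewTicker(time.Hour) // illustrative interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			minHeight := head() - retainEpochs
			if minHeight <= 0 {
				continue
			}
			// Delete child rows first so no event_entry rows are orphaned.
			if _, err := db.ExecContext(ctx,
				`DELETE FROM event_entry WHERE event_id IN
				   (SELECT id FROM event WHERE height < ?)`, minHeight); err != nil {
				log.Printf("event index GC: %v", err)
				continue
			}
			if _, err := db.ExecContext(ctx,
				`DELETE FROM event WHERE height < ?`, minHeight); err != nil {
				log.Printf("event index GC: %v", err)
			}
		}
	}
}
```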

aarshkshah1992 commented 2 weeks ago

@rvagg

aarshkshah1992 commented 2 weeks ago

@rvagg Any thoughts on how to implement those periodic consistency checks? In my mind, it can be as simple as:

"My head is now epoch E -> so E - 900 is now final -> fetch the messages for it from the state store -> match it with what we have in the event Index -> raise alarm if mismatch". This can be a go-routine in the indexer itself.

Stebalien commented 2 weeks ago

1. Having the event subscription only query the database makes total sense to me. I.e., process chain events once, put them in the database, then trigger queries over the database to send events back to the user.
2. I want this to work with the native APIs with minimal hackiness, so I'm not a fan of "interposing" the events sub-system between subscription to, e.g., chain notify events and the client. IMO, the events subsystem still needs some way to "block" on a GetLogs call.
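
Point 1 above amounts to a single write path with every reader going through the database. A minimal sketch of that shape, with all names hypothetical rather than Lotus's actual API:

```go
package eventindex

import (
	"context"
	"sync"
)

// Indexer sketches the single-writer flow: chain notify events are processed
// exactly once and persisted via persist; subscribers are only told "indexed
// up to H" and run their own queries against the database.
type Indexer struct {
	mu      sync.Mutex
	subs    []chan int64
	persist func(ctx context.Context, height int64) error // writes events for height to the DB
}

// Subscribe returns a stream of "new data up to this height" notifications.
func (ix *Indexer) Subscribe() <-chan int64 {
	ch := make(chan int64, 16)
	ix.mu.Lock()
	ix.subs = append(ix.subs, ch)
	ix.mu.Unlock()
	return ch
}

// OnChainNotify is the single write path from chain notify into the index.
func (ix *Indexer) OnChainNotify(ctx context.Context, height int64) error {
	if err := ix.persist(ctx, height); err != nil {
		return err
	}
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for _, ch := range ix.subs {
		select {
		case ch <- height: // wake the subscriber; it queries the DB itself
		default: // slow subscriber; it catches up on the next notification
		}
	}
	return nil
}
```
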
aarshkshah1992 commented 2 weeks ago

@Stebalien

"I want this to work with the native APIs with minimal hackiness, so I'm not a fan of "interposing" the events sub-system between subscription to, e.g., chain notify events and the client"

Wdym here? Can you please elaborate a bit? Which hackiness are you referring to? I am saying that the native ETH RPC Event APIs should subscribe to the Index DB update stream to listen for updates and forward them to the client (querying the DB if needed).

"IMO, the events subsystem still needs some way to 'block' on a GetLogs call."

With this design, this is just a matter of seeing an update event from the Index DB whose height is greater than the maxHeight requested by the GetLogs call.
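
A hedged sketch of that blocking behaviour, reusing the Indexer sketch above; `indexedHeight` and `queryDB` are hypothetical stand-ins:

```go
package eventindex

import "context"

// WaitAndQuery sketches a "blocking" GetLogs under this design: subscribe to
// the index's update stream, return early if the index is already at or past
// maxHeight, otherwise wait for an update at or beyond it, then query the DB.
func WaitAndQuery(ctx context.Context, ix *Indexer, indexedHeight func() int64, maxHeight int64,
	queryDB func(ctx context.Context, maxHeight int64) ([]string, error)) ([]string, error) {
	updates := ix.Subscribe() // subscribe first so no update is missed
	if indexedHeight() >= maxHeight {
		return queryDB(ctx, maxHeight)
	}
	for {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case h := <-updates:
			if h >= maxHeight {
				return queryDB(ctx, maxHeight)
			}
		}
	}
}
```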

rvagg commented 2 weeks ago

Checking at head-900 is what I was thinking; make big noise in the logs if there's a consistency problem. It's the kind of thing we could even remove in the future if it helps us build confidence.

AFDudley commented 2 weeks ago

"But it is theoretically a nice option to not have more filesystem cruft if you just want latest info. Do we rule that out as an option? Or re-architect this flow such that it still does its thing through one place but has the option to not do it through an sqlite db?"

We would use both options in production if they were available. We oftentimes use two completely different code bases to track the head of a chain versus returning historical results, then use reverse proxies and some chain-aware logic to route the requests. Similarly, supporting offline historical block processing is super useful.
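
One way to support both modes through a single flow, as floated above, would be to put the store behind an interface and let deployments swap the SQLite backend for a bounded in-memory one. A sketch with hypothetical names:

```go
package eventindex

import (
	"context"
	"sync"
)

// EventStore is a hypothetical abstraction that would let the one indexing
// flow run against either the persistent SQLite index or an ephemeral
// in-memory store for head-tracking-only deployments.
type EventStore interface {
	Put(ctx context.Context, height int64, events []string) error
	Get(ctx context.Context, minHeight, maxHeight int64) ([]string, error)
	Prune(ctx context.Context, belowHeight int64) error
}

// MemStore keeps only recent events in memory, so a node that only tracks
// the head needs no events.db on disk.
type MemStore struct {
	mu      sync.Mutex
	byEpoch map[int64][]string
}

func NewMemStore() *MemStore {
	return &MemStore{byEpoch: make(map[int64][]string)}
}

func (m *MemStore) Put(ctx context.Context, height int64, events []string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.byEpoch[height] = events
	return nil
}

func (m *MemStore) Get(ctx context.Context, minHeight, maxHeight int64) ([]string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	var out []string
	for h := minHeight; h <= maxHeight; h++ {
		out = append(out, m.byEpoch[h]...)
	}
	return out, nil
}

func (m *MemStore) Prune(ctx context.Context, belowHeight int64) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for h := range m.byEpoch {
		if h < belowHeight {
			delete(m.byEpoch, h)
		}
	}
	return nil
}
```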