OpenFarcaster / teleport

A fast implementation of a Farcaster Hub, in Rust.

Refactor indexing pipeline #8

Closed avichalp closed 6 months ago

avichalp commented 8 months ago

Before the hub can listen for or send Farcaster messages, it must index the required EVM events. This is necessary because new FIDs and signers are created onchain: the hub cannot validate and store protocol messages until it has user identity data such as FIDs, signers, and storage capacity.

The current approach is to index events from the known contracts before starting the hub. Once the older events are processed, we start the hub and listen to messages on the p2p network. We process the new events in a background thread as they arrive.

Here are a couple of proposals for improving this approach:

  1. Currently, we call eth_getLogs serially. This is hard to parallelize because of the dependencies among events: for instance, you cannot process a "remove signer" event without first processing the "add signer" event for the same signer, and a similar ordering constraint exists for FIDs. An alternative approach: first download all the events from start_block to latest_block and store them in the database, which can happen concurrently; then sort the downloaded events by block number and process them locally to populate the FID and Signer tables (see the first sketch after this list).

  2. We could make the indexing pipeline more generic by making it work with different databases. This is not relevant to the Teleport project per se, because Teleport only works with SQLite (which feels like the correct choice so far). However, I noticed that the Indexer only needs certain APIs to index all the required Farcaster identity data into a database. Any database, be it SQLite, Postgres, Redis, etc., can be used to index all relevant events as long as it implements specific trait(s), for instance a public API of StoreEvent, StoreFID, StoreSigner, RemoveFID, RemoveSigner, etc. (see the trait sketch below). Perhaps we can make the Indexer a standalone project within OpenFarcaster; Teleport can use it by implementing the Storage trait for SQLite and passing the SQLiteStorage struct to the Indexer.
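
A minimal sketch of the download-then-sort idea from point 1, assuming the ethers and futures crates; the chunk size, concurrency limit, and the `backfill` function itself are illustrative, not part of the current codebase:

```rust
use ethers::prelude::*;
use futures::stream::{self, StreamExt};

/// Blocks per eth_getLogs call; illustrative value.
const CHUNK: u64 = 10_000;

/// Download all logs for `contract` between the two blocks concurrently,
/// then restore global order before processing.
async fn backfill(
    provider: &Provider<Http>,
    contract: Address,
    start_block: u64,
    latest_block: u64,
) -> Result<Vec<Log>, ProviderError> {
    // Split the block range into fixed-size chunks.
    let ranges: Vec<(u64, u64)> = (start_block..=latest_block)
        .step_by(CHUNK as usize)
        .map(|from| (from, (from + CHUNK - 1).min(latest_block)))
        .collect();

    // Fetch chunks with up to 8 requests in flight; arrival order is arbitrary.
    let results: Vec<Result<Vec<Log>, ProviderError>> = stream::iter(ranges)
        .map(|(from, to)| async move {
            let filter = Filter::new()
                .address(contract)
                .from_block(from)
                .to_block(to);
            provider.get_logs(&filter).await
        })
        .buffer_unordered(8)
        .collect()
        .await;

    let mut logs: Vec<Log> = results
        .into_iter()
        .collect::<Result<Vec<_>, _>>()?
        .into_iter()
        .flatten()
        .collect();

    // Sort by (block, log index) so "add signer" is always processed
    // before the matching "remove signer".
    logs.sort_by_key(|l| (l.block_number, l.log_index));
    Ok(logs)
}
```

The downloaded logs could equally be staged in SQLite and ordered with an ORDER BY; the point is that fetching and ordering are decoupled, so the fetches can run concurrently.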
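
And a sketch of the Storage trait from point 2. The method names mirror the APIs suggested above; the exact signatures are guesses:

```rust
use async_trait::async_trait;

/// Hypothetical storage abstraction: any backend the Indexer can
/// write to implements this trait.
#[async_trait]
pub trait Storage: Send + Sync {
    type Error: std::error::Error;

    async fn store_event(&self, block: u64, log: Vec<u8>) -> Result<(), Self::Error>;
    async fn store_fid(&self, fid: u64, custody: [u8; 20]) -> Result<(), Self::Error>;
    async fn store_signer(&self, fid: u64, key: Vec<u8>) -> Result<(), Self::Error>;
    async fn remove_fid(&self, fid: u64) -> Result<(), Self::Error>;
    async fn remove_signer(&self, fid: u64, key: Vec<u8>) -> Result<(), Self::Error>;
}

/// The Indexer then depends only on the trait, not on SQLite.
pub struct Indexer<S: Storage> {
    store: S,
}

// Teleport would supply its own backend, e.g.:
// let indexer = Indexer { store: SqliteStorage::open("teleport.db")? };
```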

haardikk21 commented 8 months ago
  1. I think this makes sense for historical indexing, but for real-time indexing in the background I'm not sure the overhead of downloading first and sorting/processing later is worth it. I also wonder if this problem can be solved by just doing snapshot syncs instead. I know Hubble supports snapshot syncs and has a certain format; we can explore being compatible with that, or having our own snapshot format if we're not. In that case, I presume most people will use snapshots, and the historical sync issue is largely solved by that.

  2. Does this become much harder to do later? I'm not sure this is the best use of time right now; it makes sense but isn't immediately necessary. If you think this will become 10x harder over time, it may be worth considering.


Another question here: I believe FC migrated to a new contract due to an issue with the older ones.

I'm not 100% sure if that was an entirely new contract address or if the contract is upgradeable -> we should check and make sure we're not missing events prior to the migration.

avichalp commented 8 months ago

> I also wonder if this problem can be solved by just doing snapshot syncs instead.

I think this is precisely what we need here. Snapshot sync would be ideal. Do you know where we can get the latest snapshot?

> Does this become much harder to do later?

No, I think we can move it out later. I just wanted to call it out because I thought it was a good opportunity to abstract out some of the more generic parts. I will create a separate issue to track this.

> Another question here: I believe FC migrated to a new contract due to an issue with the older ones.

Yes, they have migrated a couple of times to new contract addresses. But each Farcaster contract migration re-emits the events for older FIDs and signers, so it seems sufficient to index only the latest contract.

For instance, after indexing the current contract, if I check the first 10 FIDs, I get FIDs 1 through 10, as expected:

select * from fids order by fid limit 10;

[Screenshot: query output listing FIDs 1 through 10]
avichalp commented 8 months ago

Looked into snapshot syncs. Here is how you can download the snapshots:

The problem here is that these snapshots are RocksDB dumps. That means we need to restore the RocksDB in memory, extract the data (e.g. casts, links, etc.), and store it in our SQLite db in the correct tables. Creating this pipeline seems like a big enough task in itself, so I will create a new issue to discuss it.
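
For reference, a rough first pass at such a pipeline, assuming the rocksdb and rusqlite crates, might copy the dump into a staging table and decode it separately; the table name and key handling here are purely illustrative:

```rust
use rocksdb::{IteratorMode, Options, DB};
use rusqlite::Connection;

/// First pass of a hypothetical snapshot restore: copy every key/value
/// pair out of the RocksDB dump into a SQLite staging table. A second
/// pass would decode the values into the proper casts/links tables.
fn restore_snapshot(snapshot_path: &str, sqlite_path: &str) -> anyhow::Result<()> {
    // Open the extracted dump read-only; we never write back to it.
    let rocks = DB::open_for_read_only(&Options::default(), snapshot_path, false)?;
    let sqlite = Connection::open(sqlite_path)?;

    sqlite.execute_batch(
        "CREATE TABLE IF NOT EXISTS raw_snapshot (key BLOB PRIMARY KEY, value BLOB);",
    )?;

    for entry in rocks.iterator(IteratorMode::Start) {
        let (key, value) = entry?;
        sqlite.execute(
            "INSERT OR REPLACE INTO raw_snapshot (key, value) VALUES (?1, ?2)",
            rusqlite::params![key.to_vec(), value.to_vec()],
        )?;
    }
    Ok(())
}
```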

gregfromstl commented 7 months ago

Will be partially completed with #14 (pending one final change)

haardikk21 commented 6 months ago

Going to close this since it's taken care of.