feat: make it easy to sync hubs with databases

varunsrin commented 1 year ago

Synchronizing a hub's state with a database is non-trivial (see keeping up with the hubs), and most developers do not want to spend all this time just managing state.

What is the fastest way we could get developers to having a live, usable database?

varunsrin commented 1 year ago

Talked to @davidfurlong (Discove) about this today.

One approach is to write a node.js service that syncs a hub with a postgres db applying all the rules to keep a table or set of tables in sync. It's a somewhat rigid approach and requires that the database have specific schemas, but it will solve a lot of pain points for developers and help them bootstrap faster. We could also set up a docker configuration that sidecars this service with a hub, so someone can just spin up something in the cloud on AWS or Render or Railway that does this automatically.

This is somewhat narrow, and there might be a better solution that:

Allows devs more flexibility with defining their own schemas.
Allows devs to plug in other databases.
Doesn't require Node.js as a dependency.

deodad commented 1 year ago

A few thoughts on what makes syncing hard:

Mutable state makes polling hard

Since state can change (i.e. prunes, revokes, deletes), an application that has some partial state it wants to synchronize with hubs will need to 1) poll for any messages it doesn’t know about, 2) check all the messages it does know about to see if they are still in hub state.

Hub specific events make subscribing hard

Since events emitted by hubs are specific to that hub’s state, an application can’t easily subscribe to multiple hubs or change the hub it’s subscribed to without re-implementing merge logic.

Compare this to subscribing to Ethereum events where order is guaranteed and an application has a global high watermark for its state.

Volatile hub state makes implementing side effects hard

Hub states have a complex synchronization process that enables them to reach eventually consistency. Messages are able to be added and removed any number of times and hubs don’t distinguish between the first add and subsequent adds.

Pruning. Pruning is non-deterministic across hubs, thrash during this process can cause a messages to get merged and pruned multiple times.

Revocation and re-signing. Many messages are revoked and then merged again later with a different signer.

Backfilling. An application like Warpcast may want to re-submit a large number of old messages to hubs. This will result in many messages getting merged and pruned.

One way to help support 1 and 2 with the approach proposed is to add an indexer level event emitter. This event emitter would describe changes to the indexed state, rather than hub events.

Internally the indexer would use a combination of polling and subscription to synchronize with one or more hubs. The consumer of these events would have a single API to learn about all indexed state changes.

It could make sense to be explicit that these events are state changes to a derived set of data models rather than hub events directly:

castCreated | castUpdated | castDiscarded | castUndiscarded
likeCreated | likeUpdated | likeDiscarded | likeUndiscarded

Consumers could subscribe to these events and easily derive data into the stores and schemas of their choosing.

TimDaub commented 1 year ago

How hard is it to generalize rocksDB? E.g. write a general interface adapter for any database and then implement a driver for rocks and another one for Postgres?

sds commented 1 year ago

@deodad provided an excellent summary of the challenges anyone trying to interpret arbitrarily ordered delta graph messages into a consistent view, while also executing one-time side effects (e.g. sending push notifications) in a sensible way. We'll expand on this in our proposal we'll be sharing sometime this week.

How hard is it to generalize rocksDB? E.g. write a general interface adapter for any database and then implement a driver for rocks and another one for Postgres?

@TimDaub: after chatting with the team, we'd like to avoid introducing a generalized adapter interface for hubs themselves, as conceptually rocksDB is very different from a persistence layer like Postgres since it is intended to be executed by code in the process itself, whereas Postgres you are interacting with a remote process via the Postgres wire protocol. We hope to provide more detail in a proposal we'll be sharing sometime this week.

sds commented 1 year ago

We want to provide a solution that makes it easier for developers to get started creating applications that read data from the Farcaster network using tools they are likely more familiar with, e.g. relational data stores.

Proposal

At a high level, we want to ship an application (hub-indexer) which replicates data from a single hub into a persistent store of the user's choice (Postgres to start).

Invocation

The utility can be used by installing via NPM:

npm install @farcaster/hub-indexer

…and then running with the appropriate configuration:

export HUB_INDEXER_HUB_HOST=hoyt.farcaster.xyz:2283
export HUB_INDEXER_POSTGRES_URL=postgresql://username:password@hostname:5432/db_name
hub-indexer

We can also ship a Docker container to simplify installation further, allowing you to skip any installation and just run:

# .env file contains environment variables above
docker run --rm --env-file .env farcasterxyz/hub-indexer

Shipping a Docker image also allows users to deploy directly in their orchestrator of choice with very little effort, e.g. AWS Fargate, etc.

Configuration

The hub-indexer application will obtain configuration from the following sources. Last item wins in the event an option is specified in multiple locations:

.hub-indexer.yaml (or whichever configuration file is provided via --config flag)
.env file in the current working directory
Environment variables
CLI flags

Testing

We’ll run integration tests using a hub setup with some fake data submitted to it, and verifying that data is created in the relational store. This also ensures that as hubs change we ensure hub-indexer continues to work as expected.

Extending to other persistent stores

Default implementations for other persistent stores can be submitted via pull requests to the repository. By using Kysely, we’ll have out-of-the-box support for Postgres, MySQL, and SQLite. Kysely also has third-party support for other stores such as PlanetScale, SurrealDB, Cloudflare D1, etc.

Replicated data

If using a relational data store like Postgres (which will be the implementation that will ship to start), hub-indexer will simply create a single messages table to start with the following columns

Column Name	Data Type	Description
fid	`bigint`	FID of the user that signed the message.
type	`smallint`	Message type.
timestamp	`timestamp with time zone`	Message timestamp in UTC.
hash	`bytea`	Message hash.
hash_scheme	`smallint`	Message hash scheme.
signature	`bytea`	Message signature.
signer	`bytea`	Signer used to sign this message.
protobuf	`bytea`	Raw bytes representing the serialized message protobuf.
deleted_at	`timestamp with time zone`	When the message was deleted by the hub (e.g. in response to a `CastRemove` message, etc.)
pruned_at	`timestamp with time zone`	When the message was pruned by the hub.
revoked_at	`timestamp with time zone`	When the message was revoked by the hub due to revocation of the signer that signed the message.

FAQ

Why ship a separate hub-indexer application rather than include this logic in the hubs themselves? This provides greater flexibility. By having this code in hubs, you would need to run a hub in order to get data. With a separate application, you can point at any hub to get data. Whether you use your own hub or an external one is up to you.
What happens as the schema evolves over time? hub-indexer will automatically handle migrating a database schema when upgraded.

psatyajeet commented 1 year ago

This definitely would be useful - I use @gskril farcaster-indexer heavily to pull extra metadata (hashtags) from each post and surface them.

Would hub-indexer replace that?

manan19 commented 1 year ago

@sds Love most of what you've proposed for hub-indexer.

Couple of thoughts

Would messages be the only table getting indexed? Ideally as application devs, I'd want the message stream to construct tables that are directly usable for our apps. e.g. a profiles table compiles all the messages and just tells me what the user's profile picture is right now vs. knowing the history of changes that happened for a user's profile picture. There could be several ways developers might want to index the incoming messages, so I propose having some sort of a plugin system on top of hub-indexer that allows developers to create packages that can read off a message stream & index data in a specific way. Plugins might also allow users to combine information from different sources (e.g. on-chain information like ens look ups for a connected address etc.) and add that to a specific table managed by that plugin.
Personally, I care less about how many other databases does hub-indexer supports vs. how well can it index the data from the hubs into Postgres and making it extensible

gskril commented 1 year ago

I love the focus on making Farcaster data easier to work with but I'm not sure this gets us quite there.

hub-indexer will simply create a single messages table to start with the following columns

I've been working on migrating my indexer repo from Warpcast APIs to Hubs (https://github.com/gskril/farcaster-indexer/pull/15) and personally, all of the convenience comes from having multiple tables with already decoded data. I have the tables casts, profile, reaction, signer, and verification which are ready to read immediately after seeding the db.

A single table of hub messages with protobufs might cut out a few steps for me, but I don't think devs would be able to use this out of the box. At least I wouldn't use it like that myself. (which isn't necessarily bad! just not quite what I expected when reading the title and purpose of the proposal)

davidfurlong commented 1 year ago

Came here to say the same thing as greg! messages is insufficient - want to have as "ready to go" tables as possible to build any app on top of. That means formatting data in a nice way for consumption by backends - tables for casts, profiles, reactions and others.

It would also be nice to have easy to use views, functions or a library to automatically and performantly generate text strings from cast text + mentions + mentionPositions

vernonjohnson commented 1 year ago

This is step in right direction. Think what's important is that the indexer/etl service is modular and easily extendable. It should be straight forward to add additional dbs. You could have: 1) Some general datastore interface that abstracts away implementation details for the db (e.g., connect, addCast, removeCast) 2) To add new store, just need to provide implementation for target db that abides by interface

Other thoughts: 1) Agree with @davidfurlong and @gskril that indexer should provide ready to use tables instead of just messages. 2) Graphql api for querying indexed data would be nice 3) Dbs to support: postgres, mysql, mongo, elasticsearch

Happy to help with dev work on this.

matallo commented 1 year ago

I appreciate the thought and effort to making hubs more accessible. I'll paste part of the feedback I sent to @varunsrin in Farcaster:

I know the priority is on the hubs and the examples and README on GitHub look great, but would love to see an API reference of the Warpcast API ...

At the end of the day developers are used to consuming APIs and while this will lower the barrier to start building on top of Farcaster I tend to agree with others in this thread that the closest to that desired state is preferable: by having multiple tables for the different entities, or even going one extra step and providing an API to that data.

I understand the reasoning of running hubs and of course this can be too demanding to the Farcaster core team to open source the infrastructure needed to run an API. Maybe this is expected to be done by the developer community of Farcaster but at the same time I see it as the biggest barrier of entry.

AMAN-BARBARIA commented 1 year ago

I see 2 use cases here:

Use this as interim table for apps who want to build their own custom table for cast, reactions etc. This will be helpful to abstract the complexity of reading from hubs and directly consume updates from this table. The above schema can have an updated_at column. It will help to fetch & sync the recently modified messages.
Having a database that can be directly consumed by developers to build apps on top of it, as everyone has been pointing out. This could be another set of tables built using the above table.

blingblingdev commented 1 year ago

This feature proposal is amazing! It will certainly make using hubs easier. However, I prefer message queues like Kafka or RabbitMQ over other databases. There are several reasons for this:

I believe that hub messages are more event-driven, and traditional databases may not provide real-time data and may only be suitable for offline usage or have a delay.
With so many different databases available, such as PostgreSQL, MySQL, MongoDB, Redis, etc., it can be challenging to meet everyone's needs. A message queue would serve as a perfect middleware as developers can save data to any other database they prefer to use.
I was wondering if Hub is now using a nodejs-only protobuf library? If so, this will be an opportunity to make it easier for other programming languages to interact with the hub.
When compared to traditional databases, saving data to a message queue may have a lighter cost in terms of CPU and memory usage.
Even when traditional databases are supported, message queues may still be preferred for more advanced purposes.

I'm happy to contribute to this feature proposal.

AMAN-BARBARIA commented 1 year ago

Should we also add support for Non Relational databases. eg: MongoDB

Having support for non relational database should be helpful and it will also allow developers to choose the right database for their application and make their development process more efficient

davidfurlong commented 1 year ago

I think the solution here is an event driven adapter for hubs that is similar to the current watch events but instead of being raw hub events, it calls with events like { type: 'delete', table: 'casts', data: ... } and { type: 'insert', table: 'casts', data: ... }. Then the community can create it's own libraries to map these into different databases or event streams.

The problem this solves is that hub events, particularly ones around signers being invalidated are not a particularly entry-level developer friendly abstraction, and create a bunch of work for developers to figure out which objects need to be deleted. This library would essentially map the event data storage of hubs to a state oriented data structure, irrespective of desired storage for these objects.

This is similar to ethereum ETL/SQL infra tools that have a table of raw transactions, but then also provide tables around current NFT's held by wallets that are much easier to work with.

sds commented 1 year ago

Thanks everyone for your feedback! It’s great to validate that such tooling would be useful to the developer community.

The high-level themes from the feedback:

Derived data in the form of casts, reactions, etc. is more useful than a generic messages table.
Support for any data store (SQL/NoSQL/etc.) allows developers to use the technology they are proficient with.
Higher-level abstractions (like what the Warpcast API offers) can lower the barrier to entry.

With those themes outlined, let’s enumerate some considerations:

Developers want their lives made easier sooner rather than later. A simple solution that helps most developers get off the ground quickly is better than one that solves every use case quickly.
Any solution must be maintained over time as the protocol evolves, especially as developers come to depend on it. Minimizing maintenance cost ensures Farcaster continues to push forward.
Developers understand their own application needs best, but have a wide variety of experience with distributed systems. A solution should meet developers where they are.

With this in mind, we've been discussing as a team and are working on an updated proposal. Feel free to continue to leave comments on this issue, but otherwise stay tuned as we'll share something this coming week.

Thanks!

sds commented 1 year ago

Hey all, after further discussion within the team, we came to the conclusion that it would be difficult to offer a "one-size-fits-all" solution to this problem in a timeframe that would be useful to developers. We opted to build a working example of replicating data from hubs into a relational database (Postgres) in #938.

Building a working example with real code serves multiple purposes:

It allows us to provide a working example now, rather than later.
Developers can check out code locally and start syncing data with a single command. This "feels" better than asking developers to import a library and take extra steps to integrate with their applications before they've even had a chance to see what the data might look like.
It lowers the barrier to entry for building prototypes. With working code ingesting mainnet data to their local machine, developers can quickly try out ideas with real data using an ecosystem and tooling they are familiar with.
It doesn't prescribe a single solution to how to model/structure data, but it provides an example that illustrates how you could model your data, including how to think about side effects (see the TODO comments in the example).

The reality is that every data store is different, and appropriately modeling activity on the Farcaster network requires an understanding of your application needs and respective data store.

We would love for developers to copy+paste and adapt the example for other data stores. Feel free to submit any feedback here, or open PRs directly against the example. Thanks!

farcasterxyz / hub-monorepo