varunsrin closed this issue 1 year ago.
Talked to @davidfurlong (Discove) about this today.
One approach is to write a node.js service that syncs a hub with a postgres db applying all the rules to keep a table or set of tables in sync. It's a somewhat rigid approach and requires that the database have specific schemas, but it will solve a lot of pain points for developers and help them bootstrap faster. We could also set up a docker configuration that sidecars this service with a hub, so someone can just spin up something in the cloud on AWS or Render or Railway that does this automatically.
This is somewhat narrow, though, and there might be a better, more general solution.
A few thoughts on what makes syncing hard:
Since state can change (i.e. prunes, revokes, deletes), an application that has some partial state it wants to synchronize with hubs will need to 1) poll for any messages it doesn’t know about, 2) check all the messages it does know about to see if they are still in hub state.
Since events emitted by hubs are specific to that hub’s state, an application can’t easily subscribe to multiple hubs or change the hub it’s subscribed to without re-implementing merge logic.
Compare this to subscribing to Ethereum events where order is guaranteed and an application has a global high watermark for its state.
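The two-step reconciliation described above can be sketched as follows. Note that `fetchAllMessageIds`, `onMissing`, and `onRemoved` are hypothetical stand-ins, not real hub APIs; `known` is the application's local set of message hashes:

```typescript
// Sketch of the two-step sync an application must perform against a hub.
type Hash = string;

async function reconcile(
  known: Set<Hash>,
  fetchAllMessageIds: () => Promise<Hash[]>,
  onMissing: (hash: Hash) => Promise<void>, // fetch + merge an unknown message
  onRemoved: (hash: Hash) => Promise<void>, // handle a prune/revoke/delete
): Promise<void> {
  const hubHashes = new Set(await fetchAllMessageIds());
  // 1) Poll for any messages the app doesn't know about.
  for (const hash of hubHashes) {
    if (!known.has(hash)) await onMissing(hash);
  }
  // 2) Check all known messages to see if they are still in hub state.
  for (const hash of known) {
    if (!hubHashes.has(hash)) await onRemoved(hash);
  }
}
```

The second pass is what Ethereum-style ordered event streams let you skip: with a global high watermark, state behind the watermark never needs re-checking.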
Hub states have a complex synchronization process that enables them to reach eventual consistency. Messages may be added and removed any number of times, and hubs don’t distinguish the first add from subsequent adds.
Pruning. Pruning is non-deterministic across hubs; thrash during this process can cause a message to be merged and pruned multiple times.
Revocation and re-signing. Many messages are revoked and then merged again later with a different signer.
Backfilling. An application like Warpcast may want to re-submit a large number of old messages to hubs. This will result in many messages getting merged and pruned.
One way to help with points 1 and 2 in the proposed approach is to add an indexer-level event emitter. This event emitter would describe changes to the indexed state, rather than hub events.
Internally the indexer would use a combination of polling and subscription to synchronize with one or more hubs. The consumer of these events would have a single API to learn about all indexed state changes.
It could make sense to be explicit that these events are state changes to a derived set of data models rather than hub events directly:
- `castCreated` | `castUpdated` | `castDiscarded` | `castUndiscarded`
- `likeCreated` | `likeUpdated` | `likeDiscarded` | `likeUndiscarded`
Consumers could subscribe to these events and easily derive data into the stores and schemas of their choosing.
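A minimal sketch of such an indexer-level emitter, using the event names from the list above (the payload shapes are assumptions for illustration):

```typescript
import { EventEmitter } from "node:events";

// Indexer-level events describe changes to *derived* state (casts, likes),
// not raw hub events. Payload shapes here are illustrative assumptions.
type IndexerEvents = {
  castCreated: { hash: string; fid: number; text: string };
  castDiscarded: { hash: string }; // pruned, revoked, or deleted
};

class IndexerEmitter extends EventEmitter {
  emitEvent<K extends keyof IndexerEvents>(name: K, payload: IndexerEvents[K]) {
    return this.emit(name, payload);
  }
  onEvent<K extends keyof IndexerEvents>(
    name: K,
    fn: (payload: IndexerEvents[K]) => void,
  ) {
    return this.on(name, fn);
  }
}

// A consumer has a single API for all indexed state changes, regardless of
// how many hubs the indexer itself is synchronizing with internally.
const indexer = new IndexerEmitter();
indexer.onEvent("castCreated", (cast) => console.log("new cast", cast.hash));
indexer.emitEvent("castCreated", { hash: "0xabc", fid: 1, text: "hello" });
```

Because the emitter speaks in derived-state terms, swapping the hub(s) behind the indexer does not change the consumer-facing contract.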
How hard is it to generalize rocksDB? E.g. write a general interface adapter for any database and then implement a driver for rocks and another one for Postgres?
@deodad provided an excellent summary of the challenges faced by anyone trying to interpret arbitrarily ordered delta-graph messages as a consistent view while also executing one-time side effects (e.g. sending push notifications) in a sensible way. We'll expand on this in a proposal we'll be sharing sometime this week.
> How hard is it to generalize rocksDB? E.g. write a general interface adapter for any database and then implement a driver for rocks and another one for Postgres?
@TimDaub: after chatting with the team, we'd like to avoid introducing a generalized adapter interface for hubs themselves. Conceptually, rocksDB is very different from a persistence layer like Postgres: it is intended to be executed by code in the process itself, whereas with Postgres you are interacting with a remote process via the Postgres wire protocol. We hope to provide more detail in a proposal we'll be sharing sometime this week.
We want to provide a solution that makes it easier for developers to get started creating applications that read data from the Farcaster network using tools they are likely more familiar with, e.g. relational data stores.
At a high level, we want to ship an application (`hub-indexer`) which replicates data from a single hub into a persistent store of the user's choice (Postgres to start).
The utility can be used by installing via NPM:
```sh
npm install @farcaster/hub-indexer
```
…and then running with the appropriate configuration:
```sh
export HUB_INDEXER_HUB_HOST=hoyt.farcaster.xyz:2283
export HUB_INDEXER_POSTGRES_URL=postgresql://username:password@hostname:5432/db_name
hub-indexer
```
We can also ship a Docker container to simplify installation further, allowing you to skip any installation and just run:
```sh
# .env file contains environment variables above
docker run --rm --env-file .env farcasterxyz/hub-indexer
```
Shipping a Docker image also allows users to deploy directly in their orchestrator of choice with very little effort, e.g. AWS Fargate, etc.
The `hub-indexer` application will obtain configuration from the following sources. The last item wins in the event an option is specified in multiple locations:

- a `.hub-indexer.yaml` file (or whichever configuration file is provided via the `--config` flag)
- a `.env` file in the current working directory

We'll run integration tests using a hub set up with some fake data submitted to it, verifying that data is created in the relational store. This also ensures that as hubs change, `hub-indexer` continues to work as expected.
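The "last item wins" precedence across configuration sources can be sketched as a simple object merge (key names and the third source are illustrative assumptions, not the actual `hub-indexer` implementation):

```typescript
// Later sources override earlier ones, e.g. config file < .env < process env.
type Config = { hubHost?: string; postgresUrl?: string };

function resolveConfig(...sources: Config[]): Config {
  // Object spread keeps the last defined value for each key.
  return sources.reduce<Config>((acc, src) => ({ ...acc, ...src }), {});
}

const fromFile = { hubHost: "hoyt.farcaster.xyz:2283" };
const fromDotEnv = { postgresUrl: "postgresql://localhost:5432/db" };
const fromProcessEnv = { hubHost: "my-hub.example.com:2283" };

// hubHost comes from the last source that defines it.
console.log(resolveConfig(fromFile, fromDotEnv, fromProcessEnv));
```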
Default implementations for other persistent stores can be submitted via pull requests to the repository. By using Kysely, we’ll have out-of-the-box support for Postgres, MySQL, and SQLite. Kysely also has third-party support for other stores such as PlanetScale, SurrealDB, Cloudflare D1, etc.
If using a relational data store like Postgres (which will be the implementation that ships to start), `hub-indexer` will simply create a single `messages` table to start, with the following columns:
| Column Name | Data Type | Description |
|---|---|---|
| `fid` | `bigint` | FID of the user that signed the message. |
| `type` | `smallint` | Message type. |
| `timestamp` | `timestamp with time zone` | Message timestamp in UTC. |
| `hash` | `bytea` | Message hash. |
| `hash_scheme` | `smallint` | Message hash scheme. |
| `signature` | `bytea` | Message signature. |
| `signer` | `bytea` | Signer used to sign this message. |
| `protobuf` | `bytea` | Raw bytes representing the serialized message protobuf. |
| `deleted_at` | `timestamp with time zone` | When the message was deleted by the hub (e.g. in response to a `CastRemove` message). |
| `pruned_at` | `timestamp with time zone` | When the message was pruned by the hub. |
| `revoked_at` | `timestamp with time zone` | When the message was revoked by the hub due to revocation of the signer that signed the message. |
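For reference, the schema above might map to a row type like the following TypeScript sketch. The type mappings are approximations (`bytea` as `Uint8Array`, timestamps as `Date`), and `isLive` is a hypothetical helper, not part of the proposal:

```typescript
// Sketch of a row in the proposed `messages` table.
interface MessageRow {
  fid: bigint;            // FID of the user that signed the message
  type: number;           // message type (smallint)
  timestamp: Date;        // message timestamp in UTC
  hash: Uint8Array;       // message hash
  hashScheme: number;     // message hash scheme
  signature: Uint8Array;  // message signature
  signer: Uint8Array;     // signer used to sign this message
  protobuf: Uint8Array;   // raw serialized message protobuf
  deletedAt: Date | null; // set when deleted (e.g. via CastRemove)
  prunedAt: Date | null;  // set when pruned by the hub
  revokedAt: Date | null; // set when the message's signer was revoked
}

// Example "soft delete" check: a message is live only if none of the
// tombstone timestamps are set.
function isLive(row: MessageRow): boolean {
  return row.deletedAt === null && row.prunedAt === null && row.revokedAt === null;
}
```

Keeping `deleted_at`/`pruned_at`/`revoked_at` as distinct nullable timestamps, rather than hard-deleting rows, lets consumers distinguish why a message left hub state.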
Why ship a separate `hub-indexer` application rather than include this logic in the hubs themselves?
This provides greater flexibility. If this code lived in hubs, you would need to run a hub in order to get data. With a separate application, you can point it at any hub to get data. Whether you use your own hub or an external one is up to you.
What happens as the schema evolves over time?
`hub-indexer` will automatically handle migrating the database schema when upgraded.
This definitely would be useful. I use @gskril's farcaster-indexer heavily to pull extra metadata (hashtags) from each post and surface it.
Would `hub-indexer` replace that?
@sds Love most of what you've proposed for `hub-indexer`. A couple of thoughts:
Would `messages` be the only table getting indexed? Ideally, as application devs, I'd want the message stream to construct tables that are directly usable for our apps, e.g. a `profiles` table that compiles all the messages and just tells me what the user's profile picture is right now, vs. knowing the history of changes that happened to a user's profile picture.
There could be several ways developers might want to index the incoming messages, so I propose having some sort of plugin system on top of `hub-indexer` that allows developers to create packages that can read off a message stream and index data in a specific way. Plugins might also allow users to combine information from different sources (e.g. on-chain information like ENS lookups for a connected address) and add that to a specific table managed by that plugin.
Personally, I care less about how many other databases `hub-indexer` supports than about how well it can index the data from the hubs into Postgres and how extensible it is.
I love the focus on making Farcaster data easier to work with, but I'm not sure this gets us quite there.
> `hub-indexer` will simply create a single `messages` table to start with the following columns
I've been working on migrating my indexer repo from Warpcast APIs to hubs (https://github.com/gskril/farcaster-indexer/pull/15) and personally, all of the convenience comes from having multiple tables with already-decoded data. I have the tables `casts`, `profile`, `reaction`, `signer`, and `verification`, which are ready to read immediately after seeding the db.
A single table of hub messages with protobufs might cut out a few steps for me, but I don't think devs would be able to use this out of the box. At least I wouldn't use it like that myself. (which isn't necessarily bad! just not quite what I expected when reading the title and purpose of the proposal)
Came here to say the same thing as greg! `messages` is insufficient; I want tables that are as "ready to go" as possible for building any app on top of. That means formatting data in a nice way for consumption by backends: tables for casts, profiles, reactions, and others.
It would also be nice to have easy-to-use views, functions, or a library to automatically and performantly generate text strings from cast `text` + `mentions` + `mentionPositions`.
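A sketch of what such a text-assembly helper could look like. This assumes the offsets are character offsets for simplicity (real cast offsets may be byte-based), and renders each mentioned FID as a hypothetical `@<fid>` placeholder:

```typescript
// Render a cast's display text by splicing mentions into the raw text.
// `mentions` holds FIDs; `mentionPositions[i]` is the offset in `text`
// where `mentions[i]` should be inserted.
function renderCastText(
  text: string,
  mentions: number[],
  mentionPositions: number[],
): string {
  let result = "";
  let cursor = 0;
  mentions.forEach((fid, i) => {
    const pos = mentionPositions[i];
    result += text.slice(cursor, pos) + `@${fid}`;
    cursor = pos;
  });
  return result + text.slice(cursor);
}

// Example: mentions for FIDs 123 and 456 at offsets 6 and 11.
console.log(renderCastText("hello  and ", [123, 456], [6, 11]));
// -> "hello @123 and @456"
```

A production version would resolve each FID to a username, which is itself mutable state that needs indexing.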
This is a step in the right direction. I think what's important is that the indexer/ETL service is modular and easily extendable. It should be straightforward to add additional dbs. You could have: 1) a general datastore interface that abstracts away implementation details for the db (e.g., connect, addCast, removeCast); 2) to add a new store, you just provide an implementation for the target db that abides by the interface.
Other thoughts: 1) Agree with @davidfurlong and @gskril that the indexer should provide ready-to-use tables instead of just messages. 2) A GraphQL API for querying indexed data would be nice. 3) Dbs to support: Postgres, MySQL, Mongo, Elasticsearch.
Happy to help with dev work on this.
I appreciate the thought and effort put into making hubs more accessible. I'll paste part of the feedback I sent to @varunsrin on Farcaster:
I know the priority is on the hubs and the examples and README on GitHub look great, but would love to see an API reference of the Warpcast API ...
At the end of the day, developers are used to consuming APIs, and while this will lower the barrier to start building on top of Farcaster, I tend to agree with others in this thread that whatever comes closest to that desired state is preferable: having multiple tables for the different entities, or even going one step further and providing an API to that data.
I understand the reasoning for running hubs, and of course it may be too demanding for the Farcaster core team to open source the infrastructure needed to run an API. Maybe this is expected to be done by the Farcaster developer community, but at the same time I see it as the biggest barrier to entry.
I see 2 use cases here.
This feature proposal is amazing! It will certainly make using hubs easier. However, I prefer message queues like Kafka or RabbitMQ over other databases, for several reasons.
I'm happy to contribute to this feature proposal.
Should we also add support for non-relational databases, e.g. MongoDB? Supporting non-relational databases would be helpful; it would also allow developers to choose the right database for their application and make their development process more efficient.
I think the solution here is an event-driven adapter for hubs that is similar to the current watch events, but instead of emitting raw hub events, it calls with events like `{ type: 'delete', table: 'casts', data: ... }` and `{ type: 'insert', table: 'casts', data: ... }`. Then the community can create its own libraries to map these into different databases or event streams.
The problem this solves is that hub events, particularly the ones around signers being invalidated, are not a friendly abstraction for entry-level developers, and they create a bunch of work for developers to figure out which objects need to be deleted. This library would essentially map the event-based data storage of hubs to a state-oriented data structure, irrespective of the desired storage for these objects.
This is similar to Ethereum ETL/SQL infra tools that have a table of raw transactions, but also provide tables of the current NFTs held by wallets, which are much easier to work with.
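The event-driven adapter described in the comment above could be typed roughly as follows. The event shapes follow the examples given (`insert`/`delete` on named tables); the signer-revocation fan-out is a hypothetical illustration:

```typescript
// State-change events emitted by the adapter, instead of raw hub events.
type TableName = "casts" | "reactions" | "profiles";

type StateChange =
  | { type: "insert"; table: TableName; data: Record<string, unknown> }
  | { type: "delete"; table: TableName; data: Record<string, unknown> };

// Hypothetical mapping step: when a signer is invalidated, the adapter fans
// the single hub event out into per-object deletes, so consumers never have
// to reason about signers at all.
function signerRevokedToChanges(castHashes: string[]): StateChange[] {
  return castHashes.map((hash): StateChange => ({
    type: "delete",
    table: "casts",
    data: { hash },
  }));
}

const changes = signerRevokedToChanges(["0xaa", "0xbb"]);
console.log(changes.length); // -> 2, one delete per cast by the revoked signer
```

Community libraries could then consume `StateChange` streams and apply them to Postgres, Kafka, or any other sink without re-implementing hub merge semantics.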
Thanks everyone for your feedback! It’s great to validate that such tooling would be useful to the developer community.
Some high-level themes from the feedback:

- A set of ready-to-use tables such as `casts`, `reactions`, etc. is more useful than a generic `messages` table.
With this in mind, we've been discussing as a team and are working on an updated proposal. Feel free to continue to leave comments on this issue, but otherwise stay tuned as we'll share something this coming week.
Thanks!
Hey all, after further discussion within the team, we came to the conclusion that it would be difficult to offer a "one-size-fits-all" solution to this problem in a timeframe that would be useful to developers. We opted to build a working example of replicating data from hubs into a relational database (Postgres) in #938.
Building a working example with real code serves multiple purposes (see the `TODO` comments in the example).
The reality is that every data store is different, and appropriately modeling activity on the Farcaster network requires an understanding of your application's needs and its respective data store.
We would love for developers to copy+paste and adapt the example for other data stores. Feel free to submit any feedback here, or open PRs directly against the example. Thanks!
Synchronizing a hub's state with a database is non-trivial (see keeping up with the hubs), and most developers do not want to spend all this time just managing state.
What is the fastest way we could get developers to having a live, usable database?