filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Scalable Lotus Node Infrastructure with Separated Datastore #10609

Closed snissn closed 1 year ago

snissn commented 1 year ago

What is the motivation behind this feature request? Is your feature request related to a problem? Please describe.

An API provider running the Filecoin Lotus software is looking to scale up its infrastructure. Currently, a single Lotus node both manages the Badger datastore and communicates with other Lotus nodes, which can become a bottleneck as the system grows.

Describe the solution you'd like

Proposed Solution

The proposal aims to split up a Lotus node into:

  1. A node that manages the Badger datastore.
  2. Multiple Lotus nodes that talk to the Badger datastore over the network.

To achieve this, we will fork and modify the go-ds-badger repository, renaming it go-ds-badger-client. This changes the datastore into a TCP client that sends JSON commands to a new component, the go-ds-badger-server, which runs on a separate machine and processes requests. We anticipate being able to handle the system's requirements, as a Lotus node is responsible for approximately 5 MB/s of disk writes.
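
For concreteness, here is a minimal sketch of what the go-ds-badger-client side could look like, assuming a simple one-JSON-object-per-request wire protocol. The frame layout, field names, and the Dial/Get/Put shapes are all assumptions for illustration, not a finished design.

```go
// Sketch of the proposed go-ds-badger-client: forwards datastore operations
// over TCP as JSON frames to a remote go-ds-badger-server (both names are
// from the proposal; the wire format here is an assumption).
package badgerclient

import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"
	"net"
	"sync"
)

// request and response are the assumed JSON frames: one JSON object per
// request, one per reply. []byte values are base64-encoded by encoding/json.
type request struct {
	Op    string `json:"op"` // "GET", "PUT", "HAS", "DELETE"
	Key   string `json:"key"`
	Value []byte `json:"value,omitempty"`
}

type response struct {
	Value []byte `json:"value,omitempty"`
	Found bool   `json:"found"`
	Err   string `json:"err,omitempty"`
}

// Client speaks the assumed protocol to a remote go-ds-badger-server over a
// single TCP connection, one request in flight at a time.
type Client struct {
	mu   sync.Mutex
	conn net.Conn
	bw   *bufio.Writer
	enc  *json.Encoder
	dec  *json.Decoder
}

func Dial(addr string) (*Client, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	bw := bufio.NewWriter(conn)
	return &Client{
		conn: conn,
		bw:   bw,
		enc:  json.NewEncoder(bw),
		dec:  json.NewDecoder(bufio.NewReader(conn)),
	}, nil
}

func (c *Client) roundTrip(ctx context.Context, req request) (response, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	var resp response
	if deadline, ok := ctx.Deadline(); ok {
		_ = c.conn.SetDeadline(deadline)
	}
	if err := c.enc.Encode(req); err != nil {
		return resp, err
	}
	if err := c.bw.Flush(); err != nil {
		return resp, err
	}
	if err := c.dec.Decode(&resp); err != nil {
		return resp, err
	}
	if resp.Err != "" {
		return resp, fmt.Errorf("remote datastore: %s", resp.Err)
	}
	return resp, nil
}

// Get and Put mirror the shape of go-datastore's methods; a real
// go-ds-badger-client would implement the full datastore interface so it can
// drop in where go-ds-badger is used today.
func (c *Client) Get(ctx context.Context, key string) ([]byte, error) {
	resp, err := c.roundTrip(ctx, request{Op: "GET", Key: key})
	if err != nil {
		return nil, err
	}
	if !resp.Found {
		return nil, fmt.Errorf("datastore: key %q not found", key)
	}
	return resp.Value, nil
}

func (c *Client) Put(ctx context.Context, key string, value []byte) error {
	_, err := c.roundTrip(ctx, request{Op: "PUT", Key: key, Value: value})
	return err
}

func (c *Client) Close() error { return c.conn.Close() }
```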

The primary goal of this plan is to have many RPC nodes that share a data resource to scale up various types of RPC requests.

Uncertainties

There are some uncertainties about how writes will work in this new configuration:

  * Chain syncing
  * RPC requests

Despite these uncertainties, we remain optimistic that our proposed solution will effectively address the scalability challenges and pave the way for a more robust infrastructure for RPC hosting providers.

ASCII Diagram Before:

+-----------------------+
|      Lotus Node       |
|  +-----------------+  |
|  | Badger Datastore|  |
|  +-----------------+  |
+-----------------------+

After:

+---------------------+     +---------------------+     +---------------------+
|  Lotus API Node 1   |     |  Lotus API Node 2   |     |  Lotus API Node N   |
+---------------------+     +---------------------+     +---------------------+

+----------------------+
| Lotus Writer Node(s) |
+----------------------+

+------------------------+
|  go-ds-badger-server   |
+------------------------+

Conclusion

In conclusion, by implementing the proposed solution to separate the Badger datastore from the Lotus nodes and introducing multiple Lotus API nodes, we expect to significantly improve the scalability of our Lotus node infrastructure. This approach will allow us to efficiently scale up read-only RPC requests while maintaining performance and consistency. However, we acknowledge that addressing write operations, such as sending transactions and chain syncing, will require a more nuanced solution to ensure datastore synchronization and maintain system integrity. As we move forward, we remain optimistic about finding a suitable approach for handling write operations and appreciate any feedback or suggestions that can help address these challenges.

ZenGround0 commented 1 year ago

At first glance I'm concerned about performance. IMO it will be important to get really early data on 1) read and write speed and 2) a small chain sync job, to make sure that performance is acceptable. One anecdote behind this: fil-infra is looking into moving from generating snapshots over the lotus API to writing snapshots directly to disk, for performance reasons. I don't see why good performance isn't possible in principle, but there may be a few hidden roadblocks.
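
One rough way to get that early data could be a tiny throughput harness run against whatever remote datastore client gets built. Everything below is a placeholder: the PutGetter interface stands in for that client, and the key/value scheme is synthetic.

```go
// A rough throughput harness for measuring read and write speed against a
// remote datastore client. PutGetter is a stand-in for whatever client the
// proposal ends up with.
package dsbench

import (
	"context"
	"crypto/rand"
	"fmt"
	"log"
	"time"
)

type PutGetter interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

// Measure writes n values of valSize bytes, reads them back, and logs MB/s.
func Measure(ctx context.Context, ds PutGetter, n, valSize int) error {
	val := make([]byte, valSize)
	if _, err := rand.Read(val); err != nil {
		return err
	}

	start := time.Now()
	for i := 0; i < n; i++ {
		if err := ds.Put(ctx, fmt.Sprintf("/bench/%d", i), val); err != nil {
			return err
		}
	}
	writeDur := time.Since(start)

	start = time.Now()
	for i := 0; i < n; i++ {
		if _, err := ds.Get(ctx, fmt.Sprintf("/bench/%d", i)); err != nil {
			return err
		}
	}
	readDur := time.Since(start)

	mb := float64(n*valSize) / (1 << 20)
	log.Printf("write: %.1f MB/s, read: %.1f MB/s",
		mb/writeDur.Seconds(), mb/readDur.Seconds())
	return nil
}
```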

One obvious and nice benefit of separation is removing compute contention between chain sync and splitstore (or other possible disk management approaches), which is currently hypothesized to be a problem. This separation further opens up the solution space of disk management approaches. For example, instead of a moving GC that swaps datastores on the same machine, we could stream keys to a new badger server and then swap the RPC calls over.

Something you've probably already thought of but worth bringing up -- when dealing with IPLD data, all k,v pairs are immutable. This presents at least a theoretical opportunity for parallel writes. Our use of badger is almost entirely IPLD, so it could be the case that you separate out only the IPLD data into the remote badger storage (leaving the /meta blockstore directly in lotus). Then the main contention problem from multiple writers is bloating of badger, which is potentially not even a problem we want to solve given a well-configured splitstore running moving GC, or potentially something you can solve well enough with a layer of smart caching above badger itself that filters out duplicate writes.
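
As a sketch of that caching idea: because IPLD keys are content-addressed and their values immutable, a small "already written" cache in front of the remote badger server can safely drop repeat Puts. The Blockstore interface below is a simplified stand-in, not the real lotus or ipfs blockstore API.

```go
// Sketch of a write-dedup layer over a remote block/datastore: since IPLD
// key/value pairs are immutable, a key seen once never needs to be written
// again, so duplicate Puts can be dropped before they hit the network.
package dedupbs

import (
	"context"
	"sync"
)

// Simplified stand-in for a content-addressed block store.
type Blockstore interface {
	Has(ctx context.Context, key string) (bool, error)
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, data []byte) error
}

// DedupBlockstore wraps a (remote) blockstore and remembers which keys it has
// already written, skipping duplicate Puts entirely.
type DedupBlockstore struct {
	inner Blockstore

	mu   sync.Mutex
	seen map[string]struct{}
	max  int // crude bound on the cache; a bloom filter or LRU would be better
}

func Wrap(inner Blockstore, max int) *DedupBlockstore {
	return &DedupBlockstore{inner: inner, seen: make(map[string]struct{}), max: max}
}

func (d *DedupBlockstore) Put(ctx context.Context, key string, data []byte) error {
	d.mu.Lock()
	if _, ok := d.seen[key]; ok {
		d.mu.Unlock()
		return nil // immutable data: safe to drop the duplicate write
	}
	d.mu.Unlock()

	if err := d.inner.Put(ctx, key, data); err != nil {
		return err
	}

	d.mu.Lock()
	if len(d.seen) >= d.max {
		d.seen = make(map[string]struct{}) // reset rather than evict, for brevity
	}
	d.seen[key] = struct{}{}
	d.mu.Unlock()
	return nil
}

func (d *DedupBlockstore) Get(ctx context.Context, key string) ([]byte, error) {
	return d.inner.Get(ctx, key)
}

func (d *DedupBlockstore) Has(ctx context.Context, key string) (bool, error) {
	d.mu.Lock()
	_, ok := d.seen[key]
	d.mu.Unlock()
	if ok {
		return true, nil
	}
	return d.inner.Has(ctx, key)
}
```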

ZenGround0 commented 1 year ago

A little more on performance. My worries might be mostly snapshot-related, which would be consistent with your existing estimates, since data throughput is probably a lot higher on those jobs. It's worth thinking through all of the read/write-intensive jobs when evaluating performance. I can think of:

  1. online chain sync
  2. catch up chain sync
  3. snapshot generation
  4. snapshot import
  5. Splitstore compaction / moving GC -- very likely you would want these on the badger server process but if not they'll need to be addressed
  6. Big one -- migrations. Right now a sector info (~full state) migration hits ~70 MiB / s on my probably-not-that-great machine and this might already be a bottleneck for SPs trying to get past migrations quickly

Stebalien commented 1 year ago

The main issue isn't duplicating the data, it's having every node maintain sync while also serving requests. Unfortunately, sharing a common blockstore will likely introduce more bottlenecks (the shared store itself) than it solves.

Instead, we need a leader/follower architecture where multiple lotus nodes can "follow" a leader node. It should be relatively easy to replace the lotus sync service with one that follows another node with ChainNotify (a sketch of such a follower loop appears after the list below). Additionally, we need to share state/indices:

  1. Allow all nodes to access a shared datastore backing the chainstore and statestore. Followers should only read from this store while the leader should write to it.
  2. Support postgresql and/or other SQL servers for indexing (events, etc.) so multiple "follower" nodes can share the same index without having to re-index and/or re-execute anything.
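
For illustration, here is a minimal sketch of such a follower loop. It assumes the lotus v1 JSON-RPC client in github.com/filecoin-project/lotus/api/client and the usual "current"/"apply"/"revert" head-change types; the leader URL and the applyTipSet/revertTipSet hooks are placeholders for where a real follower would update its local chainstore from the shared datastore.

```go
// Sketch of a follower that tracks a leader node's head via ChainNotify
// instead of running the sync service itself.
package main

import (
	"context"
	"log"

	"github.com/filecoin-project/lotus/api/client"
	"github.com/filecoin-project/lotus/chain/types"
)

func main() {
	ctx := context.Background()

	// Connect to the leader node's v1 RPC endpoint (placeholder address).
	leader, closer, err := client.NewFullNodeRPCV1(ctx, "ws://leader:1234/rpc/v1", nil)
	if err != nil {
		log.Fatalf("connecting to leader: %v", err)
	}
	defer closer()

	// ChainNotify emits a "current" change first, then "apply"/"revert"
	// changes as the leader's head moves.
	notifs, err := leader.ChainNotify(ctx)
	if err != nil {
		log.Fatalf("subscribing to head changes: %v", err)
	}

	for changes := range notifs {
		for _, hc := range changes {
			switch hc.Type {
			case "current", "apply":
				applyTipSet(hc.Val)
			case "revert":
				revertTipSet(hc.Val)
			}
		}
	}
}

// Placeholder hooks: a real follower would update its chainstore head here,
// reading any missing blocks/state from the shared datastore.
func applyTipSet(ts *types.TipSet)  { log.Printf("apply head %s (height %d)", ts.Key(), ts.Height()) }
func revertTipSet(ts *types.TipSet) { log.Printf("revert head %s (height %d)", ts.Key(), ts.Height()) }
```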

In terms of safety:

  1. "Local" state is written to the metadata store, which doesn't need to be shared.
  2. Writing to a shared blockstore is always safe, but I don't think that's necessary in this case. We can layer blockstores and let "followers" have temporary blockstores which can be frequently discarded (a minimal sketch of that layering follows this list).
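
A minimal sketch of that layering, using a simplified stand-in Blockstore interface rather than the real one: writes go to a discardable local layer, reads fall through to the shared store, and the local layer can be dropped at any time.

```go
// Sketch of a follower-side layered blockstore: the shared store is treated
// as read-only, and all local writes land in a temporary layer.
package layeredbs

import (
	"context"
	"sync"
)

// Simplified stand-in for a block store interface.
type Blockstore interface {
	Has(ctx context.Context, key string) (bool, error)
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, data []byte) error
}

type Layered struct {
	shared Blockstore // backed by the shared datastore; never written to here

	mu    sync.RWMutex
	local map[string][]byte // temporary follower-side writes
}

func New(shared Blockstore) *Layered {
	return &Layered{shared: shared, local: make(map[string][]byte)}
}

func (l *Layered) Put(ctx context.Context, key string, data []byte) error {
	l.mu.Lock()
	l.local[key] = data
	l.mu.Unlock()
	return nil
}

func (l *Layered) Get(ctx context.Context, key string) ([]byte, error) {
	l.mu.RLock()
	data, ok := l.local[key]
	l.mu.RUnlock()
	if ok {
		return data, nil
	}
	return l.shared.Get(ctx, key)
}

func (l *Layered) Has(ctx context.Context, key string) (bool, error) {
	l.mu.RLock()
	_, ok := l.local[key]
	l.mu.RUnlock()
	if ok {
		return true, nil
	}
	return l.shared.Has(ctx, key)
}

// Discard drops the temporary layer, e.g. once the leader has executed and
// persisted the corresponding state to the shared store.
func (l *Layered) Discard() {
	l.mu.Lock()
	l.local = make(map[string][]byte)
	l.mu.Unlock()
}
```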

The tricky part will be the "head" tipset. That won't get executed by the "leader" until the next epoch. We have two options:

  1. Provide some way to forbid querying the "latest" tipset in the API. Not sure how much this would break, but it would definitely be nice...
  2. Have followers ask the leader to execute this tipset for them. This works, but is a more invasive change.

This is half-way between a "light" node and a "full" node:

  1. It's light in that it doesn't actually sync the chain.
  2. It's "full" in that it does everything but sync the chain.

snissn commented 1 year ago

This has been deprecated in favor of https://github.com/filecoin-project/lotus/issues/10630.