filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.82k stars 1.25k forks source link

Meta Issue: Fixing high impact correctness and performance problems in ETH RPC API for snapshot synced nodes #12293

Open aarshkshah1992 opened 1 month ago

aarshkshah1992 commented 1 month ago

This issue aims to be a meta-issue to capture and track work that needs to be done to enhance correctness, performance, and stability of the ETH RPC API on snapshot synced nodes. Note that improving performance for ETH RPC API on archival nodes is out of scope for this issue and will be addressed by a future issue.

Our goal is to improve the developer experience (DX) for key partners, including:

  1. ETH Subgraph Providers (e.g., Vulcanize)
  2. EVM Explorers (e.g., Blockscout)
  3. Cross-chain Bridges (e.g., Axelar)
  4. DApp Developers using existing ETH tooling which relies on ETH RPC API

Correctness and data availability issues in the chain state Indexes used by the ETH RPC API

Currently, we maintain three primary indices on the chain state, which are essential for both correctness and performance of multiple ETH RPC APIs.

Transaction Index

Message Index

Event Index

All of the above indices suffer from some or all of the following problems that need to be fixed:

Correctness problems in the ETH Events API

In-memory block caching for perf improvements

Multiple ETH RPC APIs frequently need to lookup Filecoin Tipsets and convert them to the correspondong Ethereum block representations. These lookups are performed on the chainstore which is expensive. We should cache these tipsets/blocks in an LRU cache. See #10520.

Miscellaneous correctness bugs from the backlog

snissn commented 1 month ago

For any and all of these that have ways to reproduce it would be great to coordinate and add the RPC call behind any of these issues to the RPC benchmark tool that Fil-B has been maintaining - https://github.com/fil-builders/benchmark-rpc/blob/main/pages/index.js#L21

For example this issue https://github.com/filecoin-project/lotus/issues/10940 should be easy to reproduce in a live test. I created a ticket for it https://github.com/FIL-Builders/benchmark-rpc/issues/1 to add it to the web app http://benchmark-rpc.fil.builders/

rvagg commented 1 month ago

They're not reset and rehydrated when a node syncs from a snapshot

msgindex is @ https://github.com/filecoin-project/lotus/blob/718fc0330f313a37d5deb73e6e05f0fac2c9b772/cmd/lotus/daemon.go#L647

It at least has a pattern we can follow for others. But it also overlaps with a backfill operation, so we may end up taking care of snapshot import with a general backfill routine if we get that right.

Stebalien commented 1 month ago

This list seems pretty complete. IMO, the highest priority is fixing the indexing issues:

  1. P0: Make sure that we backfill to the last indexed tipset on restart. We should never have "holes".
  2. P1: Make sure that we wait for the index in some cases, or even eagerly force it. E.g., as we discussed on the call, all of the EthGetBlockBy* commands should force the indexer to index that block (and its parents).

Indexing is done asynchronously to tipset/message execution but APIs that rely on these indices do not account for the async nature of indexing which leads to racy data avability issues for lookups at/near the chain head

IMO, the best way to handle this case is the dance we discussed on the call:

  1. Check the index. If present, return.
  2. Wait for the index to reach the current head.
  3. Check again.

This will miss uncles, but StateSearchMsg is designed to only find messages on the main chain.


I took a look at how geth handles stuff like this and... they also appear to index asynchronously and handle this case by returning an error if the node is currently indexing a block. That's not a terrible option... but it would be a larger breaking change.

BigLep commented 1 month ago

This is a great overview @aarshkshah1992 - thanks for writing it up. A few questions, some of which are coming from a newbie/ignorant-of-the-code perspective. I'm happy to chat on any of these elsewhere or offline, but figured to ask here so it's public.

  1. Can any of the the ETH RPC code simplify when we have fast finality (F3)? If so, given we’re so close to that being live on the network (less than 2 months), should we do work here with that assumption in mind?
  2. Assuming we "make sure that we backfill to the last indexed tipset on restart so we never have holes" and we "make sure that we wait for the index" can we simplify the code failing fast if we have an index miss? I'm eyeing your statements in the issue description about “the subsequent lookup for the message will be slow and likely fail if the query pertains to an ETH transaction” and “A miss on this Index leads to a painfully slow search through the state store for the requested message”. It's presumably not good to be in this state, but would it better off to fail fast then do expensive state traversals? (I assume private Lotus RPC methods can be invoked if someone needs the slow path.)
  3. Is there value in having separate db's? (I don't have insight here - just curious.). I see there is an issue about consolidating in https://github.com/filecoin-project/lotus/issues/10807 . (Maybe it's best to discuss this question there.)
aarshkshah1992 commented 1 month ago

@BigLep

  1. I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.
  2. I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.

Let me look into 2.

Stebalien commented 1 month ago

I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.

That depends on what we do with the API. If F3 is "fast enough", we could just not expose anything after finality. But... that's probably not going to work well.

I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.

I agree there's no reason to keep them separate.

aarshkshah1992 commented 1 month ago

@Stebalien @BigLep @rvagg

In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.

Let's get to it once we've fixed all the other problems here.

BigLep commented 3 weeks ago

In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.

Let's get to it once we've fixed all the other problems here.

For visibility, it was decided that it would be useful to merge the DBs into a single DB. The work is happening in https://github.com/filecoin-project/lotus/pull/12421