Open aarshkshah1992 opened 1 month ago
For any of these that can be reproduced, it would be great to coordinate and add the underlying RPC call to the RPC benchmark tool that Fil-B has been maintaining - https://github.com/fil-builders/benchmark-rpc/blob/main/pages/index.js#L21
For example this issue https://github.com/filecoin-project/lotus/issues/10940 should be easy to reproduce in a live test. I created a ticket for it https://github.com/FIL-Builders/benchmark-rpc/issues/1 to add it to the web app http://benchmark-rpc.fil.builders/
They're not reset and rehydrated when a node syncs from a snapshot
msgindex is @ https://github.com/filecoin-project/lotus/blob/718fc0330f313a37d5deb73e6e05f0fac2c9b772/cmd/lotus/daemon.go#L647
It at least has a pattern we can follow for others. But it also overlaps with a backfill operation, so we may end up taking care of snapshot import with a general backfill routine if we get that right.
This list seems pretty complete. IMO, the highest priority is fixing the indexing issues:
EthGetBlockBy* commands should force the indexer to index that block (and its parents).
Indexing is done asynchronously to tipset/message execution, but APIs that rely on these indices do not account for the async nature of indexing, which leads to racy data availability issues for lookups at/near the chain head.
IMO, the best way to handle this case is the dance we discussed on the call:
This will miss uncles, but StateSearchMsg is designed to only find messages on the main chain.
I took a look at how geth handles stuff like this and... they also appear to index asynchronously and handle this case by returning an error if the node is currently indexing a block. That's not a terrible option... but it would be a larger breaking change.
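To make the "return an error while indexing" option concrete, here is a minimal Go sketch. It is not Lotus or geth code: `asyncTxIndex`, `ErrIndexBehindHead`, and the watermark field are hypothetical names invented for illustration, and the index is an in-memory map rather than a real database. The idea is that a miss is only reported as "not found" once the indexer has caught up to the head the caller is asking about; before that, the caller gets a distinct, retryable error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrIndexBehindHead is a hypothetical sentinel error: the index has not yet
// processed every epoch up to the chain head, so a miss is inconclusive.
var ErrIndexBehindHead = errors.New("index has not caught up to chain head")

// asyncTxIndex is a hypothetical stand-in for an index that is populated
// asynchronously, after tipsets have already been executed.
type asyncTxIndex struct {
	mu          sync.RWMutex
	lastIndexed int64            // highest epoch fully processed; -1 if none
	epochByTx   map[string]int64 // transaction hash -> epoch
}

func newAsyncTxIndex() *asyncTxIndex {
	return &asyncTxIndex{lastIndexed: -1, epochByTx: make(map[string]int64)}
}

// MarkIndexed records the transactions of an epoch and advances the watermark.
func (ix *asyncTxIndex) MarkIndexed(epoch int64, txHashes ...string) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for _, h := range txHashes {
		ix.epochByTx[h] = epoch
	}
	if epoch > ix.lastIndexed {
		ix.lastIndexed = epoch
	}
}

// Lookup returns the epoch of a transaction. A miss is reported as
// "not found" (false, nil) only when the index has reached headEpoch;
// otherwise it returns ErrIndexBehindHead so the caller can retry.
func (ix *asyncTxIndex) Lookup(txHash string, headEpoch int64) (int64, bool, error) {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	if epoch, ok := ix.epochByTx[txHash]; ok {
		return epoch, true, nil
	}
	if ix.lastIndexed < headEpoch {
		return 0, false, fmt.Errorf("%w: indexed up to epoch %d, head is at epoch %d",
			ErrIndexBehindHead, ix.lastIndexed, headEpoch)
	}
	return 0, false, nil
}
```

A caller would check `errors.Is(err, ErrIndexBehindHead)` and either retry or surface the error to the client. That surfaced error is the breaking change mentioned above, since clients currently see an empty result instead.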
This is a great overview @aarshkshah1992 - thanks for writing it up. A few questions, some of which are coming from a newbie/ignorant-of-the-code perspective. I'm happy to chat on any of these elsewhere or offline, but figured to ask here so it's public.
@BigLep
Let me look into 2.
I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.
That depends on what we do with the API. If F3 is "fast enough", we could just not expose anything after finality. But... that's probably not going to work well.
I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.
I agree there's no reason to keep them separate.
@Stebalien @BigLep @rvagg
In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.
Let's get to it once we've fixed all the other problems here.
For visibility, it was decided that it would be useful to merge the DBs into a single DB. The work is happening in https://github.com/filecoin-project/lotus/pull/12421
This issue aims to be a meta-issue to capture and track work that needs to be done to enhance correctness, performance, and stability of the ETH RPC API on snapshot synced nodes. Note that improving performance for ETH RPC API on archival nodes is out of scope for this issue and will be addressed by a future issue.
Our goal is to improve the developer experience (DX) for key partners, including:
Correctness and data availability issues in the chain state indices used by the ETH RPC API
Currently, we maintain three primary indices on the chain state, which are essential for both correctness and performance of multiple ETH RPC APIs.
Transaction Index
Message Index
Event Index
All of the above indices suffer from some or all of the following problems that need to be fixed:
The lotus-shed backfilling CLI that users rely on for manually backfilling the indices is broken: all the indices are persisted in SQLite, and SQLite only supports a single writer. This effectively means that backfilling races with indexing new/ongoing state transitions.
Correctness problems in the ETH Events API
#12111 - Event Filter APIs have raciness that can return incorrect results.
#10911 - Mismatch between the block hash returned by the ETH Get Block API and the block hash returned by the ETH Events API. This one could have been caused by a re-org, but a solid itest to verify that this is no longer a problem would be great.
#11589 and #11153 - Event Filter APIs should work with the HTTP Gateway as expected by ETH tooling.
#10940 - eth_getLogs should differentiate between "processed the block, it has no events" and "never seen this block" errors. We already have the required scaffolding and metadata for this in place but need to fix the error handling here and write some solid tests.
In-memory block caching for perf improvements
Multiple ETH RPC APIs frequently need to look up Filecoin tipsets and convert them to the corresponding Ethereum block representations. These lookups are performed on the chainstore, which is expensive. We should cache these tipsets/blocks in an LRU cache. See #10520.
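As a rough illustration of the caching idea, the sketch below implements a small LRU cache in Go using only the standard library. Everything in it is an assumption for illustration: `ethBlock` is a placeholder for the converted Ethereum block, the string key stands in for a tipset key, and `load` stands in for the expensive chainstore lookup plus conversion. It is not taken from Lotus, which may prefer an existing LRU library.

```go
package main

import (
	"container/list"
	"sync"
)

// ethBlock is a hypothetical placeholder for the converted Ethereum block.
type ethBlock struct {
	Number int64
	Hash   string
}

// cacheEntry is what each list element stores.
type cacheEntry struct {
	key   string
	block ethBlock
}

// blockCache is a fixed-capacity LRU cache keyed by a (hypothetical) tipset key.
type blockCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func newBlockCache(capacity int) *blockCache {
	if capacity < 1 {
		capacity = 1
	}
	return &blockCache{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns a cached block and marks it as most recently used.
func (c *blockCache) Get(key string) (ethBlock, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return ethBlock{}, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*cacheEntry).block, true
}

// Put inserts or updates a block, evicting the least recently used entry
// when the cache grows past its capacity.
func (c *blockCache) Put(key string, b ethBlock) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*cacheEntry).block = b
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&cacheEntry{key: key, block: b})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).key)
	}
}

// GetOrLoad returns the cached block or calls load on a miss and caches the
// result. Concurrent misses for the same key may each call load once.
func (c *blockCache) GetOrLoad(key string, load func(string) (ethBlock, error)) (ethBlock, error) {
	if b, ok := c.Get(key); ok {
		return b, nil
	}
	b, err := load(key)
	if err != nil {
		return ethBlock{}, err
	}
	c.Put(key, b)
	return b, nil
}
```

One design choice worth discussing is the cache key: keying by tipset key rather than by height should avoid serving a stale block after a re-org, since a different tipset at the same height would have a different key.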
Miscellaneous correctness bugs from the backlog
#10909 - eth_getBlock does not conform to the ETH RPC spec for Filecoin null rounds (null rounds are a quirk in Filecoin and need to be handled correctly here).
#11635 - The ETH Trace API currently fails to include the byte code of the deployed smart contract in the trace output for transactions that deploy smart contracts. IIRC, Blockscout really needed this to be able to show the contract byte code on their explorer.
#10357 - Correctness bug in eth_getTransactionCount.