cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

Archive node missing blocks when under heavy gRPC pressure #8602

Closed: RiccardoM closed this issue 2 years ago

RiccardoM commented 3 years ago

Summary of Bug

When under heavy gRPC pressure (a lot of requests being made), a full node can start falling behind in block syncing.

Version

v0.40.1

Steps to Reproduce

  1. Start a full node with pruning = "nothing"
  2. Start performing a lot of gRPC requests (around 100 per block); a minimal load sketch is included below
  3. The node will start to slowly fall behind in block syncing
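
For step 2, a minimal sketch of this kind of load, assuming the node exposes gRPC on localhost:9090 and using the staking module's query client as a stand-in for the endpoints actually queried:

```go
package main

import (
	"context"
	"log"
	"sync"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
)

func main() {
	// Assumed gRPC endpoint of the node; adjust as needed.
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := stakingtypes.NewQueryClient(conn)

	// Fire ~100 concurrent queries, roughly the volume described above.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Empty status returns the full validator set.
			_, err := client.Validators(context.Background(), &stakingtypes.QueryValidatorsRequest{})
			if err != nil {
				log.Println("query failed:", err)
			}
		}()
	}
	wg.Wait()
}
```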

Context

We are currently developing BDJuno, a tool that listens to a chain's state and parses the data into a PostgreSQL database. In order to do so, it acts in two ways at the same time:

  1. Listens for new blocks
  2. Parses all old blocks

For each block, it then reads the different modules' states and stores them inside the PostgreSQL database. In other words, we take a snapshot of the state at each block and store it. To do so, we use gRPC to get all the data that can change from one block to another (e.g. delegations, unbonding delegations, redelegations, staking commissions, etc.).

As we also need to parse old blocks and get the state at very old heights, we set up an archive node with pruning = "nothing".
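
A minimal sketch of a single historical-height query of this kind, assuming gRPC on localhost:9090, a placeholder delegator address, and the x-cosmos-block-height gRPC metadata header (available in SDK versions that support height-aware gRPC queries) to target a past height:

```go
package main

import (
	"context"
	"fmt"
	"log"

	grpctypes "github.com/cosmos/cosmos-sdk/types/grpc"
	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

func main() {
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure()) // assumed endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := stakingtypes.NewQueryClient(conn)

	// Ask the node to run the query against the state at height 1000
	// instead of the latest height (old heights require an archive node).
	ctx := metadata.AppendToOutgoingContext(
		context.Background(), grpctypes.GRPCBlockHeightHeader, "1000",
	)

	// Placeholder delegator address; replace with a real bech32 address.
	res, err := client.DelegatorDelegations(ctx, &stakingtypes.QueryDelegatorDelegationsRequest{
		DelegatorAddr: "cosmos1...",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("delegations at height 1000:", len(res.DelegationResponses))
}
```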

When we first started our parser, everything was working properly: the node was able to keep up with syncing new blocks while answering gRPC calls.

Recently, however, we noticed that the node started to lag behind the chain state, ending up over 500 blocks behind. So, we stopped the parser and let the node catch up with the chain state again, then restarted the parser. One week later, the node is once again more than 1,000 blocks behind the current chain height.

Note
I have no idea whether this happens only because pruning is set to nothing. However, I believe this should be investigated, as it might result in some tools (e.g. explorers) stalling nodes in the future if too many requests are made to them. It could even be exploited as a DDoS attack against validator nodes, if this turns out to also affect nodes that have the pruning option set to default or everything.



tac0turtle commented 3 years ago

I believe this is a Tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue, and it is why we recommend validators not expose their RPC to the public network.

RiccardoM commented 3 years ago

> I believe this is a Tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue, and it is why we recommend validators not expose their RPC to the public network.

Are you referring to the RPC or gRPC? Because we noticed this problem only when querying using gRPC. When we only use the RPC, it has no problems.

tac0turtle commented 3 years ago

All requests in the SDK are routed through Tendermint. The request goes through the abci_query ABCI method.
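
For reference, a rough sketch of what that routing amounts to: the same query can be issued by hand as an abci_query against the Tendermint RPC, using the gRPC method name as the query path and the proto-encoded request as data (endpoint and address below are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"

	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
	rpcclient "github.com/tendermint/tendermint/rpc/client"
	rpchttp "github.com/tendermint/tendermint/rpc/client/http"
)

func main() {
	// Assumed Tendermint RPC endpoint of the node.
	client, err := rpchttp.New("tcp://localhost:26657", "/websocket")
	if err != nil {
		log.Fatal(err)
	}

	// Proto-encode the request that the gRPC query service would normally receive.
	req := banktypes.QueryAllBalancesRequest{Address: "cosmos1..."} // placeholder address
	bz, err := req.Marshal()
	if err != nil {
		log.Fatal(err)
	}

	// The gRPC method name becomes the ABCI query path.
	res, err := client.ABCIQueryWithOptions(
		context.Background(),
		"/cosmos.bank.v1beta1.Query/AllBalances",
		bz,
		rpcclient.ABCIQueryOptions{Height: 0, Prove: false},
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("abci_query response code:", res.Response.Code)
}
```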

RiccardoM commented 3 years ago

> All requests in the SDK are routed through Tendermint. The request goes through the abci_query ABCI method.

Ok thanks. Is there an issue opened in Tendermint about this? Maybe we can link it here for future reference

tac0turtle commented 3 years ago

There doesn't seem to be one, it's also a mix of multiple issues. Do you want to open an issue that links to this one?

alexanderbez commented 3 years ago

That still doesn't describe it @marbar3778. Why do RPC and legacy API endpoints work "fine", i.e. no regressions, yet gRPC slows down nodes considerably?

tac0turtle commented 3 years ago

I can reproduce this on the Tendermint RPC as well. It's a bit harder than with gRPC, but still present. gRPC was built to handle concurrent requests, but I don't think any part of our stack can handle concurrent requests at high volume.

To reproduce with Tendermint:
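
A rough sketch of this kind of load, assuming the node's RPC listens on tcp://localhost:26657 (the exact calls and volume may differ):

```go
package main

import (
	"context"
	"log"
	"sync"

	rpchttp "github.com/tendermint/tendermint/rpc/client/http"
)

func main() {
	client, err := rpchttp.New("tcp://localhost:26657", "/websocket") // assumed endpoint
	if err != nil {
		log.Fatal(err)
	}

	// Generate heavy concurrent RPC load by fetching the latest block repeatedly.
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := client.Block(context.Background(), nil); err != nil {
				log.Println("request failed:", err)
			}
		}()
	}
	wg.Wait()
}
```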

alexanderbez commented 3 years ago

I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient? Why did block explorers and clients never report such issues for RPC and the legacy API?

tac0turtle commented 3 years ago

> I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient?

It would be more efficient in almost every way if Tendermint were not being used as a global mutex. Right now all calls are routed through Tendermint, and the known mutex contention when using the RPC is being felt.

> Why did block explorers and clients never report such issues for RPC and the legacy API?

I am guessing no one was making so many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people to not expose their RPC endpoints to the public.

alexanderbez commented 3 years ago

> I am guessing no one was making so many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people to not expose their RPC endpoints to the public.

They were though. Juno for example did this w/o slowing down the connected node at all. Block explorers continuously use and call the RPC to fetch and index data to external data sources.

aaronc commented 3 years ago

Can someone from our team investigate if there has indeed been a performance regression with gRPC related to these cases? My guess is that it's likely not gRPC per se, but something else in the query handling... Can you triage @clevinson ?

tac0turtle commented 3 years ago

I think this https://github.com/cosmos/cosmos-sdk/pull/10045 may help out. gRPC is natively concurrent, but all the queries are queued behind a single mutex. 0.34.13 makes this mutex an RWMutex, but the mentioned PR should remove the need for gRPC requests to be routed through Tendermint at all. @RiccardoM would love to see if the PR helps.
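
As a generic Go illustration (not SDK code) of why that matters: with a plain sync.Mutex all read-only queries serialize behind one another, while a sync.RWMutex lets them proceed concurrently:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// query simulates a read-only request that holds the lock for a while.
func query(lock sync.Locker, d time.Duration) {
	lock.Lock()
	defer lock.Unlock()
	time.Sleep(d)
}

// run fires 10 concurrent "queries" guarded by the given lock and reports the wall time.
func run(name string, lock sync.Locker) {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			query(lock, 50*time.Millisecond)
		}()
	}
	wg.Wait()
	fmt.Printf("%s: 10 concurrent reads took %v\n", name, time.Since(start))
}

func main() {
	var mu sync.Mutex
	run("sync.Mutex  ", &mu) // reads queue up behind a single lock (~500ms)

	var rw sync.RWMutex
	run("sync.RWMutex", rw.RLocker()) // reads proceed concurrently (~50ms)
}
```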

faddat commented 3 years ago

I am almost certain this is related in some way to

https://github.com/cosmos/gaia/issues/972 https://github.com/cosmos/gaia/issues/704

...and I've definitely seen similar behavior to this on any node I've used for relaying.

tac0turtle commented 2 years ago

Closing this for now since gRPC is no longer routed through Tendermint.