cosmos / gaia

Cosmos Hub
https://hub.cosmos.network
Apache License 2.0

RPC node falls behind since >=v11 when queried #2733

Open · exp0nge opened this issue 9 months ago

exp0nge commented 9 months ago

Summary of Bug

We have noticed severe performance degradation since v11: nodes we query against fall behind very quickly. Some logs are attached below. This behavior is not noticeable on non-RPC nodes (nodes that are not being queried).

2:29PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=10 numToDial=0
2:29PM INF Timed out dur=3000 height=16902284 module=consensus round=0 step=3
2:29PM INF commit synced commit=436F6D6D697449447B5B323520363120323232203133372032353220323531203133352032353220313237203239203139362031303320363820353220393420373320323238203130382031373620323231203234312031313320
2:29PM INF committed state app_hash=193DDE89FCFB87FC7F1DC46744345E49E46CB0DDF17162ACC7A3BD207F3318D0 height=16902284 module=state num_txs=5
2:29PM INF Timed out dur=-542.021502 height=16902285 module=consensus round=0 step=1
2:29PM INF indexed block exents height=16902284 module=txindex
2:29PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=A9AE18C2422F1781EC1046E6B11EF70A3E2E679FD2301E33C2622AF4D09850FF commit_round=0 height=16902285 module=consensus proposal={}
2:29PM INF received complete proposal block hash=A9AE18C2422F1781EC1046E6B11EF70A3E2E679FD2301E33C2622AF4D09850FF height=16902285 module=consensus
2:29PM INF finalizing commit of block hash={} height=16902285 module=consensus num_txs=1 root=193DDE89FCFB87FC7F1DC46744345E49E46CB0DDF17162ACC7A3BD207F3318D0
2:29PM INF minted coins from module account amount=11906530uatom from=mint module=x/bank
2:29PM INF client state updated client-id=07-tendermint-913 height=1-5327682 module=x/ibc/client
2:29PM INF packet acknowledged dst_channel=channel-0 dst_port=transfer module=x/ibc/channel sequence=166437 src_channel=channel-391 src_port=transfer
2:29PM INF executed block height=16902285 module=state num_invalid_txs=0 num_valid_txs=1
2:29PM INF Connection is closed @ recvRoutine (likely by the other side) conn={"Logger":{}} module=p2p peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
2:29PM INF service stop impl={"Data":{},"Logger":{}} module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM INF Reconnecting to peer addr={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956} module=p2p
2:29PM INF service start impl="Peer{MConn{65.108.227.217:14956} 20e1000e88125698264454a884812746c2eb4807 out}" module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14
2:29PM INF service start impl=MConn{65.108.227.217:14956} module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM INF Timed out dur=3000 height=16902285 module=consensus round=0 step=3
2:29PM INF commit synced commit=436F6D6D697449447B5B313130203230332031343720323236203230372031343920313134203138352034332031353720363520313238203139332031323920383920373420333920343420313033203135322032323520373720
2:29PM INF committed state app_hash=6ECB93E2CF9572B92B9D4180C181594A272C6798E14D40A95E83439F28D963B4 height=16902285 module=state num_txs=1
2:29PM INF Timed out dur=-6316.474049 height=16902286 module=consensus round=0 step=1
2:29PM INF indexed block exents height=16902285 module=txindex
2:29PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=7B970C0472E419D0DAF04DC28B6158C219360B37852D9547884B6D47A1217585 commit_round=0 height=16902286 module=consensus proposal={}
2:29PM INF received complete proposal block hash=7B970C0472E419D0DAF04DC28B6158C219360B37852D9547884B6D47A1217585 height=16902286 module=consensus
2:29PM INF Connection is closed @ recvRoutine (likely by the other side) conn={"Logger":{}} module=p2p peer={"id":"57a5297537b9b6ef8b105c08a8ad3f6ac452c423","ip":"51.81.35.11","port":1618}
2:29PM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"57a5297537b9b6ef8b105c08a8ad3f6ac452c423","ip":"51.81.35.11","port":1618}
2:29PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

cc @ckarabats

Version

>=v11

Steps to Reproduce



MSalopek commented 9 months ago

Possibly related to #2415, #2726.

Have you tried running your node in debug mode (temporarily) and checking which RPC endpoints are being queried? You can do so by setting this in your config.toml:

# Output level for logging, including package level options
log_level = "debug"

Please note that if you are exposing your node to the public, you may need to put additional software in front of it, such as a load balancer or a proxy, much like you would for any other backend service.
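Once debug logging is on, the busiest endpoints can be tallied straight from the log output. A minimal sketch in Python; the token pattern below is an assumption about how method names appear in the logs (the exact debug log format varies between versions), not an official format:

```python
import re
from collections import Counter

def endpoint_counts(log_text):
    """Count RPC/gRPC method-like tokens in debug log output.

    The regex is a guess at how method names show up in the
    logs (GetTxRequest, BroadcastTxRequest, ...); adjust it to
    whatever your node version actually emits.
    """
    methods = re.findall(r"(?:Get|Broadcast|Simulate)[A-Za-z]+", log_text)
    return Counter(methods)

# Usage sketch: print the five busiest endpoints from a log file.
# with open("gaiad.log") as f:
#     for name, n in endpoint_counts(f.read()).most_common(5):
#         print(n, name)
```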

exp0nge commented 9 months ago

> Possibly related to #2415, #2726. Have you tried running your node in debug mode (temporarily) and checking which RPC endpoints are being queried? […]

The following gRPC requests are used:

In particular, GetBlockByHeightRequest is used heavily alongside GetTxRequest (I am switching over to GetEvents to batch the per-tx calls).
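The batching idea is roughly: instead of one GetTxRequest per transaction, group the work by block height so each height is fetched once with a GetEvents-style call. An API-agnostic sketch in Python; `batch_tx_queries` and the `(height, tx_hash)` shape are hypothetical stand-ins, not the actual Cosmos gRPC types:

```python
from collections import defaultdict

def batch_tx_queries(txs):
    """Group (height, tx_hash) pairs so each height is fetched once.

    Input: iterable of (height, tx_hash) tuples.
    Output: dict mapping height -> list of tx hashes, i.e. one
    per-height events query instead of one per-tx query.
    """
    batches = defaultdict(list)
    for height, tx_hash in txs:
        batches[height].append(tx_hash)
    return dict(batches)

# Example: five per-tx calls collapse into two per-height calls.
calls = [(100, "A"), (100, "B"), (101, "C"), (100, "D"), (101, "E")]
batched = batch_tx_queries(calls)
```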

exp0nge commented 9 months ago

> Possibly related to #2415, #2726. Have you tried running your node in debug mode (temporarily) and checking which RPC endpoints are being queried? […]

By the way, I read through some of the suggestions in https://github.com/cosmos/gaia/issues/2415, but this is a performance regression. We have worked around it in creative ways, but the nodes consistently die.

edit: We trust Interchain's/this repo's interpretation of the blockchain state; adding a layer of indirection like an explorer, while possible, is something we would very much like to avoid.

mmulji-ic commented 7 months ago

> edit: We trust Interchain's/this repo's interpretation of the blockchain state; adding a layer of indirection like an explorer, while possible, is something we would very much like to avoid.

Indeed @exp0nge, every layer and redirection adds overhead and complexity; that's something we always want to avoid if possible. The performance bottlenecks you've mentioned, regression or otherwise, lie in the lower levels of the stack (SDK / Comet). If there's a specific endpoint that has regressed in terms of performance, we can check with the relevant teams. Are GetBlockByHeightRequest and GetTxRequest the endpoints causing the most problems?

exp0nge commented 7 months ago

> Indeed @exp0nge, every layer and redirection adds overhead and complexity […] Are GetBlockByHeightRequest and GetTxRequest the endpoints causing the most problems?

Unsure exactly which call causes this, as we use all of the ones listed above, but the most heavily used are the tx.v1beta1 calls: BroadcastTxRequest and GetTxRequest.
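To narrow down which endpoint actually regressed, a per-endpoint timing wrapper is often enough. A hedged Python sketch; the commented-out endpoint callables are stand-ins, not real Cosmos client methods:

```python
import time
from statistics import median

def time_endpoint(call, n=10):
    """Invoke `call` n times and return the median latency in seconds.

    Median rather than mean, so a single slow outlier does not
    dominate the comparison between endpoints.
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return median(samples)

# Usage sketch: compare suspect endpoints side by side.
# endpoints = {"GetTx": lambda: client.GetTx(req),
#              "GetBlockByHeight": lambda: client.GetBlockByHeight(req)}
# for name, call in endpoints.items():
#     print(name, time_endpoint(call))
```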

MSalopek commented 3 months ago

@exp0nge Hey, has this issue been fixed in v15.1.0 for your use case?