cosmos / ibc-go

Inter-Blockchain Communication Protocol (IBC) implementation in Golang.
https://ibc.cosmos.network/
MIT License
542 stars 580 forks source link

Default ordering in gRPC responses assumes lexicographic values of Height, not numerical value #1399

Open adizere opened 2 years ago

adizere commented 2 years ago

Anca pointed out that the default ordering in gRPC responses works on lexicographic value of Height, not on their numerical value.

Here is an example (trimmed from an actual query). If the chain has states for heights: 830, 821, 766, 222, 104, 76, 30, 25, 20, 10 The gRPC query (QueryConsensusStatesRequest) returns: 10, 104, 20, 222, 25, 30, 76, 766, 8, 821, 830

Originally posted by @adizere in https://github.com/cosmos/ibc-go/issues/798#issuecomment-1124789697

crodriguezvega commented 2 years ago

Thank you, @adizere and Anca!

catShaark commented 2 years ago

I suggest we change how we index ConsensusState in the store, I've posted an issue to describe my proposal.

catShaark commented 2 years ago

I suggest we change how we index ConsensusState in the store, I've posted an issue to describe my proposal.

So It turns out we can't do that as @AdityaSripal explain here, he suggested we instead use the metadata stored in tendermint client that does store the consensus state in numbering order. I think that's the best way to kill this problem. Here is what I think we should do in detail :

plz let me know what you guys think.

AdityaSripal commented 2 years ago

Thanks for the summary and suggestions @catShaark

add this "iterateConsensusStates" key to keys.go in 02-client/types

I'm a bit hesitant to do this since it's very tendermint client specific. Ideally we get to a point where we can easily upgrade the 24-host keys without major disruptions. (Connection Upgradability 👀 ). So upgrading core IBC to remove this technical debt will eventually be possible but is not a high priority for us at the moment.

Given that, i'd prefer the least disruptive short-term solution to get in for now.

Isn't it feasible to just get the consensus states in lexicographical order and then do a numerical sort in the rpc handler itself? If that's not too difficult, that might be fine to do for now.

If that isn't feasible then we can look to the next "least disruptive" solution.

Also, would be good to understand what if any are the concrete negative effects of this issue.

cc: @colin-axner @crodriguezvega

crodriguezvega commented 2 years ago

Isn't it feasible to just get the consensus states in lexicographical order and then do a numerical sort in the rpc handler itself? If that's not too difficult, that might be fine to do for now.

@catShaark and I discussed this a few weeks ago and the problem is that even if you order in the handler itself the consensusStateHeights that you retrieve from the store here this would only order the results for a given page (assuming that there are many results and you need pagination). So you might order the results returned for the page, but those results are not ordered globally for the whole set of results potentially available.

Also, would be good to understand what if any are the concrete negative effects of this issue.

@adizere can you (or maybe Anca) please explain us how this problem affects hermes? And how bad would it be if this cannot be solved until we have connection upgradability, as @AdityaSripal mentions in his comment above as the preferred solution?

adizere commented 2 years ago

please explain us how this problem affects hermes?

This affects Hermes in the sense that whenever the relayer does the QueryConsensusStatesRequest it puts a lot of pressure on the full node. This translated into operators needing very expensive machines (to handle this query) and instability of the node itself (when under high query pressure). More details on the production requirements follow.

The relayer needs to:

  1. as part of misbehavior checking: the relayer traverses the consensus states of a client and checks each of them; this traversal of consensus states is usually done in reverse order, starting from the latest one going back towards the oldest state (checking each state that is within the trusting period)

  2. as part of client upgrade and client update CLIs: the relayer needs a trusted height (a height at which that client has a consensus state installed). to resolve this trusted height, it needs to query all of the client's consensus states and traverse the list in reverse order

  3. (most important for production) as part of constructing proofs during normal-case packet relaying: if there are concurrent relayers updating a client, then Hermes cannot use the client's latest height (because another relayer has updated the client in the meantime), and it consenquently has to resolve the trusted height (similar to example 2 above). to resolve the trusted height, it again needs to fetch consensus states and traverse them in reverse order of their height until it finds one that is smaller than X (X is the height at which Hermes wants to construct proofs)

if you find it useful, the requirements in all of these examples are as follows:

For complete context, I would add that the QueryConsensusStatesRequest gRPC request is one of the most expensive calls that the relayers are doing in production. I did some quick benchmarks (in Sept last year) and this query was one order of magnitude slower than other client queries (it was 10 seconds I think, while most queries were at most 1 second on cheap machines). It is a major cause for pressure on full nodes (and costs, because full nodes have to be high-powered nodes that support high query volume). For reference, one of the most impactful optimization we did which improved the stability of relaying in Sept last year was related to skipping this gRPC request (https://github.com/informalsystems/ibc-rs/pull/1379).