cosmos / relayer

An IBC relayer for ibc-go
Apache License 2.0
389 stars 1.71k forks source link

Timing out packets results in failures in some edge cases #567

Closed jtieri closed 2 years ago

jtieri commented 2 years ago

Yesterday we experienced a bug when attempting to relay packets between Sommelier & Osmosis with the following path information.

        src:
            chain-id: sommelier-3
            client-id: 07-tendermint-0
            connection-id: connection-0
            channel-id: channel-0
            port-id: transfer
            order: unordered
            version: ics20-1
        dst:
            chain-id: osmosis-1
            client-id: 07-tendermint-1745
            connection-id: connection-1348
            channel-id: channel-165
            port-id: transfer
            order: unordered
            version: ics20-1

A packet ended up in the queue that could not be relayed after repeated attempts and ultimately needed to be cleared via Hermes. Other packets in the queue were able to be cleared via a command that takes a sequence number but any time this one specific packet was touched the following errors would be logged.

E[2022-01-26|13:38:53.258] - [osmosis-1] -> err(rpc error: code = InvalidArgument desc = failed to execute message; message index: 1: receive packet verification failed: block timestamp >= packet timeout timestamp (2022-01-26 19:38:44.729870911 +0000 UTC >= 2022-01-26 19:29:50.316507123 +0000 UTC): packet timeout: invalid request)

This was the problematic transfer

{
  "body": {
    "messages": [
      {
        "@type": "/ibc.applications.transfer.v1.MsgTransfer",
        "source_port": "transfer",
        "source_channel": "channel-0",
        "token": {
          "denom": "usomm",
          "amount": "1000000"
        },
        "sender": "redacted",
        "receiver": "redacted",
        "timeout_height": {
          "revision_number": "1",
          "revision_height": "2958035"
        },
        "timeout_timestamp": "1643225390316507123"
      }
    ],
    "memo": "",
    "timeout_height": "0",
    "extension_options": [],
    "non_critical_extension_options": []
  },
  "auth_info": {
    "signer_infos": [],
    "fee": {
      "amount": [],
      "gas_limit": "200000",
      "payer": "",
      "granter": ""
    }
  },
  "signatures": []
}

After reaching out in the Osmosis IBC Integrators TG chat we received some reports of others running into this issue in the recent past, to quote some of that conversation here.

we debugged this, source of error on junoswap was related to nodes falling back slightly and a packet-timeout-timestamp set relatively low

upping the relative timeout height calculated from 2min to 5min solved the issues some users were experiencing

osmosis uses 10min for reference
jtieri commented 2 years ago

Saw this issue arise again today during a high congestion period on the channels between Juno & Osmosis with this path information

cwosmo:
        src:
            chain-id: juno-1
            client-id: 07-tendermint-0
            connection-id: connection-0
            channel-id: channel-47
            port-id: wasm.juno1v4887y83d6g28puzvt8cl0f3cdhd3y6y9mpysnsp3k8krdm7l6jqgm0rkn
            order: unordered
            version: ics20-1
        dst:
            chain-id: osmosis-1
            client-id: 07-tendermint-1457
            connection-id: connection-1142
            channel-id: channel-169
            port-id: transfer
            order: unordered
            version: ics20-1

Here is the error log again

E[2022-02-03|11:12:04.036] - [osmosis-1] -> err(rpc error: code = InvalidArgument desc = failed to execute message; message index: 1: receive packet verification failed: block timestamp >= packet timeout timestamp (2022-02-03 17:11:52.731651954 +0000 UTC >= 2022-02-03 07:47:09.797465124 +0000 UTC): packet timeout: invalid request) 

Once this error happens it becomes virtually impossible to relay on this path with the go-relayer, even attempting to manually clear the packets by sequence number fails for the troubled packets. In this case there were 10+ packets in the queue and all but 3 could be cleared

rly q unrelayed-packets cwosmo
{"src":[13,14,15],"dst":null}

I successfully updated the clients and switched nodes hoping that this would alleviate the issue but no luck there.

rly tx update-clients cwosmo
I[2022-02-03|11:15:07.269] ✔ [juno-1]@{1745098} - msg(0:/ibc.core.client.v1.MsgUpdateClient) hash(9FB33A4AB6BAC081D28E76176CFC40293123F47F1030227D9C286028C688369C) 
I[2022-02-03|11:15:15.215] ✔ [osmosis-1]@{3065778} - msg(0:/ibc.core.client.v1.MsgUpdateClient) hash(968B00541BBDC536165A0D656EB5977B092BB2695075A9F1759FECF7F5F67995) 
I[2022-02-03|11:15:15.215] ★ Clients updated: [juno-1]client(07-tendermint-0) {{1 1745096}}->{1745096} and [osmosis-1]client(07-tendermint-1457) {{1 3065775}}->{3065775} 
adizere commented 2 years ago

is it possible that you are submitting a RecvPacket here instead of submitting a TimeoutPacket? the error seems to indicate that

I’m asking also because it looks like osmosis is the the destination of the transaction

[osmosis-1] -> err(rpc error: code = InvalidArgument desc = failed to execute message; message index: 1: receive packet verification failed…

but the destination of timeouts should be the chain that originally did SendPacket (sommelier)

jtieri commented 2 years ago

Hey Adi, thanks for commenting on the issue!

It definitely seems like this is what's happening. Looks like maybe timeout timestamps are not being handled correctly and the relayer may be trying to send RecvPacket vs TimeoutPacket like you mentioned.

What's interesting is that this does not happen in all cases, I have checked our tx history for our primary relayer address and I can find multiple successful TimeoutPackets

jtieri commented 2 years ago

In the process of trying to fix ordered channel timeouts I noticed that timeouts never work when utilizing timestamp based timeouts.

It looks like in RelayPacketFromSequence() we call relayPacketsFromResultTx() and inside of this function there is no case for the timestamp being set.

https://github.com/cosmos/relayer/blob/8e627ebf1c182e307b99b2eb78807d202776dfef/relayer/provider/cosmos/provider.go#L1098-L1105