cosmos / relayer

An IBC relayer for ibc-go
Apache License 2.0

Context deadline exceeded when starting go-relayer with an Osmosis node started more than 1 hr ago #1389

Open jununifi opened 5 months ago

jununifi commented 5 months ago

This issue is specific to the Osmosis node; we did not detect this problem with other nodes.

Way to reproduce: start go-relayer against an Osmosis node that has been running for more than an hour.

The only way we found to fix this was to restart the Osmosis node and then start go-relayer. The problem is that once go-relayer stops, it cannot be restarted without also restarting the Osmosis node.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2       600G  279G  296G  49% /

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32086       11559         586           0       19940       20060
Swap:           8191        2967        5224

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel Core Processor (Broadwell, no TSX, IBRS)
    CPU family:          6
    Model:               61
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            4788.90

jtieri commented 5 months ago

Thanks for opening the issue @jununifi.

Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

danbryan commented 5 months ago

In the past, I have had luck changing this setting in config.toml when dealing with context deadline exceeded errors.

    [rpc]
    timeout_broadcast_tx_commit = "120s"
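
Note: timeout_broadcast_tx_commit is a CometBFT [rpc] setting, so it goes in the chain node's own config.toml (for Osmosis, typically ~/.osmosisd/config/config.toml), and the node has to be restarted for the change to take effect.
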
jununifi commented 5 months ago

> Thanks for opening the issue @jununifi.
>
> Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

At the moment we are using Osmosis v22.0.1, but this used to happen on v20.x.x as well. Could the context deadline exceeded issue be related to memory consumption?

jununifi commented 5 months ago

> In the past, I have had luck changing this setting in config.toml when dealing with context deadline exceeded errors.
>
>     [rpc]
>     timeout_broadcast_tx_commit = "120s"

Thanks for the information, we will try updating the configuration. We are using go-relayer v2.3.1.

jtieri commented 5 months ago

> Thanks for opening the issue @jununifi. Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.

> At the moment we are using Osmosis v22.0.1, but this used to happen on v20.x.x as well. Could the context deadline exceeded issue be related to memory consumption?

Thanks for the additional information! My coworker had the thought that, since the issue only occurs after the Osmosis node has been running for some time, perhaps there is a memory leak in the Osmosis binary that causes a longer-than-usual response when the relayer attempts to query the channels on startup.
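
One rough way to check that hypothesis (a minimal Go sketch, not part of the relayer; the endpoint, port, and polling interval are assumptions) is to poll the node's RPC periodically and watch whether response times grow along with the node's uptime and memory usage. It does not time the channels query itself, but it can show general RPC slowdown:

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    func main() {
        // Assumed local CometBFT RPC endpoint; adjust host/port to your node.
        const url = "http://localhost:26657/status"
        client := &http.Client{Timeout: 60 * time.Second}

        for {
            start := time.Now()
            resp, err := client.Get(url)
            if err != nil {
                log.Printf("query failed after %s: %v", time.Since(start), err)
            } else {
                resp.Body.Close()
                log.Printf("HTTP %d in %s", resp.StatusCode, time.Since(start))
            }
            time.Sleep(time.Minute)
        }
    }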

dylanschultzie commented 5 months ago

I experienced this same issue, but with cosmoshub rather than osmosis.

One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process which has other working channels?

jtieri commented 5 months ago

> I experienced this same issue, but with cosmoshub rather than osmosis.
>
> One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process which has other working channels?

Interesting 🤔

I do agree that the current behavior is a bit extreme. I think addressing #1268 by only killing the PathProcessor related to the problematic node is a much better design, and it should be something we address in the next release. Implementing support for configuring multiple RPC endpoints would help alleviate some of this as well.
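
As an illustration of that direction only (a hedged sketch, not the relayer's actual code; pathProcessor and runPaths are hypothetical stand-ins), the idea is to let each path's processor fail on its own instead of cancelling a context shared by every path:

    package sketch

    import (
        "context"
        "log"
    )

    // pathProcessor is a hypothetical stand-in for the relayer's PathProcessor.
    type pathProcessor struct {
        name string
        run  func(ctx context.Context) error
    }

    // runPaths lets each path processor fail independently: a problematic node
    // only stops its own path, while the other paths keep relaying.
    func runPaths(ctx context.Context, procs []pathProcessor) {
        done := make(chan struct{})
        for _, p := range procs {
            p := p
            go func() {
                defer func() { done <- struct{}{} }()
                if err := p.run(ctx); err != nil {
                    // Log and drop just this path; do not cancel the shared context.
                    log.Printf("path %s stopped: %v", p.name, err)
                }
            }()
        }
        for range procs {
            <-done
        }
    }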

jtieri commented 5 months ago

One thing that we noticed is that we were using a single 60s timeout for the entire paginated query used for querying all channels. For chains like Osmosis and Cosmos Hub, which have a ton of channels in state, this could potentially be problematic. In #1395 I've made changes to use a 10s timeout per paginated query request, so this should alleviate some of the issues.

Will also prioritize only killing PathProcessors related to a problematic node vs. killing the entire rly process.
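
On the paginated-query point above, here is a minimal sketch of the per-page timeout idea (hypothetical helper names and types; not the actual change in #1395):

    package sketch

    import (
        "context"
        "time"
    )

    // channelPage and queryChannelsPage are hypothetical stand-ins for one page
    // of the relayer's paginated IBC channel query.
    type channelPage struct {
        channels []string
        nextKey  []byte
    }

    func queryChannelsPage(ctx context.Context, key []byte) (channelPage, error) {
        // ... one paginated RPC/gRPC request would go here ...
        return channelPage{}, nil
    }

    // queryAllChannels gives every page its own 10s deadline instead of one 60s
    // deadline for the whole walk, so chains with many channels don't time out.
    func queryAllChannels(ctx context.Context) ([]string, error) {
        var (
            all     []string
            nextKey []byte
        )
        for {
            pageCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
            page, err := queryChannelsPage(pageCtx, nextKey)
            cancel()
            if err != nil {
                return nil, err
            }
            all = append(all, page.channels...)
            if len(page.nextKey) == 0 {
                return all, nil
            }
            nextKey = page.nextKey
        }
    }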

akc2267 commented 4 months ago

This should be resolved in 2.5.1.

ertemann commented 4 months ago

@akc2267 @jtieri we still see this constantly, even on 2.5.1.

Both larger-config and smaller-config instances just go into an infinite loop where the relayer reports:

    2024-02-22T01:11:44.059393Z warn    Relayer start error {"error": "error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path"}
    Error: error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path

It reports this constantly for tens of channels and chains; it stalls after ~1-2s and the process restarts, running into the same error.

Our configs are correct, and we can curl the URLs from the server, proving it has access to the nodes.

ertemann commented 4 months ago

The funny thing is that although it's in a restart loop, we are processing packets.

jtieri commented 4 months ago

> @akc2267 @jtieri we still see this constantly, even on 2.5.1.
>
> Both larger-config and smaller-config instances just go into an infinite loop where the relayer reports:
>
>     2024-02-22T01:11:44.059393Z warn    Relayer start error {"error": "error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path"}
>     Error: error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path
>
> It reports this constantly for tens of channels and chains; it stalls after ~1-2s and the process restarts, running into the same error.
>
> Our configs are correct, and we can curl the URLs from the server, proving it has access to the nodes.

I would not expect to see these Ibc Error: Invalid query path logs for the originally described issue; this seems to be an entirely different issue altogether.

Do you know which chain is reporting these errors? I will likely need to do some local testing to suss out what the issue is with regard to that specific error.

ertemann commented 4 months ago

@jtieri it is indeed a separate issue; smaller configs now start.

The error logging does not show which exact chain is causing this, though, and I can't seem to narrow it down without doing an elimination test on the config, which supports > 50 chains.

ertemann commented 4 months ago

Happy to share the config privately, though, if that helps.