akash-network / support

Akash Support and Issue Tracking

Excessive Bandwidth Consumption Post v36.0 Node Upgrade #231

Closed 88plug closed 1 week ago

88plug commented 1 week ago

Description:
Since upgrading to Akash node release v36.0, nodes have been consuming an unusually high amount of bandwidth, far exceeding the previous usage. This has resulted in "out of bandwidth" notifications across multiple nodes in various datacenters, as well as noticeable lag on residential networks. No changes were made to the default deployment code.

To Reproduce:

  1. Deploy an Akash node using the default deployment code.
  2. Monitor the bandwidth usage of the node.

Expected Behavior:
The node should sustain approximately 5,100,000 bps (5.1 Mbps) of incoming and 6,000,000 bps (6 Mbps) of outgoing traffic, consistent with pre-upgrade usage.
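As a sanity check for anyone reading their traffic graphs, those sustained rates translate to roughly 55 GB/day in and 65 GB/day out (assuming BPS means bits per second, as the Mbps figures suggest):

```shell
# Convert the expected sustained rates (bits/second) to bytes/day.
# 5,100,000 bps inbound and 6,000,000 bps outbound; 86400 s/day, 8 bits/byte.
echo "inbound  bytes/day: $(( 5100000 * 86400 / 8 ))"
echo "outbound bytes/day: $(( 6000000 * 86400 / 8 ))"
# inbound  bytes/day: 55080000000   (~55 GB)
# outbound bytes/day: 64800000000   (~65 GB)
```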

Traffic Analysis:
The screenshots below show the monthly view and the daily views before and after the upgrade.

Attempted Fixes:
I have attempted to limit the P2P connections and adjust the send_rate and recv_rate parameters in the Cosmos SDK configuration. Despite these efforts, the issue persists.
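For reference, the knobs in question live in the [p2p] section of config.toml. The values below are the stock Tendermint defaults, shown for orientation only; lowering them did not resolve the issue in my testing:

```toml
# config.toml -- [p2p] section. send_rate/recv_rate are in bytes/second.
# These are the Tendermint defaults, shown for illustration, not as a fix.
[p2p]
max_num_inbound_peers = 40
max_num_outbound_peers = 10
send_rate = 5120000
recv_rate = 5120000
```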

Request:
Please investigate this issue further and push a fix to stop the irregular bandwidth consumption.

Recommendation:
Anyone running an Akash node should check their bandwidth consumption and traffic to ensure they are not affected by this issue. Please create a point release for v36.0 that stops the excessive bandwidth consumption.

Monthly Traffic View:

(screenshot)

Before Upgrade Daily:

(screenshot)

After Upgrade Daily:

(screenshot)

Additional Context:
This issue is critical as it affects the performance and reliability of the nodes across various datacenters and residential networks. Immediate attention and resolution are required.

chainzero commented 1 week ago

@88plug - could you please confirm:

1). Are these nodes being built via the Akash Helm Charts? Asking because the Helm Charts set minimum_gas_prices: 0.025uakt, and I want to ensure this setting is in place on the affected nodes.

2). During node startup, are there any log entries regarding zero gas prices?
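For affected operators, a quick way to answer question 1 (the path assumes the default ~/.akash home directory; adjust AKASH_HOME if yours differs):

```shell
# Print the minimum-gas-prices line from app.toml, or a warning if absent.
# The path is an assumption based on the default Akash node home directory.
CONFIG="${AKASH_HOME:-$HOME/.akash}/config/app.toml"
grep -H '^minimum-gas-prices' "$CONFIG" 2>/dev/null \
  || echo "minimum-gas-prices not found in $CONFIG"
```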

chainzero commented 1 week ago

Review from additional node operator impacted by increased P2P traffic:

Jun 21 19:46:38 mainnet-node start-node.sh[32110]: ERR failed to process message err="error adding vote" height=16846721 module=consensus msg_type=*consensus.VoteMessage peer=2a3ba81a7ddb00016af1593f925aed390c4bcca9 round=0
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: INF failed attempting to add vote err="expected 16846720/1/2, but got 16846720/0/2: unexpected step" module=consensus … (log truncated)
c29r3 commented 1 week ago

Review from additional node operator impacted by increased P2P traffic:

  • Nodes built via both the CLI and the Helm Chart are experiencing heightened traffic
  • The CLI node build has the minimum_gas_prices: 0.025uakt setting in app.toml
  • The Helm Chart default values were not changed and thus should have minimum_gas_prices: 0.025uakt
  • The bandwidth is NOT increasing further over time; i.e., the P2P bandwidth rose considerably a few days ago and has been steady at that level since.
  • There is no evidence of 0 gas fees in the node logs, but the logs are littered with "failed to add vote" errors such as:
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: ERR failed to process message err="error adding vote" height=16846721 module=consensus msg_type=*consensus.VoteMessage peer=2a3ba81a7ddb00016af1593f925aed390c4bcca9 round=0
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: INF failed attempting to add vote err="expected 16846720/1/2, but got 16846720/0/2: unexpected step" module=consensus … (log truncated)

It seems that this issue is observed in other networks as well, for example, in the Sentinel Network

(screenshot)

https://x.com/zeroservices_eu/status/1784553362316288174

I'm not sure exactly how this problem arises, but it seems to spread through specific peers (full node/RPC).

(screenshots)

grep "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/config.toml

(screenshot)

grep -A 2 -B 2 "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/addrbook.json

(screenshot)

I'm not sure why this P2P address appears in the address book 123 times 😳

grep "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/addrbook.json | wc -l
123

addrbook.json
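To check whether any other peer IDs are duplicated, the grep above generalizes to a per-ID count. The snippet below runs against a tiny synthetic addrbook.json so it is self-contained; point ADDRBOOK at .akash/config/addrbook.json for a real check (the "id" field layout is an assumption based on the Tendermint address book format):

```shell
# Count occurrences of each peer id in the address book; any count > 1
# indicates a duplicated entry. Uses a synthetic demo file for illustration.
ADDRBOOK="$(mktemp)"
cat > "$ADDRBOOK" <<'EOF'
{"addrs":[{"addr":{"id":"00a39ac3","ip":"1.2.3.4","port":26656}},
          {"addr":{"id":"00a39ac3","ip":"5.6.7.8","port":26656}},
          {"addr":{"id":"deadbeef","ip":"9.9.9.9","port":26656}}]}
EOF
grep -o '"id":"[0-9a-f]*"' "$ADDRBOOK" | sort | uniq -c | sort -rn
rm -f "$ADDRBOOK"
```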

troian commented 1 week ago

@c29r3 can you backup your addrbook and try one from polkachu

c29r3 commented 1 week ago

> @c29r3 can you backup your addrbook and try one from polkachu

Done, but the err="error adding vote" entries still appear.

Here is the traffic for the last 48 hours

(screenshot)

c29r3 commented 1 week ago

I enabled the --log_level debug mode and saved logs for the last 20 minutes from my RPC node

sudo journalctl -u akash.service --no-hostname --since "20 minutes ago" | grep -v p2p > akash_20min_log.txt

https://snapshots.c29r3.xyz/akash/akash_20min_log_debug.zip

88plug commented 1 week ago

Fixes excessive bandwidth #285

I did a battery of tests over the weekend and was able to resolve the issue.

The issue appears to be that p2p seed_mode is set to true for the node in the Helm charts.

The Cosmos default is pex = true and seed_mode = false.

I have updated the Helm charts and tested with seed_mode disabled; the excessive bandwidth issue is resolved.
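The corrected [p2p] settings, matching the Cosmos defaults:

```toml
# config.toml -- [p2p] section. A seed node crawls the network and serves
# addresses to many peers, which is what drove the extra bandwidth; a
# regular full node should keep seed_mode off and peer exchange (pex) on.
[p2p]
pex = true
seed_mode = false
```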

For reference, in my testing I also found that "error adding vote" still shows with the 0.025uakt fee. So that may indicate some other issue, but it was not related to the bandwidth.

chainzero commented 1 week ago

Issue was caused by IBC relayers allowing zero/very low gas TXs onto the network and into the mempool. While Akash RPCs/validators are universally configured to reject zero-gas TXs, a number of IBC relayers were not configured to reject them.

Issue was resolved by:

1). Specific validators intentionally set their minimum gas requirement to zero to allow these TXs to be written to the chain, thus cleansing the validator mempools of such TXs.

2). We worked with current IBC relayers to ensure they enforce minimum gas settings.

Network P2P traffic is now normalized.
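For relayer operators, the corresponding setting in a Hermes config looks like the fragment below (illustrative; the chain id and price are assumptions based on Akash mainnet defaults, so verify against your relayer's documentation):

```toml
# Hermes config.toml -- per-chain gas price. A price of 0 here is what
# let zero-fee IBC TXs into the network's mempools.
[[chains]]
id = "akashnet-2"
gas_price = { price = 0.025, denom = "uakt" }
```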

troian commented 6 days ago

@Krewedk0 It's not quite correct.

  1. All validators had, and have, minimum gas fees > 0.
  2. There were a few relayers using the default configuration from the chain registry, which had minimum gas fees set to 0.
  3. Due to the IBC v4 design, transactions coming via IBC with zero or very small gas fees could enter the mempool and get stuck there forever, because:
    • all validators have gas fees set to the correct level
    • tx recheck in Cosmos SDK v0.45.x does not work correctly; even though transactions were expired, they still stayed in the mempool.

Krewedk0 commented 5 days ago

@troian Deleted my last comment to not give bad people good ideas. But I ran some tests last night, and you can actually do very nasty stuff with the setup I mentioned. Also, Chandra Station and 16psyche still have 0 min gas prices set on their validator nodes.