akash-network / support

Akash Support and Issue Tracking

Excessive Bandwidth Consumption Post v36.0 Node Upgrade #231

Closed 88plug closed 1 week ago

88plug commented 1 week ago

Description:
Since upgrading to Akash node release v36.0, nodes have been consuming an unusually high amount of bandwidth, far exceeding the previous usage. This has resulted in "out of bandwidth" notifications across multiple nodes in various datacenters, as well as noticeable lag on residential networks. No changes were made to the default deployment code.

To Reproduce:

  1. Deploy an Akash node using the default deployment code.
  2. Monitor the bandwidth usage of the node.

Expected Behavior:
The node should sustain approximately 5,100,000 bps (5.1 Mbps) of incoming and 6,000,000 bps (6 Mbps) of outgoing traffic, consistent with pre-upgrade usage.
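As a sanity check for anyone reading their traffic graphs, those sustained rates translate to roughly 55 GB/day in and 65 GB/day out (assuming BPS means bits per second, as the Mbps figures suggest):

```shell
# Convert the expected sustained rates (bits/second) to bytes/day.
# 5,100,000 bps inbound and 6,000,000 bps outbound; 86400 s/day, 8 bits/byte.
echo "inbound  bytes/day: $(( 5100000 * 86400 / 8 ))"
echo "outbound bytes/day: $(( 6000000 * 86400 / 8 ))"
# inbound  bytes/day: 55080000000   (~55 GB)
# outbound bytes/day: 64800000000   (~65 GB)
```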

Traffic Analysis:
The screenshots below show the monthly view and the daily views before and after the upgrade.

Attempted Fixes:
I have attempted to limit the P2P connections and adjust the send_rate and recv_rate parameters in the Cosmos SDK configuration. Despite these efforts, the issue persists.
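For reference, the knobs in question live in the [p2p] section of config.toml. The values below are the stock Tendermint defaults, shown for orientation only; lowering them did not resolve the issue in my testing:

```toml
# config.toml -- [p2p] section. send_rate/recv_rate are in bytes/second.
# These are the Tendermint defaults, shown for illustration, not as a fix.
[p2p]
max_num_inbound_peers = 40
max_num_outbound_peers = 10
send_rate = 5120000
recv_rate = 5120000
```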

Request:
Please investigate this issue further and push a fix to stop the irregular bandwidth consumption.

Recommendation:
Anyone running an Akash node should check their bandwidth consumption and traffic to ensure they are not affected by this issue. Please create a point release for v36.0 that stops the excessive bandwidth consumption.

Monthly Traffic View:

(screenshot)

Before Upgrade Daily:

(screenshot)

After Upgrade Daily:

(screenshot)

Additional Context:
This issue is critical as it affects the performance and reliability of the nodes across various datacenters and residential networks. Immediate attention and resolution are required.

chainzero commented 1 week ago

@88plug - could you please confirm:

1). Are these nodes being built via the Akash Helm Charts? Asking because the Helm Charts set minimum_gas_prices: 0.025uakt, and I want to ensure this setting is in place on the affected nodes.

2). During node startup, are there any log entries regarding zero gas prices?
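For affected operators, a quick way to answer question 1 (the path assumes the default ~/.akash home directory; adjust AKASH_HOME if yours differs):

```shell
# Print the minimum-gas-prices line from app.toml, or a warning if absent.
# The path is an assumption based on the default Akash node home directory.
CONFIG="${AKASH_HOME:-$HOME/.akash}/config/app.toml"
grep -H '^minimum-gas-prices' "$CONFIG" 2>/dev/null \
  || echo "minimum-gas-prices not found in $CONFIG"
```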

chainzero commented 1 week ago

Review from additional node operator impacted by increased P2P traffic:

Jun 21 19:46:38 mainnet-node start-node.sh[32110]: ERR failed to process message err="error adding vote" height=16846721 module=consensus msg_type=*consensus.VoteMessage peer=2a3ba81a7ddb00016af1593f925aed390c4bcca9 round=0
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: INF failed attempting to add vote err="expected 16846720/1/2, but got 16846720/0/2: unexpected step" module=consensus … (log truncated)
c29r3 commented 1 week ago

Review from additional node operator impacted by increased P2P traffic:

  • Nodes built via both the CLI and the Helm Chart are experiencing heightened traffic
  • The CLI node build has the minimum_gas_prices: 0.025uakt setting in app.toml
  • The Helm Chart default values were not changed and thus should have minimum_gas_prices: 0.025uakt
  • The bandwidth is NOT increasing further over time; i.e., the P2P bandwidth rose considerably a few days ago and has been steady at that level since.
  • There is no evidence of 0 gas fees in the node logs, but the logs are littered with "failed to add vote" errors such as:
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: ERR failed to process message err="error adding vote" height=16846721 module=consensus msg_type=*consensus.VoteMessage peer=2a3ba81a7ddb00016af1593f925aed390c4bcca9 round=0
Jun 21 19:46:38 mainnet-node start-node.sh[32110]: INF failed attempting to add vote err="expected 16846720/1/2, but got 16846720/0/2: unexpected step" module=consensus … (log truncated)

It seems that this issue is observed in other networks as well, for example, in the Sentinel Network

(screenshot)

https://x.com/zeroservices_eu/status/1784553362316288174

I'm not sure exactly how this problem arises, but it seems to spread through specific peers (full node/RPC).

(screenshots)

grep "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/config.toml

(screenshot)

grep -A 2 -B 2 "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/addrbook.json

(screenshot)

I'm not sure why this P2P address appears in the address book 123 times 😳

grep "00a39ac3ec012ffa3116a162c17f49df484d0298" .akash/config/addrbook.json | wc -l
123

addrbook.json
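To check whether any other peer IDs are duplicated, the grep above generalizes to a per-ID count. The snippet below runs against a tiny synthetic addrbook.json so it is self-contained; point ADDRBOOK at .akash/config/addrbook.json for a real check (the "id" field layout is an assumption based on the Tendermint address book format):

```shell
# Count occurrences of each peer id in the address book; any count > 1
# indicates a duplicated entry. Uses a synthetic demo file for illustration.
ADDRBOOK="$(mktemp)"
cat > "$ADDRBOOK" <<'EOF'
{"addrs":[{"addr":{"id":"00a39ac3","ip":"1.2.3.4","port":26656}},
          {"addr":{"id":"00a39ac3","ip":"5.6.7.8","port":26656}},
          {"addr":{"id":"deadbeef","ip":"9.9.9.9","port":26656}}]}
EOF
grep -o '"id":"[0-9a-f]*"' "$ADDRBOOK" | sort | uniq -c | sort -rn
rm -f "$ADDRBOOK"
```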

troian commented 1 week ago

@c29r3 can you backup your addrbook and try one from polkachu

c29r3 commented 1 week ago

> @c29r3 can you backup your addrbook and try one from polkachu

Done, but the err="error adding vote" entries still appear.

Here is the traffic for the last 48 hours

(screenshot)

c29r3 commented 1 week ago

I enabled the --log_level debug mode and saved logs for the last 20 minutes from my RPC node

sudo journalctl -u akash.service --no-hostname --since "20 minutes ago" | grep -v p2p > akash_20min_log.txt

https://snapshots.c29r3.xyz/akash/akash_20min_log_debug.zip

88plug commented 1 week ago

Fixes excessive bandwidth #285

I did a battery of tests over the weekend and was able to resolve the issue.

The issue appears to be that p2p seed_mode is set to true for the node in the Helm charts.

The Cosmos default is pex = true and seed_mode = false.

I have updated the Helm charts and tested with seed_mode disabled; the excessive bandwidth issue is resolved.
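The corrected [p2p] settings, matching the Cosmos defaults:

```toml
# config.toml -- [p2p] section. A seed node crawls the network and serves
# addresses to many peers, which is what drove the extra bandwidth; a
# regular full node should keep seed_mode off and peer exchange (pex) on.
[p2p]
pex = true
seed_mode = false
```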

For reference, in my testing I also found that "error adding vote" still shows with the 0.025uakt fee. So that may indicate some other issue, but it was not related to the bandwidth.

chainzero commented 1 week ago

Issue was caused by IBC relayers allowing zero/very low gas TXs onto the network and into the mempool. While Akash RPCs/validators are universally configured to reject zero-gas TXs, a number of IBC relayers were not configured to reject them.

Issue was resolved by:

1). Specific validators intentionally set their minimum gas requirement to zero to allow these TXs to be written to the chain, thus cleansing the validator mempools of such TXs.

2). We worked with current IBC relayers to ensure they enforce minimum gas settings.

Network P2P traffic is now normalized.
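For relayer operators, the corresponding setting in a Hermes config looks like the fragment below (illustrative; the chain id and price are assumptions based on Akash mainnet defaults, so verify against your relayer's documentation):

```toml
# Hermes config.toml -- per-chain gas price. A price of 0 here is what
# let zero-fee IBC TXs into the network's mempools.
[[chains]]
id = "akashnet-2"
gas_price = { price = 0.025, denom = "uakt" }
```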

troian commented 6 days ago

@Krewedk0 It's not quite correct.

  1. All validators had, and have, minimum gas fees > 0.
  2. There were a few relayers using the default configuration from the chain registry, which had minimum gas fees set to 0.
  3. Due to the IBC v4 design, transactions coming via IBC with zero or very small gas fees could enter the mempool and get stuck there forever, because:
    • all validators have gas fees set to the correct level
    • tx recheck in Cosmos SDK v0.45.x does not work correctly; even though transactions were expired, they still stayed in the mempool.

Krewedk0 commented 5 days ago

@troian Deleted my last comment to not give bad people good ideas. But I ran some tests last night, and you can actually do very nasty stuff with the setup I mentioned. Also, Chandra Station and 16psyche still have 0 min gas prices set on their validator nodes.