cosmos / gaia

Cosmos Hub
https://hub.cosmos.network
Apache License 2.0

~80 peers, but config has been long set to 40 + 10 #2347

Closed gaia closed 10 months ago

gaia commented 1 year ago

Summary of Bug

More peers than allowed in config.

Version

9.0.1

Steps to Reproduce

How is this possible? Note the last modification date of config.toml, when the service was last restarted, and how many peers the node should have versus how many it actually has.

$ cat config.toml | grep _num_ && curl -s http://localhost:26657/net_info | jq -r .result.n_peers && /home/ubuntu/bin/gaiad version && systemctl status gaiad | grep 'Active:' && ps aux | grep [g]aiad && ll config.toml
max_num_inbound_peers = 40
max_num_outbound_peers = 10
82
v9.0.1
     Active: active (running) since Sat 2023-04-01 19:39:22 UTC; 2h 58min ago
ubuntu       140  0.0 50.6 9698672 7915472 ?     Ssl  19:39 102:39 /home/ubuntu/bin/gaiad start
-rw-r--r-- 1 ubuntu ubuntu 19K Mar 20 23:38 config.toml

faddat commented 1 year ago

Do we think that this is a Gaia-specific bug, or do we think that this is an issue in Tendermint or Comet?

gaia commented 1 year ago

Noticed it on Juno v14.0.0 but not Juno v14.1.0. Maybe something got fixed upstream

adizere commented 1 year ago

Hi @gaia, how many nodes do you have configured in your unconditional_peer_ids? See spec/p2p:

Unconditional Peers These are IDs of the peers which are allowed to be connected by both inbound or outbound regardless of max_num_inbound_peers or max_num_outbound_peers of user's node reached or not.

adizere commented 1 year ago

Do we think that this is a Gaia-specific bug, or do we think that this is an issue in Tendermint or Comet?

Likely Comet.

gaia commented 1 year ago

Hi @gaia, how many nodes do you have configured in your unconditional_peer_ids? See spec/p2p:

Unconditional Peers These are IDs of the peers which are allowed to be connected by both inbound or outbound regardless of max_num_inbound_peers or max_num_outbound_peers of user's node reached or not.

Good point, and it would be the obvious answer, but I have at most 2 or 3 unconditional peers configured on any chain where I see the issue.

adizere commented 1 year ago

It's possible there's a race condition between ensurePeers and acceptRoutine. We still need to double-check this, and currently it's very difficult because there's no specification, but maybe some of the unit tests could help.

Thanks @gaia for reporting this! Our team's triaging and debugging capacity is still ramping up, but we're looking into it.

@mmulji-ic shall I transfer this issue to the cometbft repo? Or would you like to handle that? I'm quite certain this is not Gaia-specific.

adizere commented 1 year ago

Can also tag it with the cometbft label.

mmulji-ic commented 1 year ago

Hi @adizere, we'd still like to track this issue here. Could you open a new issue in the cometbft repo and then link back to this one? I've added the comet-bft tag.

cason commented 1 year ago

The most likely reason is the one described here: https://github.com/cometbft/cometbft/issues/486

cason commented 1 year ago

In short, when a node is short of peer addresses it dials the configured seed nodes. When receiving addresses back from a seed, the node immediately starts dialing the provided addresses. This "fast dialing" execution flow disregards the maximum outbound peers configuration flag.

To confirm this hypothesis, are the inbound or outbound peers exceeding the maximum configured bounds?
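
The hypothesis above can be pictured with a toy model. This is only an illustrative sketch in Python, not CometBFT code: the class and method names (`ToyNode`, `dial_checked`, `on_seed_addresses`) are hypothetical, and the real logic is whatever the linked issue describes. It assumes one dialing path that checks the outbound limit and one seed-response path that does not.

```python
class ToyNode:
    """Toy model of a node with two dialing paths (hypothetical, for illustration)."""

    def __init__(self, max_outbound):
        self.max_outbound = max_outbound  # stands in for max_num_outbound_peers
        self.outbound = set()

    def dial_checked(self, addr):
        """Regular path: dial only while below the outbound limit."""
        if len(self.outbound) >= self.max_outbound:
            return False
        self.outbound.add(addr)
        return True

    def on_seed_addresses(self, addrs):
        """Assumed "fast dialing" path: dial every address a seed returns,
        without consulting max_outbound."""
        for addr in addrs:
            self.outbound.add(addr)


node = ToyNode(max_outbound=10)
for i in range(20):
    node.dial_checked(f"peer-{i}")       # stops adding once 10 are connected
node.on_seed_addresses([f"seed-addr-{i}" for i in range(30)])
print(len(node.outbound))                # exceeds the configured limit of 10
```

If the hypothesis holds, the excess should show up on the outbound side only, which is why the inbound/outbound split matters.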

adizere commented 1 year ago

To confirm this hypothesis, are the inbound or outbound peers exceeding the maximum configured bounds?

@cason I'm not sure there is a way to distinguish from JSON/RPC /net_info calls between inbound or outbound peers. The n_peers result is an aggregate. Or is there another way?

@gaia Can you reproduce the issue, and do you know how to distinguish between inbound and outbound peers?

cason commented 1 year ago

I am not sure, but in the logs this information is printed every 30 seconds at INFO level; see here: https://github.com/cometbft/cometbft/blob/main/p2p/pex/pex_reactor.go#L457

gaia commented 1 year ago

I am not able to reproduce the issue currently. The max peers limit is being respected across several different clients, not just gaiad. Maybe it's because the nodes have been running for a while and are no longer dialing? I will check again later.

cat config.toml | egrep '_inbound|_outbound'
curl -s http://localhost:26657/net_info | jq .result.peers[].is_outbound | grep false | wc -l
curl -s http://localhost:26657/net_info | jq .result.peers[].is_outbound | grep true | wc -l
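
The same inbound/outbound split can be taken from a single /net_info response, so both counts come from the same snapshot. A minimal sketch in Python, assuming the response shape used by the commands above (`result.peers[].is_outbound`). The helper names are made up for illustration, and the comparison does not account for unconditional peers, which are allowed to exceed the limits.

```python
import json
from urllib.request import urlopen


def count_peers(net_info):
    """Return (inbound, outbound) counts from a parsed /net_info response."""
    peers = net_info["result"]["peers"]
    outbound = sum(1 for p in peers if p["is_outbound"])
    return len(peers) - outbound, outbound


def exceeds_limits(net_info, max_inbound, max_outbound):
    """Compare the counts with the configured limits (unconditional peers ignored)."""
    inbound, outbound = count_peers(net_info)
    return {"inbound": inbound > max_inbound, "outbound": outbound > max_outbound}


def fetch_net_info(url="http://localhost:26657/net_info"):
    """Fetch and parse /net_info from a locally running node (URL as in the report)."""
    with urlopen(url) as resp:
        return json.load(resp)
```

With a node running locally, `exceeds_limits(fetch_net_info(), 40, 10)` would report which side, if any, is over the limits from the config shown in the report.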

cason commented 1 year ago

Maybe it's because they've been running for a while and it's no longer dialing?

The situation I mentioned above only happens when the node dials a seed node. A node only dials a seed node when it is short of addresses. This can happen when the node is fresh, has no addresses at all in its address book, and did not manage to retrieve enough addresses from its initial peers (e.g. persistent peers).

gaia commented 10 months ago

Happening now on Osmosis v18.0.0. Moving the discussion to https://github.com/cometbft/cometbft/issues/486