ethersphere / bee

Bee is a Swarm client implemented in Go. It's the basic building block for the Swarm network: a private, decentralized, and self-sustaining network for permissionless publishing and access to your (application) data.
https://www.ethswarm.org
BSD 3-Clause "New" or "Revised" License

salud IsHealthy using wrong radius #4697

Open ldeffenb opened 1 month ago

ldeffenb commented 1 month ago

Context

v2.1.0 (and earlier)

Summary

Several of my sepolia testnet nodes are not participating in the storage compensation rounds. All of these nodes have storage radius 4 while the remainder of the swarm has increased to radius 5. Radius 4 is CORRECT for these lesser-populated neighborhoods.

The nodes are logging:

"time"="2024-05-29 07:22:14.260110" "level"="info" "logger"="node/storageincentives" "msg"="skipping round because node is unhealhy" "round"=39473

and

"time"="2024-05-29 07:56:05.200700" "level"="warning" "logger"="node/salud" "msg"="node is unhealthy due to storage radius discrepency" "self_radius"=4 "network_radius"=5

Expected behavior

If a node has the same radius as its neighborhood peers, then it should be considered healthy, regardless of what the radius is in other neighborhoods.

Actual behavior

Because other neighborhoods in the swarm have increased to radius 5, the nodes in the lesser-populated radius-4 neighborhoods are flagged unhealthy and skip the storage compensation rounds.

Steps to reproduce

Just fire up a node in one of the lesser-populated, radius 4 sepolia testnet neighborhoods. Specifically (at this point in time): 0x480, 0xb80, 0xc80, 0xd00, 0xdef, 0xe80

Possible solution

Base the health check on a radius calculated from the node's own neighborhood rather than from the overall swarm, whose radius may differ.
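A minimal sketch of the idea, using the fields reported by /status/peers; the type names and the proximity threshold are illustrative assumptions, not the actual salud internals:

```go
package main

import "fmt"

// peerStatus mirrors the fields from /status/peers that matter here.
// Illustrative names only, not the actual salud/status types.
type peerStatus struct {
	proximity     uint8  // proximity order between this node and the peer
	storageRadius uint32 // the peer's own storage radius
}

// neighborhoodRadius returns the most common storage radius among the peers
// inside this node's own neighborhood (proximity >= selfRadius), i.e. the
// peers that actually share responsibility for its chunks. It falls back to
// selfRadius when no such peers are connected.
func neighborhoodRadius(selfRadius uint32, peers []peerStatus) uint32 {
	counts := map[uint32]int{}
	for _, p := range peers {
		if uint32(p.proximity) >= selfRadius {
			counts[p.storageRadius]++
		}
	}
	best, bestCount := selfRadius, 0
	for r, c := range counts {
		if c > bestCount {
			best, bestCount = r, c
		}
	}
	return best
}

func main() {
	// A radius-4 neighborhood in a mostly radius-5 swarm: the in-hood
	// peers agree on 4, so the node is judged against 4, not 5.
	peers := []peerStatus{
		{proximity: 5, storageRadius: 4},
		{proximity: 4, storageRadius: 4},
		{proximity: 1, storageRadius: 5}, // outside the neighborhood, ignored
		{proximity: 0, storageRadius: 5}, // outside the neighborhood, ignored
	}
	fmt.Println(neighborhoodRadius(4, peers)) // prints 4
}
```

The point is that the radius vote only counts peers that share the node's neighborhood, so a radius-4 neighborhood is judged against 4 even when the rest of the swarm sits at 5.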

Here's the /status/peers output of one of the affected nodes: 4635-status-peers.txt

ldeffenb commented 1 month ago

Here is the /status output for each of the radius 4 nodes/neighborhoods:

  "peer": "480...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4193860,
  "reserveSizeWithinRadius": 3426466,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "b80...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4193515,
  "reserveSizeWithinRadius": 3430038,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "c80...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4176960,
  "reserveSizeWithinRadius": 3810215,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "d00...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4093335,
  "reserveSizeWithinRadius": 4043948,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 2,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "def...
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4159288,
  "reserveSizeWithinRadius": 4043952,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 37,
  "neighborhoodSize": 2,
  "batchCommitment": 2757492736,
  "isReachable": true

  "peer": "e80...",
  "proximity": 0,
  "beeMode": "full",
  "reserveSize": 4181723,
  "reserveSizeWithinRadius": 3733244,
  "pullsyncRate": 0,
  "storageRadius": 4,
  "connectedPeers": 41,
  "neighborhoodSize": 0,
  "batchCommitment": 2757492736,
  "isReachable": true

If you compare those reserveSizeWithinRadius values to the radius-5 nodes in the attached /status/peers file, you'll notice that the radius-4 nodes have almost full reserves while the radius-5 nodes are only about half full, consistent with a recent radius increase that didn't land uniformly across the swarm.
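To put rough numbers on that, assuming the default reserve capacity of 2^22 = 4,194,304 chunks: the radius-4 nodes above hold between 3,426,466 (0x480, ~82%) and 4,043,952 (0xdef, ~96%) chunks within radius, while a node that just moved to radius 5 is responsible for only half of its former neighborhood, i.e. roughly 4,194,304 / 2 ≈ 2,097,152 chunks within radius, which matches the roughly half-full radius-5 nodes in the attachment.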

If this can happen on the sepolia testnet and persist for several days (as it did, until I noticed), then it can certainly happen on mainnet and go unnoticed across 1,024, 2,048, or even 4,096 neighborhoods.
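For scale, the neighborhood count doubles with each radius increment: 2^10 = 1,024, 2^11 = 2,048, and 2^12 = 4,096 neighborhoods at radius 10, 11, and 12 respectively, so manually spotting a single lagging neighborhood only gets harder as the radius grows.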

ldeffenb commented 1 month ago

Interestingly, salud allows a peer's radius to be one less than the network radius (scroll right to see the -1):

https://github.com/ethersphere/bee/blob/97e7ee699be3b4325a233b1ca2dc177cd88f17e1/pkg/salud/salud.go#L203

but requires the node itself to match the network radius exactly:

https://github.com/ethersphere/bee/blob/97e7ee699be3b4325a233b1ca2dc177cd88f17e1/pkg/salud/salud.go#L225
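In other words, paraphrasing the two linked checks with illustrative names (not the literal salud.go code): a peer at radius 4 against a network radius of 5 passes, but the node itself at the same radius 4 fails. Extending the same one-less tolerance to the self check, or using the neighborhood-based radius suggested above, would let these nodes pass:

```go
package main

import "fmt"

func main() {
	const networkRadius, selfRadius uint32 = 5, 4

	// Paraphrase of the current behavior, not the literal salud.go code.
	peerHealthy := selfRadius >= networkRadius-1 // a peer at radius 4 passes
	selfHealthy := selfRadius == networkRadius   // the node itself must match exactly

	fmt.Println(peerHealthy, selfHealthy) // true false

	// One possible fix: give the self check the same one-less tolerance.
	selfHealthy = selfRadius >= networkRadius-1
	fmt.Println(selfHealthy) // true
}
```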