Constellation-Labs / constellation

:milky_way::satellite: Decentralized Application Integration Platform
Apache License 2.0

Unable to create Snapshot when joining and gossip is enabled <=> configurable peer health check #1521

Open buckysballs opened 3 years ago

buckysballs commented 3 years ago

We've noticed on testnet that, with the new gossip implementation enabled, a joining node is sometimes unable to begin making snapshots. Regardless of the root cause, we need the L0 majority state selection process to be able to detect when it's stuck. We can extend the current health check to accept custom logic for deciding whether or not to remove a node. An example of such logic could be: if a peer repeatedly fails to make snapshots over a given height interval, the other peers initiate a peer health check round that either propagates the missing proposal or asks the potentially dead peer for it. We can essentially perform this check "per height" and gather unresponsive peers after a given height interval. This prevents loops caused by redownloading to a height below the current pending majority height.
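To make the "per height" idea concrete, here is a minimal sketch in Python rather than the project's own code. Everything in it is hypothetical: `ProposalTracker`, `height_interval`, and the method names are made up for illustration, and it assumes heights are consecutive integers starting at 1. It only shows how unresponsive peers could be gathered once they have missed proposals for a configurable number of heights; the follow-up health check round itself is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ProposalTracker:
    """Hypothetical tracker of which peers delivered a proposal at each height."""

    peers: Set[str]
    height_interval: int  # assumed config: consecutive missed heights before a check
    received: Dict[int, Set[str]] = field(default_factory=dict)

    def record_proposal(self, height: int, peer: str) -> None:
        # Remember that `peer` sent a snapshot proposal for `height`.
        self.received.setdefault(height, set()).add(peer)

    def missed_heights(self, peer: str, current_height: int) -> int:
        # Count consecutive heights, walking down from current_height,
        # for which no proposal from `peer` has been seen.
        missed = 0
        height = current_height
        while height > 0 and peer not in self.received.get(height, set()):
            missed += 1
            height -= 1
        return missed

    def unresponsive_peers(self, current_height: int) -> Set[str]:
        # Peers that missed at least `height_interval` heights in a row are
        # candidates for a peer health check round.
        return {
            peer
            for peer in self.peers
            if self.missed_heights(peer, current_height) >= self.height_interval
        }


# Usage: peer "c" stops proposing after height 2.
tracker = ProposalTracker(peers={"a", "b", "c"}, height_interval=3)
for h in range(1, 6):
    tracker.record_proposal(h, "a")
    tracker.record_proposal(h, "b")
    if h <= 2:
        tracker.record_proposal(h, "c")

candidates = tracker.unresponsive_peers(current_height=5)
```

In this run `candidates` contains only `"c"`, since it missed heights 3, 4 and 5. Keeping the threshold as a parameter is what would make the health check configurable; the right value, and what happens to a flagged peer, are open questions for the actual implementation.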