dragonflyoss / Dragonfly2

Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
https://d7y.io
Apache License 2.0

Scheduler detects peer survivability and clears offline peer metadata. #3350

Closed BruceAko closed 1 day ago

BruceAko commented 1 week ago

Description

When a peer goes offline normally, the scheduler will be notified through the RPC interface to clear the offline peer metadata. However, if the peer exits abnormally, the scheduler cannot clear the peer's metadata in time, which may cause scheduling failure.

We need to support a P2P scenario in which the scheduler proactively checks whether peer nodes are still alive. The scheduler can then clear the metadata of an offline peer even if the peer exited abnormally without notifying the scheduler.

Solution

  1. The hostTTL is now passed from dfdaemon to the scheduler via announce. The hostTTL in the gc section of the scheduler configuration is not removed; it is kept as fallback logic to maintain backward compatibility. That is, dfdaemon adds an attribute, interval, to the host information it periodically announces to the scheduler. The hostTTL is calculated from the announce (broadcast) interval schedulerInterval as twice that interval.
  2. Decrease the default gc interval in the scheduler's HostManager configuration (6h -> 5m) in order to detect abnormal host exits in time.
  3. Add logic to HostManager's RunGC(): check all hosts and call LeaveHost() for a host if the time since its last update exceeds twice its announce interval. Keep the existing HostTTL logic in PeerManager's RunGC() as a fallback.
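
The three steps above can be sketched roughly as follows. This is a minimal, in-memory illustration, not the scheduler's actual code: the `Host` and `HostManager` types, the `Announce` method, the `runExample` helper, and the strict `>` comparison are all assumptions made for the example. In the real scheduler, the point where a host is deleted is where LeaveHost() would be called.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Host is a simplified, hypothetical stand-in for the scheduler's host record.
type Host struct {
	ID               string
	AnnounceInterval time.Duration // interval reported by dfdaemon in each announce
	UpdatedAt        time.Time     // time of the most recent announce
}

// TTL is twice the announce interval, as proposed in step 1.
func (h *Host) TTL() time.Duration { return 2 * h.AnnounceInterval }

// HostManager is a hypothetical in-memory host store.
type HostManager struct {
	mu    sync.Mutex
	hosts map[string]*Host
}

func NewHostManager() *HostManager {
	return &HostManager{hosts: map[string]*Host{}}
}

// Announce records or refreshes a host together with its announce interval.
func (m *HostManager) Announce(id string, interval time.Duration, now time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.hosts[id] = &Host{ID: id, AnnounceInterval: interval, UpdatedAt: now}
}

// RunGC removes every host whose last announce is older than its TTL and
// returns the removed IDs in sorted order (step 3).
func (m *HostManager) RunGC(now time.Time) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	var removed []string
	for id, h := range m.hosts {
		if now.Sub(h.UpdatedAt) > h.TTL() { // assumed strict comparison
			delete(m.hosts, id) // the real scheduler would call LeaveHost() here
			removed = append(removed, id)
		}
	}
	sort.Strings(removed)
	return removed
}

// runExample simulates two hosts: host-a keeps announcing every 300s,
// host-b exits abnormally right after its first announce.
func runExample() []string {
	start := time.Unix(0, 0)
	interval := 300 * time.Second
	m := NewHostManager()
	m.Announce("host-a", interval, start)
	m.Announce("host-b", interval, start)
	m.Announce("host-a", interval, start.Add(300*time.Second))
	m.Announce("host-a", interval, start.Add(600*time.Second))
	return m.RunGC(start.Add(601 * time.Second))
}

func main() {
	fmt.Println(runExample()) // prints [host-b]
}
```

Passing `now` explicitly instead of calling `time.Now()` inside the manager is a choice made only to keep the sketch deterministic.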

Example

The announce (broadcast) interval configured in dfdaemon is 300 seconds, so with every announce dfdaemon tells the scheduler this interval, and the scheduler records the host's hostTTL as 600 seconds. By default, HostManager runs GC every 300 seconds. On each run it checks whether more than 600 seconds have passed since the host was last updated. (In theory the host is updated every 300 seconds; doubling the interval leaves a margin for network latency.) If more than 600 seconds have passed, the host is considered to have exited abnormally and its metadata is cleared.
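
To make the timing in this example concrete, the sketch below computes when a host that has stopped announcing would actually be cleared, depending on how the GC ticks line up with its last announce. The function names `clearedAt` and `exampleBounds`, the strict "more than TTL" comparison, and the one-second granularity of the best case are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// clearedAt returns how long after the host's last announce the GC tick
// fires that clears it. GC ticks are assumed to fire at phase,
// phase+gcInterval, phase+2*gcInterval, ... measured from the last announce.
// It assumes gcInterval > 0 and phase >= 0.
func clearedAt(ttl, gcInterval, phase time.Duration) time.Duration {
	tick := phase
	for tick <= ttl { // host is cleared only when elapsed time exceeds the TTL
		tick += gcInterval
	}
	return tick
}

// exampleBounds uses the numbers from the example: a 300s announce interval,
// a 600s hostTTL, and a 300s GC interval.
func exampleBounds() (earliest, latest time.Duration) {
	announce := 300 * time.Second
	ttl := 2 * announce
	gc := 300 * time.Second
	earliest = clearedAt(ttl, gc, 1*time.Second) // a GC tick lands just after the TTL expires
	latest = clearedAt(ttl, gc, gc)              // a GC tick lands exactly on the TTL and misses it
	return earliest, latest
}

func main() {
	earliest, latest := exampleBounds()
	fmt.Println(earliest, latest) // prints 10m1s 15m0s
}
```

Under these assumptions, an abnormally exited host is cleared between roughly 600 and 900 seconds after its last announce, that is, within the hostTTL plus at most one GC interval.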