dragonflyoss / Dragonfly2

Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
https://d7y.io
Apache License 2.0

Scheduler Restart Causes dfdaemon and seed-peer to Report resolve Errors #3297

Closed yantingqiu closed 1 month ago

yantingqiu commented 1 month ago

Bug report:

When the scheduler restarts, the dfdaemon and seed-peer will report errors saying they can't resolve the scheduler. At the same time, the scheduler will also report errors saying it can't resolve the seed-peer.

In the database, the deleted scheduler remains in an active state, while the newly created scheduler stays in an inactive state.

Here are the relevant logs:

  1. scheduler log:
    scheduler.log

  2. dfdaemon log:
    dfdaemon.log

  3. seed-peer log:
    seed-peer.log

  4. manager log:
    manager.log

Expected behavior:

All components (scheduler, dfdaemon, and seed-peer) should be able to resolve each other correctly and operate without errors after the scheduler restarts.

How to reproduce it:

  1. Deploy Dragonfly2 using Helm
  2. Restart the scheduler Pod

Environment:

gaius-qi commented 1 month ago

@yantingqiu Don't use the same hostname.

yantingqiu commented 1 month ago

> @yantingqiu Don't use the same hostname.

@gaius-qi When deploying Dragonfly2 in a containerized environment using StatefulSet, the Pods will retain their original hostnames after restarting, which seems difficult to change. Do you have any suggestions?

yantingqiu commented 1 month ago

Additionally, the Pods that are killed should be marked as inactive, but the database continuously updates them to active.
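
The symptom described in this thread (the killed instance staying active while its replacement stays inactive) can be reproduced in a toy model. The thread does not confirm the root cause, so the sketch below shows only one hypothetical way it could arise: a keepalive that matches rows by hostname alone. The `Record` and `Registry` types, the `scheduler-0` hostname, and the IP addresses are all assumptions made for illustration; none of this is Dragonfly's actual code or schema.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for a scheduler row in a database.
// It is not Dragonfly's real data model.
type Record struct {
	Hostname string
	IP       string
	Active   bool
}

// Registry is a toy in-memory table of scheduler rows.
type Registry struct {
	rows []*Record
}

// Register appends a row for a newly started instance; it starts inactive.
func (r *Registry) Register(hostname, ip string) *Record {
	rec := &Record{Hostname: hostname, IP: ip}
	r.rows = append(r.rows, rec)
	return rec
}

// KeepAliveByHostname marks the first row with a matching hostname as active.
// Because the IP is ignored, a restarted Pod that keeps its hostname
// refreshes the stale row instead of its own.
func (r *Registry) KeepAliveByHostname(hostname string) *Record {
	for _, rec := range r.rows {
		if rec.Hostname == hostname {
			rec.Active = true
			return rec
		}
	}
	return nil
}

// KeepAliveByHostnameAndIP matches on both fields, so only the row that
// belongs to the live instance is refreshed.
func (r *Registry) KeepAliveByHostnameAndIP(hostname, ip string) *Record {
	for _, rec := range r.rows {
		if rec.Hostname == hostname && rec.IP == ip {
			rec.Active = true
			return rec
		}
	}
	return nil
}

func main() {
	reg := &Registry{}

	// Original Pod registers, then is killed and marked inactive.
	old := reg.Register("scheduler-0", "10.0.0.1")
	old.Active = false

	// A StatefulSet-style restart: same hostname, new IP.
	fresh := reg.Register("scheduler-0", "10.0.0.2")

	// Hostname-only keepalive revives the stale row.
	reg.KeepAliveByHostname("scheduler-0")
	fmt.Println("hostname only:", old.Active, fresh.Active)

	// Matching on hostname and IP refreshes the correct row.
	old.Active = false
	reg.KeepAliveByHostnameAndIP("scheduler-0", "10.0.0.2")
	fmt.Println("hostname + IP:", old.Active, fresh.Active)
}
```

In this model the hostname-only lookup leaves the old row active and the new row inactive, which matches the database state reported above. Keying on an additional field that changes across restarts avoids the collision. Whether this corresponds to how the manager actually identifies schedulers would need to be checked against the Dragonfly source.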