Scheduler Restart Causes dfdaemon and seed-peer to Report resolve Errors

yantingqiu commented 1 month ago

Bug report:

When the scheduler restarts, the dfdaemon and seed-peer will report errors saying they can't resolve the scheduler. At the same time, the scheduler will also report errors saying it can't resolve the seed-peer.

In the database, the deleted scheduler remains in an active state, while the newly created scheduler stays in an inactive state.

Here are the relevant logs:

scheduler log: scheduler.log
dfdaemon log: dfdaemon.log
seed-peer log: seed-peer.log
manager log: manager.log

Expected behavior:

All components (scheduler, dfdaemon, and seed-peer) should be able to resolve each other correctly and operate without errors after the scheduler restarts.

How to reproduce it:

1. Deploy Dragonfly2 using Helm.
2. Restart the scheduler Pod.

Environment:

Dragonfly version: 2.1.30
OS:
Kernel (e.g. uname -a):
Others:

gaius-qi commented 1 month ago

@yantingqiu Don't use the same hostname.

yantingqiu commented 1 month ago

> @yantingqiu Don't use the same hostname.

@gaius-qi When deploying Dragonfly2 in a containerized environment using StatefulSet, the Pods will retain their original hostnames after restarting, which seems difficult to change. Do you have any suggestions?

yantingqiu commented 1 month ago

Additionally, the Pods that are killed should be marked as inactive, but the database continuously updates them to active.
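
To make the symptom concrete, here is a minimal sketch of how it could arise. It assumes the keepalive lookup matches scheduler records by hostname alone; that is an assumption for illustration, not something confirmed from the Dragonfly source. `Record`, `Registry`, `Register`, `KeepAliveByHostname`, and `StateOf` are hypothetical names, and the hostname and IPs are made up.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for one scheduler row in the manager database.
type Record struct {
	Hostname string
	IP       string
	State    string // "active" or "inactive"
}

// Registry is a hypothetical in-memory model of the scheduler table.
type Registry struct {
	rows []*Record
}

// Register adds a row for (hostname, ip) if none exists. New rows start inactive.
func (r *Registry) Register(hostname, ip string) {
	for _, row := range r.rows {
		if row.Hostname == hostname && row.IP == ip {
			return
		}
	}
	r.rows = append(r.rows, &Record{Hostname: hostname, IP: ip, State: "inactive"})
}

// KeepAliveByHostname marks the FIRST row with a matching hostname as active.
// Matching on hostname alone is the assumption being illustrated here.
func (r *Registry) KeepAliveByHostname(hostname string) {
	for _, row := range r.rows {
		if row.Hostname == hostname {
			row.State = "active"
			return
		}
	}
}

// StateOf returns the state of the row for (hostname, ip), or "" if there is none.
func (r *Registry) StateOf(hostname, ip string) string {
	for _, row := range r.rows {
		if row.Hostname == hostname && row.IP == ip {
			return row.State
		}
	}
	return ""
}

func main() {
	reg := &Registry{}

	// Original Pod registers and sends a keepalive.
	reg.Register("scheduler-0", "10.0.0.1")
	reg.KeepAliveByHostname("scheduler-0")

	// The Pod restarts: the StatefulSet keeps the hostname, but the IP changes.
	reg.Register("scheduler-0", "10.0.0.2")
	reg.KeepAliveByHostname("scheduler-0")

	fmt.Println("old row:", reg.StateOf("scheduler-0", "10.0.0.1"))
	fmt.Println("new row:", reg.StateOf("scheduler-0", "10.0.0.2"))
}
```

In this model, the keepalive from the restarted Pod keeps refreshing the old row, so the deleted scheduler stays active and the new one stays inactive, which matches what I see in the database. If the lookup matched on hostname and IP together, the two rows would no longer be confused, but whether that applies to the real code is for the maintainers to confirm.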
