Scheduler database contains multiple "active" entries for the same hostname

succa commented 1 month ago

Bug report:

mysql> select * from scheduler where host_name="dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local" order by id;

+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+
| id  | created_at          | updated_at          | is_del | host_name                                                   | idc  | location | ip              | port | state    | features               | scheduler_cluster_id |
+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+
| 102 | 2024-07-11 23:16:59 | 2024-07-24 19:17:41 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.76.106  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 113 | 2024-07-24 19:17:53 | 2024-08-01 17:42:18 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.21.36   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 141 | 2024-08-01 17:42:29 | 2024-08-13 15:07:32 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.82.18   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 157 | 2024-08-13 15:07:38 | 2024-08-19 13:43:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.154.131.24  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 173 | 2024-08-19 13:43:31 | 2024-08-23 00:34:01 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.28.247  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 187 | 2024-08-23 00:34:07 | 2024-08-26 10:34:39 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.156.224.51  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 194 | 2024-08-26 10:35:01 | 2024-09-04 06:43:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.188.97  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 206 | 2024-09-04 06:43:33 | 2024-09-10 23:27:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.154.180.220 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 217 | 2024-09-10 23:28:07 | 2024-09-10 23:28:07 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.63.74   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 218 | 2024-09-10 23:28:26 | 2024-09-13 01:39:23 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.112.122 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 228 | 2024-09-13 01:39:38 | 2024-09-21 02:37:27 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.125.176 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 237 | 2024-09-21 02:37:47 | 2024-09-24 17:47:28 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.159.130.24  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 249 | 2024-09-24 17:47:59 | 2024-09-25 02:29:57 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.96.143  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 254 | 2024-09-25 02:30:10 | 2024-10-03 14:26:41 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.105.34  | 8002 | active   | ["schedule","preheat"] |                    1 |
| 264 | 2024-10-03 17:39:02 | 2024-10-03 17:39:08 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.158.20.78   | 8002 | active   | ["schedule","preheat"] |                    1 |
| 265 | 2024-10-03 18:15:14 | 2024-10-03 20:14:48 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.153.85.210  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 266 | 2024-10-03 20:15:12 | 2024-10-09 19:01:06 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.158.244.225 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 283 | 2024-10-09 19:01:16 | 2024-10-10 09:02:42 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.231.191 | 8002 | active   | ["schedule","preheat"] |                    1 |
+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+

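Note also the strange jump in time between an old entry left in the "active" state by mistake and the entry that follows it (for example, id 254 was last updated at 2024-10-03 14:26:41, but id 264 was not created until 17:39:02). This prevents peers from using this scheduler pod, because they are using the wrong IP.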

Expected behavior:

There should be only one active entry per host_name at any point in time.
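
A quick way to check this invariant is to group the active rows by host name. The sketch below is only an illustration: it uses an in-memory SQLite table as a stand-in for the real MySQL `scheduler` table, keeps only a few of the columns shown above, and the `find_duplicate_active` helper and the `dragonfly-scheduler-8` row are hypothetical.

```python
import sqlite3

HOST_7 = "dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local"
# Hypothetical second host, added only to show a healthy case.
HOST_8 = "dragonfly-scheduler-8.scheduler.dragonfly.svc.cluster.local"


def find_duplicate_active(conn):
    """Return (host_name, scheduler_cluster_id, count) for every host
    that has more than one row in the 'active' state."""
    return conn.execute(
        """
        SELECT host_name, scheduler_cluster_id, COUNT(*) AS active_rows
        FROM scheduler
        WHERE state = 'active'
        GROUP BY host_name, scheduler_cluster_id
        HAVING COUNT(*) > 1
        ORDER BY host_name
        """
    ).fetchall()


# In-memory stand-in for the real table (assumption: reduced column set).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduler ("
    "id INTEGER PRIMARY KEY, host_name TEXT, ip TEXT, "
    "state TEXT, scheduler_cluster_id INTEGER)"
)
conn.executemany(
    "INSERT INTO scheduler VALUES (?, ?, ?, ?, ?)",
    [
        (254, HOST_7, "241.152.105.34", "active", 1),
        (264, HOST_7, "241.158.20.78", "active", 1),
        (266, HOST_7, "241.158.244.225", "inactive", 1),
        (283, HOST_7, "241.155.231.191", "active", 1),
        (300, HOST_8, "10.0.0.8", "active", 1),  # hypothetical row
    ],
)

duplicates = find_duplicate_active(conn)
print(duplicates)  # one entry: HOST_7 in cluster 1 with 3 active rows
```

The same `SELECT` should run unchanged against MySQL; only the connection setup would differ.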

How to reproduce it:

Not able to reproduce. I guess it happens when the scheduler database is being updated.

Environment:

Dragonfly version: 2.1.50

gaius-qi commented 1 month ago

@succa This happens if the scheduler instance is force-deleted. It can also occur if the manager service is unavailable when the scheduler is deleted.

succa commented 1 month ago

@gaius-qi Thanks for the very quick answer! Is there a fix for it? My scheduler pods are not long-lived because of cluster node rotation.

gaius-qi commented 1 month ago

@succa It is necessary to ensure that there are active instances of the manager during the scheduler upgrade process.

succa commented 1 month ago

@gaius-qi I have 10 running instances all the time. I ended up creating a cronjob to clean up the database, but this is something you might want to consider adding to the code directly as a safety check in the manager.
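
For illustration, a cleanup along those lines might look like the sketch below. It is not the actual cronjob: it assumes that the row with the highest `id` for each `(host_name, scheduler_cluster_id)` pair belongs to the live pod, uses an in-memory SQLite table as a stand-in for MySQL, and the `deactivate_stale_rows` helper is hypothetical. On MySQL the `UPDATE` would also need its subquery wrapped in a derived table, since MySQL rejects selecting from the table being updated (error 1093).

```python
import sqlite3

HOST = "dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local"


def deactivate_stale_rows(conn):
    """Mark every 'active' row inactive except the newest one (highest id)
    per (host_name, scheduler_cluster_id). Returns the number of rows changed.

    Assumption: the newest active row is the one backed by a live pod.
    """
    cur = conn.execute(
        """
        UPDATE scheduler
        SET state = 'inactive'
        WHERE state = 'active'
          AND id NOT IN (
              SELECT MAX(id)
              FROM scheduler
              WHERE state = 'active'
              GROUP BY host_name, scheduler_cluster_id
          )
        """
    )
    conn.commit()
    return cur.rowcount


# In-memory stand-in for the real table (assumption: reduced column set).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduler ("
    "id INTEGER PRIMARY KEY, host_name TEXT, ip TEXT, "
    "state TEXT, scheduler_cluster_id INTEGER)"
)
conn.executemany(
    "INSERT INTO scheduler VALUES (?, ?, ?, ?, ?)",
    [
        (254, HOST, "241.152.105.34", "active", 1),
        (264, HOST, "241.158.20.78", "active", 1),
        (266, HOST, "241.158.244.225", "inactive", 1),
        (283, HOST, "241.155.231.191", "active", 1),
    ],
)

changed = deactivate_stale_rows(conn)
remaining = conn.execute(
    "SELECT id FROM scheduler WHERE state = 'active' ORDER BY id"
).fetchall()
print(changed, remaining)
```

Running the function a second time changes nothing, so it is safe to schedule repeatedly. It cannot tell whether the newest row is itself stale, so a real safety check would also need to confirm that the pod behind that IP is reachable.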

dragonflyoss / Dragonfly2