Cleaning & scale up system & fault tolerance

EricDinging commented 1 year ago

[x] Proactive scheduling
- [x] Client assignment returns a waiting time (avg round time); the client repeatedly pings the job during the waiting time
- [x] Proactive scheduling
[x] Client TTL
[ ] Clean up the interface so others can use it, and keep the current framework clean
[ ] Implement a bare-metal distributed system
- [x] Load balancer
- [ ] Health check
[ ] Hot-standby job manager and scheduler
[ ] Scheduler client db selection strategy & optimization
[x] Client database sharding
[ ] Job database replication
[ ] Full FedScale integration (dataset download...and other types of tasks)
[ ] Logging and plotting
- [ ] OpenTelemetry

EricDinging commented 1 year ago

With proactive scheduling, the CM has difficulty knowing whether each client succeeded or failed, so client utilization should be obtained from a client-side monitor.
We also need to distinguish the "no job" case from the "no eligible job" case.
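
One way to keep the two cases apart is to return an explicit status with every assignment reply. This is only a minimal sketch: `AssignStatus`, `Job`, `min_cpu`, and `assign` are hypothetical names for illustration and are not taken from the Propius code base.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class AssignStatus(Enum):
    ASSIGNED = "assigned"
    NO_JOB = "no_job"                    # no job is registered at all
    NO_ELIGIBLE_JOB = "no_eligible_job"  # jobs exist, but none accepts this client


@dataclass
class Job:
    job_id: int
    min_cpu: int  # hypothetical eligibility constraint, for illustration only


def assign(client_cpu: int, jobs: list[Job]) -> tuple[AssignStatus, list[int]]:
    """Return a status plus the ids of the jobs this client may join."""
    if not jobs:
        return AssignStatus.NO_JOB, []
    eligible = [j.job_id for j in jobs if client_cpu >= j.min_cpu]
    if not eligible:
        return AssignStatus.NO_ELIGIBLE_JOB, []
    return AssignStatus.ASSIGNED, eligible
```

With a status like this, a client-side monitor could count idle time caused by an empty system separately from idle time caused by constraints the client cannot meet.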

EricDinging commented 1 year ago

It will be hard for clients to provide a TTL, since client conditions are always fluctuating. A better approach would be to let clients constantly ping Propius, or the PS, for tasks. Clients could leave at any time.

Assume clients check in to the system only once. We then have two choices: let the clients continuously ping Propius until a task is available, or quickly assign a task (whether or not it is actively demanding clients) and let the client continuously ping that task.

The former choice would put some burden on the system; the latter would shift the burden to the individual parameter servers.
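
Either choice comes down to the same client-side loop, only pointed at a different endpoint (Propius or the parameter server). A rough sketch follows, assuming a hypothetical `ping` callable that returns a task id or `None`; the function name, parameters, and the injectable `sleep`/`clock` are illustrative and not part of Propius.

```python
import time
from typing import Callable, Optional


def poll_until_task(
    ping: Callable[[], Optional[str]],
    ttl_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Ping until a task is returned or the time budget runs out."""
    deadline = clock() + ttl_s
    while True:
        task = ping()
        if task is not None:
            return task
        # Give up if the next ping would land after the deadline.
        if clock() + interval_s > deadline:
            return None
        sleep(interval_s)
```

The ping interval is what decides how much load lands on whichever side is being polled, so it is the main knob in the trade-off described above.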

A client could provide two TTLs (with default values): the first is used in communication with Propius, the second in communication with the PS. Propius needs the first TTL; it is up to the job whether to use the second. When the first TTL is used up, Propius will assign, in addition to normal tasks if there are any, tasks that are 1. not actively allocating and 2. asking to be proactively scheduled.
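
The two-TTL idea could be sketched roughly as below. All names (`ClientTTL`, `Task`, `candidate_tasks`) and the default values are assumptions made for illustration; the actual defaults and data model are not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClientTTL:
    propius_ttl_s: float = 60.0  # hypothetical default: time spent pinging Propius
    ps_ttl_s: float = 120.0      # hypothetical default: time spent pinging the PS


@dataclass
class Task:
    task_id: int
    actively_allocating: bool  # the job is currently demanding clients
    wants_proactive: bool      # the job asked to be proactively scheduled


def candidate_tasks(tasks: list[Task], propius_ttl_expired: bool) -> list[int]:
    """Pick the task ids Propius may hand to a client."""
    normal = [t.task_id for t in tasks if t.actively_allocating]
    if not propius_ttl_expired:
        return normal
    # Once the first TTL is used up, also offer tasks that are not actively
    # allocating but have asked for proactive scheduling.
    proactive = [
        t.task_id
        for t in tasks
        if not t.actively_allocating and t.wants_proactive
    ]
    return normal + proactive
```

The second TTL is only carried along in this sketch; whether a job honours it is left to the job, as noted above.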

EricDinging / Propius

Cleaning & scale up system & fault tolerance #6