EricDinging / Propius

Collaborative machine learning (federated learning) resource manager
https://ericdinging.github.io/file/propius.pdf
MIT License

Cleaning & scale up system & fault tolerance #6

Open EricDinging opened 1 year ago

EricDinging commented 1 year ago
  1. After proactive scheduling, it is difficult for the CM to know whether each client succeeded or failed, so client utilization should be obtained from a client-side monitor.
  2. Need to distinguish the "no job" case from the "no eligible job" case.
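
A minimal sketch of how the two cases in item 2 could be told apart. Everything named here (`AssignStatus`, `Job`, `min_cpu`, `assign`) is hypothetical and not taken from the Propius code base; the single CPU threshold stands in for whatever eligibility check the real system uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class AssignStatus(Enum):
    ASSIGNED = "assigned"
    NO_JOB = "no_job"                    # nothing is registered at all
    NO_ELIGIBLE_JOB = "no_eligible_job"  # jobs exist, but none fits this client


@dataclass
class Job:
    job_id: int
    min_cpu: int  # hypothetical eligibility constraint


def assign(client_cpu: int, jobs: List[Job]) -> Tuple[AssignStatus, Optional[int]]:
    """Return a status plus the chosen job id (None when nothing is assigned)."""
    if not jobs:
        return AssignStatus.NO_JOB, None
    for job in jobs:
        if client_cpu >= job.min_cpu:
            return AssignStatus.ASSIGNED, job.job_id
    return AssignStatus.NO_ELIGIBLE_JOB, None
```

With an explicit status, a client could back off longer on `NO_JOB` than on `NO_ELIGIBLE_JOB`, though that policy is also only an assumption.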
EricDinging commented 1 year ago

It will be hard for clients to provide a TTL, since client conditions are always fluctuating. A better way would be to let clients constantly ping Propius, or the PS, for tasks. Clients could leave at any time.

Assume clients only check in to the system once. The choices we have are to let clients continuously ping Propius until a task is available, or to quickly assign a task (whether or not it is actively demanding clients) and let the client continuously ping the task.

The former choice would put some burden on the system; the latter would shift that burden to the individual parameter server.
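
Either choice comes down to the same client-side loop; only the target of the ping changes (Propius in the first case, the parameter server in the second). A rough sketch, assuming a hypothetical `fetch` callable that returns a task or `None`; the function name and its parameters are illustrative only:

```python
import time
from typing import Callable, Optional


def poll_for_task(
    fetch: Callable[[], Optional[str]],
    interval_s: float,
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Ping until fetch() yields a task or the attempt budget runs out."""
    for attempt in range(max_attempts):
        task = fetch()
        if task is not None:
            return task
        # Skip the wait after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return None
```

The load concern is visible here: with N idle clients and interval `interval_s`, whichever side serves `fetch` sees roughly N / `interval_s` pings per second.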

A client could provide two TTLs (with default values): the first used in communication with Propius, the second used in communication with the PS. Propius needs the first TTL; it is up to the job whether to use the second. When the first TTL is used up, Propius will assign, in addition to normal tasks if there are any, tasks which are 1. not actively allocating, 2. asking to be proactively scheduled.
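
A sketch of the two-TTL idea under some assumptions: TTLs are counted in ping rounds, the defaults are made-up numbers, and the two conditions are read as both being required (a task that is not actively allocating and has asked for proactive scheduling). `Task`, `CheckIn` and `pick_task` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_PROPIUS_TTL = 5  # hypothetical default, in ping rounds
DEFAULT_PS_TTL = 10      # hypothetical default, in ping rounds


@dataclass
class Task:
    task_id: int
    actively_allocating: bool
    wants_proactive: bool


@dataclass
class CheckIn:
    propius_ttl: int = DEFAULT_PROPIUS_TTL  # used by Propius
    ps_ttl: int = DEFAULT_PS_TTL            # passed on; the job may ignore it


def pick_task(tasks: List[Task], pings_used: int, check_in: CheckIn) -> Optional[Task]:
    """Prefer normal tasks; once the first TTL is used up, widen the pool."""
    for task in tasks:
        if task.actively_allocating:
            return task
    if pings_used >= check_in.propius_ttl:
        for task in tasks:
            if not task.actively_allocating and task.wants_proactive:
                return task
    return None
```

A client that receives a proactively scheduled task would then keep pinging the PS for up to `ps_ttl` rounds, if the job chooses to honor it.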