dmlc / ps-lite

A lightweight parameter server interface
http://ps-lite.readthedocs.org
Apache License 2.0
1.54k stars, 542 forks

[BSP model in ps-lite] Discussion about BSP model implementation using PS-lite #143

Open authwork opened 5 years ago

authwork commented 5 years ago

I have surveyed many projects that use ps-lite to implement the BSP model. Most of them simply do:

kv.Wait(kv.Push(keys, vals));
kv.Wait(kv.Pull(keys, &vals));

I do not think this is a real BSP model, because each worker only waits for the completion of its own push, not for the pushes of the other workers.

Based on test_simple_app and docs/overview.md, I think BSP should be implemented like this:

Scheduler

/* The code also shows why the scheduler cannot easily implement SSP or other more complicated models: it relies on Wait to learn the progress of each worker.
In fact, you could use a big table to store all timestamps (s * N entries), and when entering the (s+1)-th iteration, wait for the timestamps of all workers from the 1st iteration. This is similar to the SSP model, but it is not efficient.
*/
 if (IsScheduler()) {
    std::vector<int> ts;
    for (int i = 0; i < n; ++i) {  // one pass per BSP iteration
        ts.clear();
        for (int rank = 0; rank < NumWorkers(); ++rank) {
            // The node id of worker `rank` is rank * 2 + 9, see WorkerRankToID; this step needs to be confirmed.
            ts.push_back(app.Request(head, "body", Postoffice::Get()->WorkerRankToID(rank)));
        }
        for (int t : ts) {
            app.Wait(t);  // barrier: block until every worker has responded
        }
        // If the request can be broadcast to all workers, these two steps may simply be rewritten as:
        // app.Wait(app.Request(head, "body", kWorkerGroup));
    }
 }
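
The timestamp-table idea from the comment above can be sketched without any ps-lite calls. This is only an illustration under assumptions: `ProgressTable`, `MarkFinished` and `CanStart` are hypothetical names, not ps-lite API, and the scheduler is assumed to record each worker's response itself. The indexing follows the comment: entering iteration s + 1 requires every worker to have finished iteration 1, so s = 1 gives the BSP barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper, not part of ps-lite: per-worker count of finished iterations.
class ProgressTable {
 public:
  explicit ProgressTable(int num_workers) : finished_(num_workers, 0) {
    assert(num_workers > 0);
  }

  // Record that worker `rank` has responded for one more iteration.
  void MarkFinished(int rank) { ++finished_.at(rank); }

  // Iterations are numbered from 1. Iteration `iter` may start once every
  // worker has finished iteration `iter - s`. With s == 1 this is BSP:
  // everyone must have finished the previous iteration.
  bool CanStart(int iter, int s) const {
    int slowest = *std::min_element(finished_.begin(), finished_.end());
    return slowest >= iter - s;
  }

 private:
  std::vector<int> finished_;
};
```

With two workers of which only one has finished iteration 1, `CanStart(2, 1)` is false (the BSP barrier holds) while `CanStart(2, 2)` is true (one iteration of slack is allowed).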

Server

   server->set_request_handle(KVServerDefaultHandle<float>()); //using the default

Worker

   worker->set_request_handle(request_handle);
   void request_handle(const SimpleData& req, SimpleApp* app) {
        // we can check the head and body sent from the scheduler (req.head, req.body)
        Read(&X, &Y);                 // read a minibatch with b / num_workers examples
        kv.Wait(kv.Pull(&w));         // pull the latest weights from the servers
        ComputeGrad(X, Y, w, &grad);  // compute the gradient
        kv.Wait(kv.Push(grad));       // push my update to the servers
        app->Response(req);           // respond to the scheduler
   }

I think the overall logic is similar to the BSP SGD described in docs/overview.md.

mli commented 5 years ago

On the server side, it waits for all workers' data and merges them before sending the ACK back for the workers' requests.

authwork commented 5 years ago

> On the server side, it waits for all workers' data and merges them before sending the ACK back for the workers' requests.

Exactly, but this is the description of the BSP model itself. (It means the server needs to wait for all workers' data.)
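
As a rough illustration of that server-side behaviour, the sketch below buffers one round of pushes and only reports that ACKs may be released once every worker has contributed. This is not what `KVServerDefaultHandle` does (as far as I can tell it applies each push as it arrives and responds immediately). `SyncAggregator` and its members are hypothetical names, and a real handler would also have to keep the pending requests so it can respond to them later.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not ps-lite API: merges one round of gradient pushes.
class SyncAggregator {
 public:
  SyncAggregator(int num_workers, std::size_t dim)
      : num_workers_(num_workers), sum_(dim, 0.0f), weights_(dim, 0.0f) {
    assert(num_workers > 0);
  }

  // Accumulate one worker's push. Returns true when this push completes the
  // round, i.e. the caller may now ACK all buffered requests.
  bool Push(const std::vector<float>& grad) {
    assert(grad.size() == sum_.size());
    for (std::size_t i = 0; i < grad.size(); ++i) sum_[i] += grad[i];
    if (++received_ < num_workers_) return false;  // keep waiting, no ACK yet
    // All workers have pushed: apply the merged update and reset the round.
    for (std::size_t i = 0; i < sum_.size(); ++i) {
      weights_[i] += sum_[i];
      sum_[i] = 0.0f;
    }
    received_ = 0;
    return true;
  }

  const std::vector<float>& weights() const { return weights_; }

 private:
  int num_workers_;
  int received_ = 0;
  std::vector<float> sum_;
  std::vector<float> weights_;
};
```

With two workers, the first `Push` returns false and leaves the weights untouched; the second returns true after the merged update has been applied.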

In the real implementation, we need to use the scheduler to manage the data synchronization (see here) without changing the KVServerDefaultHandle.

// WaitAllFinished();
for (int t : ts) {
    app.Wait(t);  // wait until all workers have finished their push
}
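
To see the effect of this wait loop outside ps-lite, here is a small simulation with standard threads. It is only a sketch: `std::async` stands in for `app.Request`, `future::get` stands in for `app.Wait`, and the function name `RunBspRounds` and the atomic counter are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

// Simulated scheduler loop: returns, for each round, how many worker steps
// had completed when the barrier was passed.
std::vector<int> RunBspRounds(int num_workers, int num_rounds) {
  assert(num_workers > 0 && num_rounds >= 0);
  std::atomic<int> steps_done{0};
  std::vector<int> done_at_barrier;
  for (int round = 0; round < num_rounds; ++round) {
    std::vector<std::future<void>> ts;  // plays the role of the timestamp list
    for (int rank = 0; rank < num_workers; ++rank) {
      // "Request": ask one worker to run its pull / compute / push step.
      ts.push_back(std::async(std::launch::async, [&steps_done] { ++steps_done; }));
    }
    for (auto& t : ts) t.get();  // "Wait": block until every worker has responded
    done_at_barrier.push_back(steps_done.load());
  }
  return done_at_barrier;
}
```

Because the scheduler only starts round r + 1 after every future of round r has been collected, the counter read at each barrier is always a multiple of `num_workers`; no worker can run ahead of the others.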