CNevd / Difacto_DMLC

Distributed FM and LR based on Parameter Server with Ftrl
128 stars 76 forks source link

How the training or prediction be started? #7

Closed formath closed 8 years ago

formath commented 8 years ago

I found the training or prediction is started in the last of

void StartDispatch() namely,

    /*ask all workers to start by sending an empty workload*/
    Workload wl;
    SendWorkload(ps::kWorkerGroup, wl);

But if the workload in the request sent to worker node is empty, worker node just jump out the process method and response nothing. And then the scheduler node could't receive response so not to assign new training or prediction workload to workers. So, how the training or prediction be started? I'm a little confused about this. Thanks for your response.

CNevd commented 8 years ago

But if the workload in the request sent to worker node is empty, worker node just jump out the process method and response nothing------nope, worker will reply an empty ACK message to the scheduler, see executor.cc

formath commented 8 years ago

@CNevd hidden so deep. I found in DataParWorker::ProcessRequest if (wl.Empty()) return; If worker jump out the function ProcessRequest, when the ACK been sent? And I found scheduler will check the response in DataParScheduler::ProcessResponse. If just ACK be sent but not response, how the scheduler assign new workload to workers?

virtual void ProcessResponse(ps::Message* response) {
    DataParCmd cmd(response->task.cmd());
    if (!cmd.process()) return;
    ....
    //assign new workload
    ....
}
CNevd commented 8 years ago

the reply msg is sent after ProcessRequest is done, check the function Executor::Reply you'll find the reply msg contains the same cmd, so DataParScheduler::ProcessResponse will execute

 // a worker finished a workload, assign it a new one if available
    pool_.Finish(id);
    Workload wl = workload_; pool_.Get(id, &wl);
    if (wl.Empty()) return;
    wl.file[0].format = data_format_;
    SendWorkload(id, wl);