dmlc / ps-lite

A lightweight parameter server interface
http://ps-lite.readthedocs.org
Apache License 2.0

Can not train a model with multi_process kvworkers #182

Open kangshantong opened 3 years ago

kangshantong commented 3 years ago

Hi, I am trying to train models with ps-lite. It works well in multi-threaded mode, as in test_kv_app_multi_workers, but in multi-process mode only one worker process runs and the others block in the ps::Start stage.

Tracing the code of ps::Start, we find a barrier in this stage. After the scheduler, servers, and workers have all sent the barrier command, every node should be released by setting its barrier_done_ entry to true. However, the code below sets barrier_done_ to true only for customer_id 0. When a process hosts a single customer, size is 1, so the loop only ever touches index 0, whatever customer ID that process actually uses.

    void Postoffice::Manage(const Message& recv) {
      CHECK(!recv.meta.control.empty());
      const auto& ctrl = recv.meta.control;
      if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
        barrier_mu_.lock();
        auto size = barrier_done_[recv.meta.app_id].size();
        for (size_t customer_id = 0; customer_id < size; customer_id++) {
          barrier_done_[recv.meta.app_id][customer_id] = true;
        }
        barrier_mu_.unlock();
        barrier_cond_.notify_all();
      }
    }
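The effect of the index-based loop can be reproduced outside ps-lite. The sketch below is not ps-lite code: `DoneMap`, `MarkDoneByIndex`, `IsReleased`, and `DemoStuckWorker` are hypothetical names, and it assumes that a worker process registers exactly one customer under a non-zero ID (1 here), which is my reading of the multi-process setup described above.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for one app's barrier entries: customer_id -> done flag.
using DoneMap = std::unordered_map<int, bool>;

// Mirrors the index-based loop quoted above: marks keys 0..size-1 as done.
void MarkDoneByIndex(DoneMap& done) {
  const std::size_t size = done.size();  // read once, before any insertion
  for (std::size_t customer_id = 0; customer_id < size; ++customer_id) {
    // operator[] silently inserts the key if it is not present yet.
    done[static_cast<int>(customer_id)] = true;
  }
}

// True if the customer registered under customer_id has been released.
bool IsReleased(const DoneMap& done, int customer_id) {
  auto it = done.find(customer_id);
  return it != done.end() && it->second;
}

// Assumed scenario: a single customer registered with ID 1 in this process.
void DemoStuckWorker() {
  DoneMap done;
  done[1] = false;
  MarkDoneByIndex(done);
  assert(IsReleased(done, 0));   // key 0 was created and set to true
  assert(!IsReleased(done, 1));  // the registered customer is still waiting
}
```

Under that assumption the map has size 1, so only key 0 is written and the customer that is actually waiting never sees its flag change, which would match the blocked workers reported here.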

kangshantong commented 3 years ago

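The bug can be fixed with the following code, which iterates over the customer IDs actually present in the map instead of assuming they start at 0: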

    void Postoffice::Manage(const Message& recv) {
      CHECK(!recv.meta.control.empty());
      const auto& ctrl = recv.meta.control;
      if (ctrl.cmd == Control::BARRIER && !recv.meta.request) {
        barrier_mu_.lock();
        for (auto iter = barrier_done_[recv.meta.app_id].begin();
             iter != barrier_done_[recv.meta.app_id].end(); iter++) {
          size_t customer_id = iter->first;
          barrier_done_[recv.meta.app_id][customer_id] = true;
        }
        barrier_mu_.unlock();
        barrier_cond_.notify_all();
      }
    }
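For comparison, the same idea can be written with a range-based loop and `std::lock_guard`. This is a standalone sketch rather than a patch against ps-lite: the `BarrierState` class and its methods, along with `DemoFix`, are hypothetical, and the nested `unordered_map` layout is assumed from the way the code above uses `iter->first`.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <unordered_map>

// Toy stand-in for the barrier bookkeeping; not the real Postoffice class.
class BarrierState {
 public:
  // Records that customer_id of app_id is waiting at the barrier.
  void Register(int app_id, int customer_id) {
    std::lock_guard<std::mutex> lk(mu_);
    done_[app_id][customer_id] = false;
  }

  // Releases every registered customer of app_id, whatever its ID.
  void ReleaseAll(int app_id) {
    {
      std::lock_guard<std::mutex> lk(mu_);  // unlocks even on early exit
      auto app = done_.find(app_id);
      if (app != done_.end()) {
        for (auto& entry : app->second) entry.second = true;
      }
    }
    cond_.notify_all();  // wake waiters after the lock is dropped
  }

  bool IsReleased(int app_id, int customer_id) {
    std::lock_guard<std::mutex> lk(mu_);
    auto app = done_.find(app_id);
    if (app == done_.end()) return false;
    auto customer = app->second.find(customer_id);
    return customer != app->second.end() && customer->second;
  }

 private:
  std::mutex mu_;
  std::condition_variable cond_;
  std::unordered_map<int, std::unordered_map<int, bool>> done_;
};

// Customers with arbitrary IDs are all released; no stray key 0 is created.
void DemoFix() {
  BarrierState state;
  state.Register(0, 1);
  state.Register(0, 7);
  state.ReleaseAll(0);
  assert(state.IsReleased(0, 1));
  assert(state.IsReleased(0, 7));
  assert(!state.IsReleased(0, 0));
}
```

Iterating over the existing entries avoids both problems of the index-based loop: it reaches customers whose IDs are not 0..size-1, and it never inserts keys that no customer registered.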

kangshantong commented 3 years ago

@eric-haibin-lin can you review this commit?