alibaba / x-deeplearning

An industrial deep learning framework for high-dimension sparse data
Apache License 2.0

SyncHook does not synchronize across workers? #364

Closed: nrailg closed this 2 years ago

nrailg commented 2 years ago

This is adapted from the deepctr example. Worker 0 starts computing the loss without waiting for worker 1, while worker 1 must wait for worker 0. If I kill worker 0, worker 1 sometimes gets stuck and sometimes does not. Is this the expected behavior?

import numpy as np
import xdl

def train():
    # `reader` and `model` come from the deepctr example this is adapted from.
    batch = reader.read()
    emb1 = xdl.embedding('emb1', batch['sparse0'], xdl.TruncatedNormal(stddev=0.001), 8, 1024, vtype='hash')
    emb2 = xdl.embedding('emb2', batch['sparse1'], xdl.TruncatedNormal(stddev=0.001), 8, 1024, vtype='hash')
    loss = model(batch['deep0'], [emb1, emb2], batch['label'])
    train_op = xdl.SGD(0.5).optimize()
    log_hook = xdl.LoggerHook(loss, "loss:{0}", 10)
    hooks = [log_hook]

    # Hook that is expected to act as a per-step barrier across all workers.
    sync_hook = xdl.SyncRunHook(xdl.get_task_index(), xdl.get_task_num())
    hooks.append(sync_hook)

    # Create the session once, after all hooks are registered (the original
    # code also created a hook-less session earlier, which was never used).
    sess = xdl.TrainSession(hooks=hooks)
    while not sess.should_stop():
        sess.run(train_op)

    xdl.worker_report_finish_op(np.array(xdl.get_task_index(), dtype=np.int32))
Launch commands:

python deepctr.py --task_name=scheduler --zk_addr=zfs://127.0.0.1:2181 --ps_num=1 --ps_cpu_cores=10 --ps_memory_m=4000 --ckpt_dir=./checkpoint

python deepctr.py --task_name=ps --task_index=0 --zk_addr=zfs://127.0.0.1:2181

python deepctr.py --task_name=worker --task_index=0 --task_num=2 --zk_addr=zfs://127.0.0.1:2181

python deepctr.py --task_name=worker --task_index=1 --task_num=2 --zk_addr=zfs://127.0.0.1:2181
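For reference, here is the behavior I would expect from a per-step sync hook, modeled as a plain Python sketch (not XDL code; `NUM_WORKERS`, `NUM_STEPS`, and the `threading.Barrier` are illustrative assumptions): no worker should start step s+1 until every worker has finished step s.

```python
import threading

NUM_WORKERS = 2   # illustrative; matches --task_num=2 above
NUM_STEPS = 3
barrier = threading.Barrier(NUM_WORKERS)
log = []
lock = threading.Lock()

def worker(idx):
    for step in range(NUM_STEPS):
        with lock:
            log.append((step, idx))  # record which step this worker is on
        barrier.wait()  # no worker advances to step+1 until all have arrived

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the barrier in place, all entries for step s precede any entry for s+1.
steps_seen = [s for s, _ in log]
assert steps_seen == sorted(steps_seen)
```

In the observed XDL run, worker 0 advances without worker 1, which is what this sketch rules out.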