MPI distributed training: test_program initialization problem

shuoyin commented 5 years ago

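Paddle version: 1.5. Training environment: an MPI cluster. The model is trained in pserver-trainer mode and includes a distributed embedding table. The programs are created as follows: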

```python
with fluid.program_guard(train_program, startup_program):
    with fluid.unique_name.guard():
        graph_wrapper = pgl.graph_wrapper.GraphWrapper(
            "sub_graph", place, node_feat=data['graph'].node_feat_info())
        model_loss, recall, precision, all_acc = build_graph_model(
            graph_wrapper,
            hidden_size=args.hidden_size,
            graphsage_type=args.graphsage_type,
            k_hop=len(samples),
            num_node=data['graph'].num_nodes)
        # Clone for evaluation before the optimizer ops are added.
        test_program = train_program.clone(for_test=True)

with fluid.program_guard(train_program, startup_program):
    with fluid.unique_name.guard():
        # Note: the variable is called `adam`, but the optimizer is plain SGD.
        adam = fluid.optimizer.SGD(learning_rate=args.lr)
        adam.minimize(model_loss)
```

The training code is as follows:


```python
for epoch in range(args.epoch):
    batch = 0
    start = time.time()
    start_batch = time.time()
    batch_time = 0
    for batch_feed_dict in train_iter():
        end_batch = time.time()
        batch_time += (end_batch - start_batch)  # time spent reading the batch
        batch += 1
        if batch % 100 == 0:
            # Every 100 batches: fetch the metrics and print the training log.
            outs = exe.run(train_program, feed=batch_feed_dict,
                           fetch_list=fetch_list)
            end = time.time()
            log.info('epoch: %d, batch: %d, loss: %f, recall: %f, precision: %f, all_acc: %f, avg time: %f, avg read time: %f' %
                     (epoch, batch, outs[0], outs[1], outs[2], outs[3], (end - start) / 100.0, batch_time / batch))
            # Then evaluate, and checkpoint when recall improves.
            recall = run_val(val_iter, exe, test_program, 'eval', fetch=fetch_list)
            if recall > best_recall:
                best_recall = recall
                fluid.io.save_persistables(exe, './checkpoint', train_program)
            start = time.time()
        else:
            exe.run(train_program, feed=batch_feed_dict)
        start_batch = time.time()
```
Training runs normally, but run_val fails with paddle.fluid.core_avx.EnforceNotMet: Invoke operator lookup_table error.
together with
C++ Callstacks:
holder_ should not be null
Tensor not initialized yet when Tensor::type() is called. at [/paddle/paddle/fluid/framework/tensor.h:139]
[Task link](http://10.73.201.14:8910/fileview.html?type=logsdir&path=/&instance=5.app-user-20190911150613-6320--yinshuo01_test_paddle)
Thanks!
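
The message above says that lookup_table read a tensor that had never been given storage. One way to narrow this kind of failure down is to list the variables a program reads and report those with no initialized value. The sketch below is framework-free and only illustrates the idea: `find_uninitialized`, `fake_scope`, `fake_is_initialized`, and the variable names are hypothetical. Connecting the `is_initialized` predicate to Paddle's scope (for example through `fluid.global_scope().find_var(name)`) is an assumption and is not shown.

```python
def find_uninitialized(var_names, is_initialized):
    """Return the names for which is_initialized(name) is falsy, in input order."""
    return [name for name in var_names if not is_initialized(name)]


# Hypothetical stand-in for a scope: name -> value.
# None means "declared but never filled".
fake_scope = {
    "fc_0.w_0": [0.1, 0.2],
    "fc_0.b_0": [0.0],
    "embedding_table": None,
}


def fake_is_initialized(name):
    # A name that is absent from the scope also counts as uninitialized.
    return fake_scope.get(name) is not None


missing = find_uninitialized(
    ["fc_0.w_0", "embedding_table", "fc_0.b_0", "not_in_scope"],
    fake_is_initialized,
)
print(missing)
```

Running such a check against both train_program and test_program, just before run_val, would show whether the two programs disagree about which variables exist on the trainer.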

JiabinYang commented 5 years ago

Could you post how your run_val is run?

shuoyin commented 5 years ago

```python
def run_val(batch_iter, exe, program, prefix, fetch):
    """Evaluate `program` over all batches and log sample-weighted metrics."""
    batch = 0
    total_loss = 0.
    total_acc, total_recall, total_precision = 0.0, 0.0, 0.0
    total_sample = 0
    start = time.time()
    for batch_feed_dict in batch_iter():
        batch += 1
        batch_loss, recall, precision, all_acc = exe.run(
            program, fetch_list=fetch, feed=batch_feed_dict)

        # Weight each batch metric by the number of samples in the batch.
        num_samples = len(batch_feed_dict["node_index"])
        total_loss += batch_loss * num_samples
        total_acc += all_acc * num_samples
        total_recall += recall * num_samples
        total_precision += precision * num_samples
        total_sample += num_samples
    end = time.time()

    log.info("%s Loss %.5f recall %.5f precision %.5f Acc %.5f Speed(per batch) %.5f sec" %
             (prefix, total_loss / total_sample, total_recall / total_sample,
              total_precision / total_sample, total_acc / total_sample, (end - start) / batch))
    # Note: this returns the sample-weighted sum of recall, not the mean logged above.
    return total_recall
```
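
run_val weights each batch metric by its sample count, logs the mean, and returns the un-normalized sum `total_recall`. The sketch below shows the same sample-weighted aggregation in isolation, returning the mean instead. `weighted_mean` and the numbers are hypothetical, and whether the caller should compare means or sums is a choice this thread does not settle.

```python
def weighted_mean(values, weights):
    """Sample-weighted mean of per-batch metric values."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no samples to average over")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two hypothetical batches: recall 0.5 on 30 samples, recall 1.0 on 10 samples.
batch_recall = [0.5, 1.0]
batch_sizes = [30, 10]

mean_recall = weighted_mean(batch_recall, batch_sizes)
print(mean_recall)
```

The weighted mean stays comparable across evaluation runs even if the number of validation samples changes, which the raw sum does not.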

shuoyin commented 5 years ago

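After I changed the embedding table from distributed to non-distributed, it runs, so the distributed table may be the problem. How should this be solved while keeping the distributed table? Our vocabulary is fairly large.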

seiriosPlus commented 5 years ago

Could you post the specific case so we can see how to solve it?

paddle-bot-old[bot] commented 4 years ago

Since you haven't replied for more than a year, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.

PaddlePaddle / Paddle

MPI distributed training: test_program initialization problem #19762