PaddlePaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.
Apache License 2.0

paddle.fluid.core_avx.EnforceNotMet: Invoke operator mul_grad error #3348

Open lxk1990727 opened 5 years ago

lxk1990727 commented 5 years ago

During multi-GPU training, the following error is raised: paddle.fluid.core_avx.EnforceNotMet: Invoke operator mul_grad error

Traceback (most recent call last):
  File "train.py", line 165, in <module>
    train(args)
  File "train.py", line 134, in train
    infer_outs = exe.run(compiler_prog, fetch_list=fetch_list)
  File "/home/work/lixiaokang04/tools/paddle_release_home/python/lib/python2.7/site-packages/paddle/fluid/executor.py", line 666, in run
    return_numpy=return_numpy)
  File "/home/work/lixiaokang04/tools/paddle_release_home/python/lib/python2.7/site-packages/paddle/fluid/executor.py", line 528, in _run_parallel
    exe.run(fetch_var_names, fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: Invoke operator mul_grad error.
Python Callstacks:
  File "/home/work/lixiaokang04/tools/paddle_release_home/python/lib/python2.7/site-packages/paddle/fluid/framework.py", line 1771, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/work/lixiaokang04/tools/paddle_release_home/python/lib/python2.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
    return self.main_program.current_block().append_op(*args, **kwargs)
  File "/home/work/lixiaokang04/tools/paddle_release_home/python/lib/python2.7/site-packages/paddle/fluid/layers/nn.py", line 334, in fc
    "y_num_col_dims": 1})
  File "/home/work/lixiaokang04/data/ernie/vvt_ernie_embs/models/video_text/tsn_res_model.py", line 158, in net
    size=output_dim, bias_attr=False)
  File "/home/work/lixiaokang04/data/ernie/vvt_ernie_embs/models/video_text/video_text.py", line 157, in build_model
    self.video_emb_neg = videomodel.net(input = self.feature_input[7], output_dim=cfg['tsn_output_size'])
  File "train.py", line 99, in train
    train_model.build_model()
  File "train.py", line 165, in <module>
    train(args)
C++ Callstacks:
The places of matrices must be same at [/paddle/paddle/fluid/operators/math/blas_impl.h:392]
PaddlePaddle Call Stacks:
0       0x7f953cc9aad0p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 352
1       0x7f953cc9ae49p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2       0x7f953d42a46cp void paddle::operators::math::Blas<paddle::platform::CUDADeviceContext>::MatMul<float>(paddle::framework::Tensor const&, bool, paddle::framework::Tensor const&, bool, float, paddle::framework::Tensor*, float) const + 412
Xreki commented 5 years ago

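The most likely cause of this error is that training ran on the GPU, so the parameters are stored on the GPU, while inference is being run on the CPU. Could you post the relevant code?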

lxk1990727 commented 5 years ago
  1. My job only does training; there is no inference. 2. It works fine when I switch to a single GPU.
def train(args):
    # parse config
    config = parse_config(args.config)
    train_config = merge_configs(config, 'train', vars(args))
    train_model = models.get_model(args.model_name, train_config, 0.1, mode='train')

    #compiled_prog = compiler.CompiledProgram(train_prog).with_data_parallel(loss_name=loss.name)

    # build model
    startup = fluid.Program()
    train_prog = fluid.Program()
    with fluid.program_guard(train_prog, startup):
        with fluid.unique_name.guard():
            train_model.build_input(not args.no_use_pyreader)
            train_model.build_model()
            train_feeds = train_model.feeds()
            train_outputs = train_model.outputs()
            train_pyreader = train_model.pyreader()

    compiler_prog = fluid.compiler.CompiledProgram(train_prog).with_data_parallel(loss_name=train_outputs[0].name)

    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(startup)

    assert os.path.exists(args.resume), \
            "model dir {} not exist.".format(args.resume)
    def if_exist(var):
        return os.path.exists(os.path.join(args.resume, var.name))
    fluid.io.load_vars(exe, args.resume, predicate=if_exist, main_program=train_prog)

    train_reader = get_reader(args.model_name, 'train', train_config, place)

    fetch_list = [x.name for x in train_outputs]
    train_pyreader.decorate_tensor_provider(train_reader)

    for epoch_id in range(1, 100):
        train_pyreader.start()
        train_iter = 0
        try:
            loss_step = []
            while True:
                infer_outs = exe.run(compiler_prog, fetch_list=fetch_list)
                loss = np.array(infer_outs[0])
                pos_dis_sum = np.array(infer_outs[1])
                neg_dis_sum = np.array(infer_outs[2])
                loss_p = np.array(infer_outs[3])

                loss_step.append(loss[0])
                train_iter += 1
                if train_iter % 10 == 0:
                    cur_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
                    print(cur_time + " epoch %d, Batch %d, loss %f" % (epoch_id, train_iter, sum(loss_step)))
                    print(pos_dis_sum)
                    print(neg_dis_sum)
                    print(loss_p)
                    loss_step = []
                if train_iter % 10 == 0:
                    fluid.io.save_persistables(exe, "/ssd3/lixiaokang04/repr/cm_paddle_"+str(epoch_id)+'_'+str(train_iter), main_program=train_prog)
        except fluid.core.EOFException:
            pass
        finally:
            train_pyreader.reset()
        fluid.io.save_persistables(exe, "/ssd3/lixiaokang04/repr/cm_paddle_"+str(epoch_id), main_program=train_prog)
shippingwang commented 5 years ago

compiler_prog = fluid.compiler.CompiledProgram(train_prog).with_data_parallel(loss_name=train_outputs[0].name)

Could you try moving this line to after
fluid.io.load_vars(exe, args.resume, predicate=if_exist, main_program=train_prog)?
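
The suggested ordering can be sketched roughly as follows. This is only an illustration, not a confirmed fix: the helper name `load_then_compile` is hypothetical, and the `fluid` module is passed in as an argument (an assumption made here so the sketch can be read and exercised without a Paddle installation). The Paddle calls themselves are the ones already used in the posted script.

```python
import os


def load_then_compile(fluid, exe, startup, train_prog, ckpt_dir, loss_name):
    """Hypothetical helper: run startup, load the checkpoint, then compile."""
    # Initialize parameters on the executor's place first.
    exe.run(startup)

    # Same predicate as in the posted script: only load variables
    # that have a matching file in the checkpoint directory.
    def if_exist(var):
        return os.path.exists(os.path.join(ckpt_dir, var.name))

    fluid.io.load_vars(exe, ckpt_dir, predicate=if_exist,
                       main_program=train_prog)

    # Build the data-parallel program only after the checkpoint is loaded.
    return fluid.compiler.CompiledProgram(train_prog).with_data_parallel(
        loss_name=loss_name)
```

In the posted script this would correspond to something like `compiler_prog = load_then_compile(fluid, exe, startup, train_prog, args.resume, train_outputs[0].name)`.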

lxk1990727 commented 5 years ago

Still the same problem. If I don't load the checkpoint, it runs fine.

shippingwang commented 5 years ago

To resume training from a checkpoint, could you try the load_persistables interface?

lxk1990727 commented 5 years ago

That way the parameters can't be loaded at all.

shippingwang commented 5 years ago

fluid.io.load_persistables(exe, args.resume, main_program=train_prog)

lxk1990727 commented 5 years ago

Cannot open file checkpoints/VideoText_epoch4_steps40000/fc_1.w_0 for load op at [/paddle/paddle/fluid/operators/load_op.h:37]
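
The error above says the checkpoint directory has no file for fc_1.w_0, whereas the `if_exist` predicate in the posted script skips such variables without complaint. A minimal sketch for listing which variable files are absent from a checkpoint directory might look like this. The function name `missing_checkpoint_files` is hypothetical, and the way the variable names are gathered (shown in the comment) is an assumption about how one might collect them from the program.

```python
import os


def missing_checkpoint_files(var_names, ckpt_dir):
    """Return the names in var_names that have no file in ckpt_dir."""
    return [name for name in var_names
            if not os.path.isfile(os.path.join(ckpt_dir, name))]


# Assumed usage with the posted script (requires Paddle, not run here):
#   names = [v.name for v in train_prog.list_vars()
#            if fluid.io.is_persistable(v)]
#   print(missing_checkpoint_files(names, args.resume))
```

Any name it returns is a variable that `load_vars` with the `if_exist` predicate would leave at its startup-initialized value instead of restoring from the checkpoint.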

shippingwang commented 5 years ago

Is this the right pretrained model? Try running eval to check.

lxk1990727 commented 5 years ago

On a single GPU it runs without any problems at all.