PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Training error paddle.fluid.core_avx.EnforceNotMet: Invoke operator reshape2 error #19696

Closed: szqxx closed this issue 5 years ago

szqxx commented 5 years ago

1) PaddlePaddle version: 1.5.2.post10.7
3) GPU: NVIDIA-SMI 418.39, Driver Version: 418.39, CUDA Version: 10.1, cuDNN 7.0
4) System environment: CentOS 7, Python 3.7.2


What is puzzling is that the final error message is `but received output_shape[unk_dim_idx] * capacity:0 != -in_size:-267840`. What does it mean that the left-hand side of the comparison is 0 here?
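For context: inside reshape2, the single -1 dimension in the target shape is inferred by integer division of the tensor's element count by `capacity`, the product of the remaining fixed target dimensions, and the operator then enforces that this product reconstructs the element count exactly. A 0 on the left therefore means the inferred dimension truncated to 0, which can only happen when `capacity` is larger than the tensor's actual element count (267840 here), i.e. the fixed dimensions baked into the reshape no longer match the runtime tensor. A minimal sketch with hypothetical numbers:

    # Hypothetical numbers illustrating the check that fails inside reshape2.
    in_size = 267840                # actual element count of the runtime tensor
    capacity = 2 * 140 * 32 * 32    # product of the fixed target dims (made up)
    inferred = in_size // capacity  # value filled in for the -1 dim -> 0
    # reshape2 enforces inferred * capacity == in_size; here 0 != 267840,
    # which is exactly the EnforceNotMet message above.
    assert inferred * capacity != in_size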

The network construction code is as follows:
    def build_model(self, image_shape, backbone):
        self.build_input(image_shape)
        snet = SNet(backbone)
        c4, c5, cglb = snet.net(self.image)
        cem = context_enhancement_module(c4, c5, cglb)
        # 5x5 dw and 1x1 conv
        rpn = depthwise_separable(cem, 245, 256, filter_size=5, groups=1,
                                  stride=1, scale=1, name='rpn')
        print('rpn.shape = ', rpn.shape)
        sam = spatial_attention_module(cem, rpn)
        print('sam.shape = ', sam.shape)
        # RPN
        self.rpn_heads(rpn)
        # Fast RCNN
        self.fast_rcnn_heads(sam)

It is not convenient to share all of the code at the moment. I am confident that my reimplementation of the network itself is correct, but the error occurs at the feed or train stage. Any pointers would be much appreciated, thanks!

sneaxiy commented 5 years ago

Did you run the startup program?

szqxx commented 5 years ago

@sneaxiy Yes, exactly following rcnn/train.py:

    gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
    place = fluid.CUDAPlace(gpu_id) if cfg.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

sneaxiy commented 5 years ago

Could you try setting the environment variable GLOG_vmodule=operator=4 to generate logs, and then paste the log here?
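For reference, the flag has to be in the process environment before Paddle's core library initializes; a minimal sketch, assuming a train.py entry point (equivalent to running `GLOG_vmodule=operator=4 python train.py` from the shell):

    import os

    # Must be set before paddle.fluid is imported, because glog reads the
    # environment when the core library loads.
    os.environ['GLOG_vmodule'] = 'operator=4'

    import paddle.fluid as fluid  # noqa: E402 (import after setting the flag)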

szqxx commented 5 years ago

OK, thanks. I will try it and report back as soon as possible.

szqxx commented 5 years ago

@sneaxiy With GLOG_vmodule=operator=4 there is no other useful output beyond some additional shape information.

The error appears to be caused by an internal error in Paddle's reshape.

    def channel_shuffle(x):
        print('before channel_shuffle, shape is', x.shape)
        batchsize, num_channels, height, width = (
            x.shape[0], x.shape[1], x.shape[2], x.shape[3])
        channels_per_group = num_channels // 2

        # reshape
        x = fluid.layers.reshape(
            x=x, shape=[batchsize, 2, channels_per_group, height, width])

        x = fluid.layers.transpose(x=x, perm=[0, 2, 1, 3, 4])

        # flatten
        x = fluid.layers.reshape(
            x=x, shape=[batchsize, num_channels, height, width])

        return x

This is Paddle's channel shuffle operation, and the error starts right here. I did find a few similar error reports when searching the issues board earlier, and I have been stuck on this for quite a while.

If possible, I would be grateful if you could check the code over a private message. Many thanks!

szqxx commented 5 years ago

@sneaxiy Disabling the channel_shuffle operation lets training run. It cripples the model, but at least it runs now. Thanks!
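For readers who hit the same error: a plausible cause is that `x.shape` bakes compile-time values for the batch and spatial dimensions into the reshape target, and these stop matching the runtime tensor as soon as input sizes vary, as they typically do in detection pipelines. Below is a sketch (not verified against this model) of a variant that builds the target shape from the runtime tensor instead, via the `actual_shape` argument available in fluid 1.5; the static `shape` passed alongside it serves only as a placeholder for compile-time inference, and later fluid releases also accept Variables directly inside the shape list:

    import paddle.fluid as fluid

    def channel_shuffle_dynamic(x, groups=2):
        # Derive the reshape target from the runtime shape of x rather
        # than from the compile-time values in x.shape.
        num_channels = x.shape[1]          # assumed static in this model
        channels_per_group = num_channels // groups

        in_shape = fluid.layers.shape(x)   # runtime [N, C, H, W], int32
        n = fluid.layers.slice(in_shape, axes=[0], starts=[0], ends=[1])
        hw = fluid.layers.slice(in_shape, axes=[0], starts=[2], ends=[4])
        g = fluid.layers.fill_constant(shape=[1], dtype='int32', value=groups)
        cpg = fluid.layers.fill_constant(shape=[1], dtype='int32',
                                         value=channels_per_group)

        # [N, C, H, W] -> [N, groups, C/groups, H, W], resolved at runtime
        shape_5d = fluid.layers.concat([n, g, cpg, hw], axis=0)
        x = fluid.layers.reshape(x,
                                 shape=[-1, groups, channels_per_group, 1, 1],
                                 actual_shape=shape_5d)

        x = fluid.layers.transpose(x, perm=[0, 2, 1, 3, 4])

        # [N, C/groups, groups, H, W] -> [N, C, H, W], resolved at runtime
        c = fluid.layers.fill_constant(shape=[1], dtype='int32',
                                       value=num_channels)
        shape_4d = fluid.layers.concat([n, c, hw], axis=0)
        x = fluid.layers.reshape(x, shape=[-1, num_channels, 1, 1],
                                 actual_shape=shape_4d)
        return x

Whether the placeholder compile-time shapes stay consistent with the downstream RPN and Fast RCNN heads would still need to be checked against the full model.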