PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Training error paddle.fluid.core_avx.EnforceNotMet: Invoke operator reshape2 error #19696

Closed: szqxx closed this issue 5 years ago

szqxx commented 5 years ago

1) PaddlePaddle version: 1.5.2.post10.7
3) GPU: NVIDIA-SMI 418.39, Driver Version: 418.39, CUDA Version: 10.1, cuDNN 7.0
4) System environment: CentOS 7, Python 3.7.2


What is puzzling is that the final error message is `but received output_shape[unk_dim_idx] * capacity:0 != -in_size:-267840`. What does it mean that the left-hand side of the comparison is 0 here?
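For context: inside reshape2, the single -1 dimension in the target shape is inferred by integer division of the tensor's element count by `capacity`, the product of the remaining fixed target dimensions, and the operator then enforces that this product reconstructs the element count exactly. A 0 on the left therefore means the inferred dimension truncated to 0, which can only happen when `capacity` is larger than the tensor's actual element count (267840 here), i.e. the fixed dimensions baked into the reshape no longer match the runtime tensor. A minimal sketch with hypothetical numbers:

    # Hypothetical numbers illustrating the check that fails inside reshape2.
    in_size = 267840                # actual element count of the runtime tensor
    capacity = 2 * 140 * 32 * 32    # product of the fixed target dims (made up)
    inferred = in_size // capacity  # value filled in for the -1 dim -> 0
    # reshape2 enforces inferred * capacity == in_size; here 0 != 267840,
    # which is exactly the EnforceNotMet message above.
    assert inferred * capacity != in_size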

The network construction code is as follows:
    def build_model(self, image_shape, backbone):
        self.build_input(image_shape)
        snet = SNet(backbone)
        c4, c5, cglb = snet.net(self.image)
        cem = context_enhancement_module(c4, c5, cglb)
        # 5x5 dw and 1x1 conv
        rpn = depthwise_separable(cem, 245, 256, filter_size=5, groups=1,
                                  stride=1, scale=1, name='rpn')
        print('rpn.shape = ', rpn.shape)
        sam = spatial_attention_module(cem, rpn)
        print('sam.shape = ', sam.shape)
        # RPN
        self.rpn_heads(rpn)
        # Fast RCNN
        self.fast_rcnn_heads(sam)

It is not convenient to share all of the code at the moment. I am confident that my reimplementation of the network itself is correct, but the error occurs at the feed or train stage. Any pointers would be much appreciated, thanks!

sneaxiy commented 5 years ago

Did you run the startup program?

szqxx commented 5 years ago

@sneaxiy Yes, exactly following rcnn/train.py:

    gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
    place = fluid.CUDAPlace(gpu_id) if cfg.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

sneaxiy commented 5 years ago

Could you try setting the environment variable GLOG_vmodule=operator=4 to generate logs, and then paste the log here?
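For reference, the flag has to be in the process environment before Paddle's core library initializes; a minimal sketch, assuming a train.py entry point (equivalent to running `GLOG_vmodule=operator=4 python train.py` from the shell):

    import os

    # Must be set before paddle.fluid is imported, because glog reads the
    # environment when the core library loads.
    os.environ['GLOG_vmodule'] = 'operator=4'

    import paddle.fluid as fluid  # noqa: E402 (import after setting the flag)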

szqxx commented 5 years ago

OK, thanks. I will try it and report back as soon as possible.

szqxx commented 5 years ago

@sneaxiy With GLOG_vmodule=operator=4 there is no other useful output beyond some additional shape information.

The error appears to be caused by an internal error in Paddle's reshape.

    def channel_shuffle(x):
        print('before channel_shuffle, shape is', x.shape)
        batchsize, num_channels, height, width = (
            x.shape[0], x.shape[1], x.shape[2], x.shape[3])
        channels_per_group = num_channels // 2

        # reshape
        x = fluid.layers.reshape(
            x=x, shape=[batchsize, 2, channels_per_group, height, width])

        x = fluid.layers.transpose(x=x, perm=[0, 2, 1, 3, 4])

        # flatten
        x = fluid.layers.reshape(
            x=x, shape=[batchsize, num_channels, height, width])

        return x

This is Paddle's channel shuffle operation, and the error starts right here. I did find a few similar error reports when searching the issues board earlier, and I have been stuck on this for quite a while.

If possible, I would be grateful if you could check the code over a private message. Many thanks!

szqxx commented 5 years ago

@sneaxiy Disabling the channel_shuffle operation lets training run. It cripples the model, but at least it runs now. Thanks!
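For readers who hit the same error: a plausible cause is that `x.shape` bakes compile-time values for the batch and spatial dimensions into the reshape target, and these stop matching the runtime tensor as soon as input sizes vary, as they typically do in detection pipelines. Below is a sketch (not verified against this model) of a variant that builds the target shape from the runtime tensor instead, via the `actual_shape` argument available in fluid 1.5; the static `shape` passed alongside it serves only as a placeholder for compile-time inference, and later fluid releases also accept Variables directly inside the shape list:

    import paddle.fluid as fluid

    def channel_shuffle_dynamic(x, groups=2):
        # Derive the reshape target from the runtime shape of x rather
        # than from the compile-time values in x.shape.
        num_channels = x.shape[1]          # assumed static in this model
        channels_per_group = num_channels // groups

        in_shape = fluid.layers.shape(x)   # runtime [N, C, H, W], int32
        n = fluid.layers.slice(in_shape, axes=[0], starts=[0], ends=[1])
        hw = fluid.layers.slice(in_shape, axes=[0], starts=[2], ends=[4])
        g = fluid.layers.fill_constant(shape=[1], dtype='int32', value=groups)
        cpg = fluid.layers.fill_constant(shape=[1], dtype='int32',
                                         value=channels_per_group)

        # [N, C, H, W] -> [N, groups, C/groups, H, W], resolved at runtime
        shape_5d = fluid.layers.concat([n, g, cpg, hw], axis=0)
        x = fluid.layers.reshape(x,
                                 shape=[-1, groups, channels_per_group, 1, 1],
                                 actual_shape=shape_5d)

        x = fluid.layers.transpose(x, perm=[0, 2, 1, 3, 4])

        # [N, C/groups, groups, H, W] -> [N, C, H, W], resolved at runtime
        c = fluid.layers.fill_constant(shape=[1], dtype='int32',
                                       value=num_channels)
        shape_4d = fluid.layers.concat([n, c, hw], axis=0)
        x = fluid.layers.reshape(x, shape=[-1, num_channels, 1, 1],
                                 actual_shape=shape_4d)
        return x

Whether the placeholder compile-time shapes stay consistent with the downstream RPN and Fast RCNN heads would still need to be checked against the full model.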