apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

got segfault from lenet with stn example #9050

Closed iblislin closed 6 years ago

iblislin commented 6 years ago

Hi, we encountered a segfault with the STN (spatial transformer network) example. Original issue: https://github.com/dmlc/MXNet.jl/issues/369

TL;DR: The segfault happens in the CPU version of mshadow::BilinearSamplingBackward.

gdb trace here:

Thread 37 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff35a15700 (LWP 13819)]
0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=..., 
    input_data=...) at src/operator/spatial_transformer.cc:120
120                   *(g_input + data_index + 1) += *(grad + grad_index) * top_left_y_w
(gdb) bt
#0  0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=..., 
    input_data=...) at src/operator/spatial_transformer.cc:120
#1  0x00007fff83e5f18c in mxnet::op::SpatialTransformerOp<mshadow::cpu, float>::Backward (this=0x38bcd30, ctx=..., 
    out_grad=std::vector of length 1, capacity 1 = {...}, in_data=std::vector of length 2, capacity 2 = {...}, 
    out_data=std::vector of length 3, capacity 3 = {...}, req=std::vector of length 2, capacity 2 = {...}, 
    in_grad=std::vector of length 2, capacity 2 = {...}, aux_args=std::vector of length 0, capacity 0)
    at src/operator/./spatial_transformer-inl.h:136
(gdb) p grad
$1 = (const float *) 0x7fff251e6f90
(gdb) p top_left_y_w
$2 = 0.376614928
(gdb) p grad_index
$3 = 0
(gdb) p *(grad + grad_index)                                                                                              
$4 = 0.00177509966
(gdb) p g_input + data_index + 1
$5 = (float *) 0x80032442cf50
(gdb) p g_input
$6 = (float *) 0x7fff2442cf50
(gdb) p data_index
$7 = 4294967295

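Actually, data_index has become a negative number: 4294967295 is -1 read as an unsigned 32-bit integer. That is consistent with the pointers above, since g_input + data_index + 1 lands 0x400000000 bytes (2^32 floats) past g_input.

The segfault also reproduces with the Python example (using the 1.0 prebuilt binary); see https://github.com/dmlc/MXNet.jl/issues/369#issuecomment-350617043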
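To make the arithmetic in the trace concrete, here is a minimal C++ sketch. It is not the MXNet source: it assumes the index type is an unsigned 32-bit integer (which the gdb value suggests), and the names WrappedIndex, ByteOffset and AccumulateChecked are hypothetical helpers made up for illustration. The last helper shows one possible way to guard such a write, not necessarily how the operator should be fixed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: the index is an unsigned 32-bit integer, as the gdb value
// 4294967295 suggests. All names in this sketch are hypothetical.
using index_t = std::uint32_t;

// A signed -1 stored in an unsigned 32-bit index wraps to 4294967295.
index_t WrappedIndex(int signed_index) {
  return static_cast<index_t>(signed_index);
}

// Byte distance between g_input and (g_input + data_index + 1) for float
// data, computed in 64 bits as pointer arithmetic does on a 64-bit target.
std::uint64_t ByteOffset(index_t data_index) {
  return (static_cast<std::uint64_t>(data_index) + 1) * sizeof(float);
}

// Hypothetical guard: accumulate only when the signed position lies inside
// the buffer; report whether the write happened.
bool AccumulateChecked(std::vector<float>* buf, std::int64_t pos, float value) {
  assert(buf != nullptr);
  if (pos < 0 || pos >= static_cast<std::int64_t>(buf->size())) {
    return false;  // out of range: skip instead of writing past the buffer
  }
  (*buf)[static_cast<std::size_t>(pos)] += value;
  return true;
}
```

With a 4-byte float, ByteOffset(WrappedIndex(-1)) equals 0x400000000, which matches the difference between the two pointers printed by gdb (0x80032442cf50 and 0x7fff2442cf50).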

./train_mnist.py --network lenet --add_stn --optimizer adam

sami-badawi commented 6 years ago

I get this segfault:

Segmentation fault: 11

when running the C++ version, cpp-package/example/lenet.

This is where the segfault occurs:

    Symbol conv1 =
        Convolution("conv1", data, conv1_w, conv1_b, Shape(5, 5), 20);

I built it on OS X 10.13.2, with as many libraries disabled as possible.

I was able to run the Python version of lenet when I installed it with pip.

haojin2 commented 6 years ago

@iblis17 Can you still reproduce the error with the latest code? I tried the Python reproduction, and it looks like this has already been fixed. If you can confirm that the bug no longer appears on your side, would you mind closing the issue? Thanks!