facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

Gradient checking fails for non-cuDNN GPU convolution wgrad #838

Open lukeyeager opened 7 years ago

lukeyeager commented 7 years ago

This test has so many Hypothesis configurations that it may take several runs before the error appears. To make it show up more reliably, increase the number of generated examples with @hypothesis.settings(max_examples=50).

$ pytest -sv caffe2/python/operator_test/conv_test.py::TestConvolution::test_1d_convolution_nchw
...
Failed. [idx, grad, grad_estimate] are:
[[ 0.         -0.04264414  0.25350109]
 [ 1.          0.00933766 -0.21303922]
 [ 2.          0.10659322  0.00899091]]
...
AssertionError: Gradient check failed for input w

$ pytest -sv caffe2/python/operator_test/conv_test.py::TestConvolution::test_3d_convolution_nchw
...
Failed. [idx, grad, grad_estimate] are:
[[  7.           0.70774764   0.33553407]
 [ 14.           0.20726326  -0.16495049]
 [ 15.           0.35344854   0.26257008]]
...
AssertionError: Gradient check failed for input w
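
For readers unfamiliar with the output above: the three columns appear to be the flat index into w, the gradient returned by the operator, and a finite-difference estimate of the same gradient. The following is a minimal pure-Python sketch of that kind of check for a 1-D convolution. It is not Caffe2's gradient checker; the function names, the sum-of-outputs loss, the step size and the threshold are all assumptions chosen for illustration.

```python
# Minimal sketch (NOT Caffe2's gradient checker): compare an analytic weight
# gradient of a 1-D convolution with a central-difference estimate.
# All names, the loss, eps and threshold are illustrative assumptions.

def conv1d(x, w):
    """y[i] = sum_k x[i + k] * w[k]  (no padding, stride 1, dilation 1)."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]

def loss(x, w):
    """Scalar loss: the sum of all outputs, so dL/dy[i] = 1 for every i."""
    return sum(conv1d(x, w))

def analytic_wgrad(x, w):
    """For the loss above, dL/dw[k] = sum_i x[i + k]."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] for i in range(n_out)) for k in range(len(w))]

def numeric_wgrad(x, w, eps=1e-3):
    """Central difference: (L(w + eps*e_k) - L(w - eps*e_k)) / (2*eps)."""
    grads = []
    for k in range(len(w)):
        w_plus = list(w)
        w_plus[k] += eps
        w_minus = list(w)
        w_minus[k] -= eps
        grads.append((loss(x, w_plus) - loss(x, w_minus)) / (2 * eps))
    return grads

def check_gradient(x, w, threshold=1e-2):
    """Return (idx, grad, grad_estimate) rows whose two values disagree."""
    grad = analytic_wgrad(x, w)
    estimate = numeric_wgrad(x, w)
    return [
        (idx, g, e)
        for idx, (g, e) in enumerate(zip(grad, estimate))
        if abs(g - e) > threshold
    ]

x = [0.5, -1.0, 2.0, 0.25, 1.5]
w = [0.1, -0.2, 0.3]
failures = check_gradient(x, w)
print("failed rows [idx, grad, grad_estimate]:", failures)
```

With a correct analytic gradient the list of failing rows is empty. In the failures reported here the two columns differ by far more than rounding error, and in some rows even in sign, which suggests a wrong gradient rather than a tolerance problem.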

lukeyeager commented 7 years ago

1d convolution still fails with the same error.

3d convolution is possibly worse?

$ pytest -sv caffe2/python/operator_test/conv_test.py::TestConvolution::test_3d_convolution_nchw
Trying example: test_3d_convolution_nchw(self=<conv_test.TestConvolution testMethod=test_3d_convolution_nchw>, input_channels=2, output_channels=1, batch_size=1, stride=1, size=4, kernel=2, dilation=2, pad=0, use_bias=True, gc=<caffe2.proto.caffe2_pb2.DeviceOption at 0x7fcc28dfeb90>, dc=[<caffe2.proto.caffe2_pb2.DeviceOption at 0x7fcc28dfeb18>,
 <caffe2.proto.caffe2_pb2.DeviceOption at 0x7fcc28dfeb90>])
F0830 23:07:05.518461  7324 context_gpu.cu:357] Error at: /caffe2/caffe2/core/context_gpu.cu:357: an illegal memory access was encountered

Aborted (core dumped)

asaadaldien commented 7 years ago

@lukeyeager It's a bug in im2col_nd_gpu_kernel. I tried to track it down, but I wasn't able to get cuda-gdb to work with caffe2, which makes debugging the device code very hard.
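
Since stepping through the device code is hard, one possible workaround is to compare the kernel's output against a small CPU reference. Below is a rough pure-Python sketch of a 1-D im2col with stride, padding and dilation, using the same parameters as the crashing example (size=4, kernel=2, dilation=2, pad=0, stride=1). It is not the Caffe2 kernel, and the function name and column layout are hypothetical. The actual cause of the bug is not identified in this thread; the sketch only shows the index arithmetic that an im2col has to bounds-check.

```python
# Rough CPU reference (NOT the Caffe2 im2col_nd_gpu_kernel): 1-D im2col with
# stride, zero padding and dilation. Name and layout are illustrative only.

def im2col_1d(x, kernel, stride=1, pad=0, dilation=1):
    """Return cols[k][o] = x[o*stride - pad + k*dilation], or 0.0 in padding."""
    size = len(x)
    # A dilated kernel spans dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    out_size = (size + 2 * pad - effective_kernel) // stride + 1
    if out_size <= 0:
        raise ValueError("dilated kernel does not fit in the padded input")
    cols = []
    for k in range(kernel):
        row = []
        for o in range(out_size):
            idx = o * stride - pad + k * dilation
            # Explicit bounds check: reads outside [0, size) are padding.
            row.append(x[idx] if 0 <= idx < size else 0.0)
        cols.append(row)
    return cols

x = [1.0, 2.0, 3.0, 4.0]
cols = im2col_1d(x, kernel=2, stride=1, pad=0, dilation=2)
print(cols)
```

With these parameters the dilated kernel covers 3 input positions, so there are only 2 output positions and the largest index read is 3. An implementation that ignored dilation when computing the output size would read past the end of the input, which is one kind of error that could produce an illegal memory access.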