artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License
264 stars 17 forks

Crash while trying mnist.py example #6

Closed teena3 closed 2 years ago

teena3 commented 2 years ago

Hi, I followed the steps in the instructions, but I get a crash when running the mnist.py example.

Running "gdb --args python3.7 mnist.py --device opencl:0", I get the following SIGSEGV.

Could you please have a look? The gdb backtrace follows:

Thread 1 "python3.7" received signal SIGSEGV, Segmentation fault.

0x00007fbb1a272648 in clRetainMemObject () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1

(gdb) bt

#0 0x00007fbb1a272648 in clRetainMemObject () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1

#1 0x00007fbb1a59a485 in cl::detail::ReferenceHandler<_cl_mem*>::retain (memory=0x544d150) at /usr/include/CL/cl.hpp:1693

#2 0x00007fbb1a59a468 in cl::detail::Wrapper<_cl_mem*>::retain (this=0x7ffd14e01c20) at /usr/include/CL/cl.hpp:1858

#3 0x00007fbb1a605a4b in cl::detail::Wrapper<_cl_mem*>::operator= (this=0x7ffd14e01c20, rhs=...) at /usr/include/CL/cl.hpp:1824

#4 0x00007fbb19fe0879 in cl::Memory::operator= (this=, mem=...) at /usr/include/CL/cl.hpp:3055

#5 cl::Buffer::operator= (this=, buf=...) at /usr/include/CL/cl.hpp:3296

#6 dlprim::Tensor::Tensor (this=0x7ffd14e01bf8, buffer=..., offset=, s=..., d=, is_train=) at /root/dev/dlprimitives/src/tensor.cpp:57

#7 0x00007fbb1a608000 in ptdlprim::todp (tt=...) at /root/dev/pytorch_dlprim/src/utils.cpp:56

#8 0x00007fbb1a5dc6b1 in ptdlprim::convolution_overrideable (input=..., weight=..., bias=..., stride=..., padding=..., dilation=..., transposed=false, output_padding=..., groups=1) at /root/dev/pytorch_dlprim/src/vision_ops.cpp:69

#9 0x00007fbb1a5e400e in c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&, c10::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long> >::operator() (this=0x4ceb270, args=1, args=1, args=1, args=1, args=1, args=1, args=1, args=1, args=1) at /usr/local/lib/python3.7/dist-packages/torch/include/ATen/core/boxing/impl/WrapFunctionIntoRuntimeFunctor.h:18

#10 0x00007fbb1a5e34b2 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&, c10::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) (functor=0x4ceb270, args=1, args=1, args=1, args=1, args=1, args=1, args=1, args=1, args=1) at /usr/local/lib/python3.7/dist-packages/torch/include/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:423

clinfo --list

Platform #0: NVIDIA CUDA
 `-- Device #0: Tesla K80

uname -a

Linux 50f7d1dae2cb 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Commit ID for pytorch_dlprim 7ec2e47cd56fdad86e08d3aff65f7c35fc89b575

Commit ID for dlprimitives 6eb5794aec7b48fe2e2b8d1fa7b1eab712d72d87

Commit ID for pytorch eb74af18af6e90ae47f24997af8468bf7b9deb72

BUILD CMD: USE_CUDA=0 BUILD_BINARY=OFF BUILD_TEST=0 BUILD_CAFFE2_OPS=0 BUILD_CAFFE2=ON USE_FBGEMM=ON python3.7 setup.py install

mnist_crash.txt clinfo.txt

Please let me know if you need any more information.

artyom-beilis commented 2 years ago

Make sure you compile dlprimitives with cl2.hpp and not cl.hpp.

What flags are you using for the dlprimitives build?

The older cl.hpp fails with pytorch for some reason I d

artyom-beilis commented 2 years ago

Exactly. From the backtrace I see you are using cl.hpp:

 /usr/include/CL/cl.hpp:3296
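
Spotting this by eye in a long backtrace is easy to miss, so here is a minimal sketch of automating the check. It assumes the backtrace is available as plain text and that frames carry the usual gdb `at <path>:<line>` suffix. The function name `opencl_bindings_in_backtrace` is hypothetical and is not part of dlprimitives or pytorch_dlprim.

```python
import re

# gdb prints "at <path>:<line>" for frames that have debug info.
_FRAME_SOURCE = re.compile(r"\bat (\S+):(\d+)")

# File names of the OpenCL C++ bindings: the legacy header first,
# then its successors.
_BINDINGS = ("cl.hpp", "cl2.hpp", "opencl.hpp")


def opencl_bindings_in_backtrace(backtrace: str) -> dict:
    """Map each OpenCL C++ binding header seen in the backtrace
    to the sorted source line numbers of its frames."""
    found = {}
    for path, line in _FRAME_SOURCE.findall(backtrace):
        name = path.rsplit("/", 1)[-1]
        if name in _BINDINGS:
            found.setdefault(name, set()).add(int(line))
    return {name: sorted(lines) for name, lines in found.items()}


# Three frames copied from the backtrace in this issue.
sample = """\
#1 0x00007fbb1a59a485 in cl::detail::ReferenceHandler<_cl_mem*>::retain (memory=0x544d150) at /usr/include/CL/cl.hpp:1693
#3 0x00007fbb1a605a4b in cl::detail::Wrapper<_cl_mem*>::operator= (this=0x7ffd14e01c20, rhs=...) at /usr/include/CL/cl.hpp:1824
#7 0x00007fbb1a608000 in ptdlprim::todp (tt=...) at /root/dev/pytorch_dlprim/src/utils.cpp:56
"""

print(opencl_bindings_in_backtrace(sample))  # {'cl.hpp': [1693, 1824]}
```

A `cl.hpp` key in the result means the crashing code path was compiled against the legacy bindings. Frames from `cl2.hpp` or `opencl.hpp` would show up under their own keys instead.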

teena3 commented 2 years ago

Make sure you compile dlprimitives with cl2.hpp and not cl.hpp.

Thank you for the support, this worked. After making sure to use cl2.hpp instead of cl.hpp, I was able to run the mnist.py example :)

artyom-beilis commented 2 years ago

Great. I think I'll add a build-level check so that if dlprimitives is built with cl.hpp, the build of the pytorch backend will fail.

artyom-beilis commented 2 years ago

Added protection to prevent building against a dlprimitives that was built with cl.hpp: 58de5cfd922f0d7d11f5b5da094b27c82d095519
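
For anyone who wants a similar safety net in their own tree, below is a rough sketch of a pre-build source check. It is an illustration only and is not how the commit above implements the protection. The function name `legacy_cl_includes` is hypothetical, and the sketch assumes the header is pulled in by a plain `#include` line.

```python
import re

# Matches '#include <...>' and '#include "..."', allowing spaces after '#'.
_INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')


def legacy_cl_includes(source: str) -> list:
    """Return the 1-based line numbers that include the legacy cl.hpp bindings."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = _INCLUDE.match(line)
        if match and match.group(1).rsplit("/", 1)[-1] == "cl.hpp":
            hits.append(number)
    return hits


# Made-up header contents, for illustration only.
example = '#include <vector>\n#include <CL/cl.hpp>\n#  include "CL/cl2.hpp"\n'

bad_lines = legacy_cl_includes(example)
if bad_lines:
    print(f"legacy cl.hpp included on line(s) {bad_lines}")
```

A build script could run a check like this over the dlprimitives headers and abort when the list is non-empty, so the mismatch surfaces at build time instead of as a segfault at run time.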