artyom-beilis / pytorch_dlprim

DLPrimitives/OpenCL out of tree backend for pytorch
http://blog.dlprimitives.org/
MIT License

Problem with the OpenCL backend in libtorch #46

Open johndoanee opened 10 months ago

johndoanee commented 10 months ago

Hi, I want to use the OpenCL backend for an AMD R5 5600G with the CPU version of libtorch 1.13. So I use the following code in my cpp file.

dlhandler = dlopen("/usr/lib/x86_64-linux-gnu/libpt_ocl.so", RTLD_NOW | RTLD_GLOBAL);

Errors occur when I run it. What should I do to make things right? Thank you!
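As an aside, checking the return value of `dlopen` makes loader failures much easier to diagnose than a later crash. A minimal sketch (the library path comes from the snippet above; the helper name `load_backend` is mine, not part of pytorch_dlprim):

```cpp
#include <dlfcn.h>   // dlopen, dlerror (POSIX)
#include <cstdio>

// Hypothetical helper: load the DLPrimitives backend, reporting dlopen errors.
void *load_backend(const char *path) {
    // RTLD_GLOBAL makes the backend's symbols visible process-wide,
    // which the dispatcher registration in libtorch relies on.
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (!handle)
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return handle;
}
```

Usage would be `load_backend("/usr/lib/x86_64-linux-gnu/libpt_ocl.so")` once, before creating any tensors on the OpenCL device.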

Accessing device #0:gfx90c:xnack- on AMD Accelerated Parallel Processing
terminate called after throwing an instance of 'c10::Error'
what(): Attempted to call `variable.set_data(tensor)`, but `variable` and `tensor` have incompatible tensor type.
Exception raised from set_data at ./torch/csrc/autograd/variable.cpp:477 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9e (0x7f86b65e99ae in /lib/x86_64-linux-gnu/libc10.so.1.13)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xfc (0x7f86b65b68de in /lib/x86_64-linux-gnu/libc10.so.1.13)
frame #2: <unknown function> + 0x3b6beaf (0x7f86ba191eaf in /lib/x86_64-linux-gnu/libtorch_cpu.so.1.13)
frame #3: void torch::nn::Module::to_impl<c10::Device&, bool&>(c10::Device&, bool&) + 0x1d4 (0x7f86bab7e824 in /lib/x86_64-linux-gnu/libtorch_cpu.so.1.13)
frame #4: void torch::nn::Module::to_impl<c10::Device&, bool&>(c10::Device&, bool&) + 0x88 (0x7f86bab7e6d8 in /lib/x86_64-linux-gnu/libtorch_cpu.so.1.13)
frame #5: torch::nn::Module::to(c10::Device, bool) + 0x1c (0x7f86bab7a45c in /lib/x86_64-linux-gnu/libtorch_cpu.so.1.13)
frame #6: main + 0xff (0x559ccc74fdc0 in ./mnist_train)
frame #7: <unknown function> + 0x271ca (0x7f86b61aa1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: __libc_start_main + 0x85 (0x7f86b61aa285 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: _start + 0x21 (0x559ccc74f9c1 in ./mnist_train)

artyom-beilis commented 10 months ago

Honestly, I have never tested it with the C++ API. Do you have a small example of C++ code that runs some very simple torch code (say, adding two tensors) that works on CPU but fails on the OpenCL backend?

johndoanee commented 10 months ago

I wrote two examples. The example in test.tar.gz is simple code that adds two tensors and produces the correct result. But the example in mnist.tar.gz crashes when I use the OpenCL backend. You could have a try. Thank you! test.tar.gz mnist.tar.gz

artyom-beilis commented 10 months ago

Ok, thanks. I reproduced the issue. Can't figure out what it is yet.

johndoanee commented 10 months ago

Hi, I tried the matmul() function in libtorch, and it gives these warnings.

[W tensor_ops.cpp:324] Warning: The operator 'aten::addmm.out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (function fallback)
[W tensor_ops.cpp:324] Warning: The operator 'aten::mm.out' is not currently supported on the ocl backend. Please open an issue at for requesting support https://github.com/artyom-beilis/pytorch_dlprim/issues (function fallback)

Could you please add these features? Thank you!

artyom-beilis commented 10 months ago

matmul is quite complicated, because it supports batched multiplication, broadcasting, etc.

But addmm and mm should be quite easy to do. I just need to find the time to implement them.
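For reference, the semantics the two missing ops need are small: `mm` is a plain 2-D matrix product, and `addmm(input, mat1, mat2, beta, alpha)` computes `beta * input + alpha * (mat1 @ mat2)`. A plain-C++ sketch of just the math (row-major `std::vector` buffers rather than libtorch tensors; function names mine):

```cpp
#include <vector>
#include <cstddef>

// mm: 2-D matrix product, row-major. A is m x k, B is k x n, result is m x n.
std::vector<float> mm(const std::vector<float> &A, const std::vector<float> &B,
                      std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> out(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                out[i * n + j] += A[i * k + p] * B[p * n + j];
    return out;
}

// addmm: beta * C + alpha * (A @ B), matching aten::addmm's definition.
std::vector<float> addmm(const std::vector<float> &C,
                         const std::vector<float> &A, const std::vector<float> &B,
                         std::size_t m, std::size_t k, std::size_t n,
                         float beta = 1.0f, float alpha = 1.0f) {
    std::vector<float> out = mm(A, B, m, k, n);
    for (std::size_t i = 0; i < m * n; ++i)
        out[i] = beta * C[i] + alpha * out[i];
    return out;
}
```

This is also why `addmm`/`mm` are easier than a general `matmul`: there is no batch dimension or broadcasting to handle, just one GEMM plus a scaled add.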