ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Crash when using softmax #1078

Closed abcdrm closed 6 months ago

abcdrm commented 7 months ago

I am using arm_compute-v23.08-bin-android-arm64-v8.2-a-neon-cl.tar.gz, downloaded from the release page.

Platform: Qualcomm 8475

Operating System: Android. The program is compiled with aarch64-linux-android21-clang++ from android-ndk-r25b.

    CLScheduler::get().default_init();
    CLTensor cl_in;
    CLTensor cl_out;
    cl_in.allocator()->init(TensorInfo(TensorShape(32, 32), 1, DataType::F16));
    cl_in.allocator()->allocate();
    cl_out.allocator()->init(TensorInfo(TensorShape(32, 32), 1, DataType::F16));
    cl_out.allocator()->allocate();

    std::cout << "cl_in.info()->total_size() = " << cl_in.info()->total_size() << std::endl;

    cl_in.map(true);
    auto ptr = reinterpret_cast< uint16_t* >(cl_in.buffer());
    for (int i = 0; i < 32 * 32; ++i) {
        ptr[i] = 10000 + i;
    }
    cl_in.unmap();

    CLSoftmaxLayer softmax_op_;
    softmax_op_.configure(&cl_in, &cl_out, 1.0f, 0);
    softmax_op_.run();
    CLScheduler::get().sync();

    cl_out.map(true);
    ptr = reinterpret_cast< uint16_t* >(cl_out.buffer());
    for (int i = 0; i < 32; ++i) {
        for (int j = 0; j < 32; ++j)
            std::cout << ptr[i * 32 + j] << std::endl;

        std::cout << "===============" << std::endl;
    }

Problem description: the program aborts while configuring the softmax layer. GDB backtrace:

Thread 1 "ocl_info" received signal SIGABRT, Aborted.
0x0000007ff6547c48 in abort () from /apex/com.android.runtime/lib64/bionic/libc.so
(gdb) bt
#0  0x0000007ff6547c48 in abort () from /apex/com.android.runtime/lib64/bionic/libc.so
#1  0x0000007ff6ae4ff8 [PAC] in ?? () from /system/lib64/libc++_shared.so
#2  0x0000007ff6ae45dc in ?? () from /system/lib64/libc++_shared.so
#3  0x0000007ff6ae43f4 in __gxx_personality_v0 () from /system/lib64/libc++_shared.so
#4  0x00000055563f1c24 in unwind_phase2 ()
#5  0x00000055563f1cf4 [PAC] in _Unwind_Resume ()
#6  0x00000055563a3044 [PAC] in cl::Program::build(char const*, void (*)(_cl_program*, void*), void*) const ()
#7  0x00000055563a0958 in arm_compute::Program::build(cl::Program const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&) ()
#8  0x00000055563a0f98 in arm_compute::CLCompileContext::create_kernel(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::set<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::less<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::allocator<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > > const&, bool) const ()
#9  0x00000055563b3d60 in arm_compute::create_kernel(arm_compute::CLCompileContext const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::set<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::less<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::allocator<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > > const&) ()
#10 0x00000055563e156c in arm_compute::opencl::kernels::ClLogits1DMaxShiftExpSumKernel::configure(arm_compute::CLCompileContext const&, arm_compute::ITensorInfo const&, arm_compute::ITensorInfo&, arm_compute::ITensorInfo&, arm_compute::ITensorInfo&, arm_compute::SoftmaxKernelInfo const&) ()
#11 0x00000055563df1b8 in arm_compute::opencl::ClSoftmax::configure(arm_compute::CLCompileContext const&, arm_compute::ITensorInfo const&, arm_compute::ITensorInfo&, arm_compute::SoftmaxKernelInfo const&) ()
#12 0x00000055563dd388 in arm_compute::CLSoftmaxLayerGeneric<false>::configure(arm_compute::CLCompileContext const&, arm_compute::ICLTensor const*, arm_compute::ICLTensor*, float, int) ()
#13 0x0000005556386dcc in main ()

morgolock commented 7 months ago

Hi @abcdrm

From the call stack you shared, the failure occurs when the program tries to compile the OpenCL code. Which GPU do you have? Is OpenCL working on your device? ACL is designed and optimised for Mali GPUs. It would help if you could build the library with debug=1 and run your test again to get more information.

I've noticed another problem with your code: you need to move the code that allocates the tensors and fills them with data after the call to configure(), as shown below:

    CLSoftmaxLayer softmax_op_;
    softmax_op_.configure(&cl_in, &cl_out, 1.0f, 0);

    // Allocate the backing memory only after configure().
    cl_in.allocator()->allocate();
    cl_out.allocator()->allocate();

    // Fill the input while it is mapped to host memory.
    cl_in.map(true);
    auto ptr = reinterpret_cast< uint16_t* >(cl_in.buffer());
    for (int i = 0; i < 32 * 32; ++i) {
        ptr[i] = 10000 + i;
    }
    cl_in.unmap();

    softmax_op_.run();

Please see our example https://github.com/ARM-software/ComputeLibrary/blob/main/examples/cl_sgemm.cpp#L139

Hope this helps

abcdrm commented 7 months ago

Hi @morgolock, my code is running on an Adreno 730 from Qualcomm. I have tested both DataType::F16 and DataType::F32 as input: only DataType::F16 hits the issue above, while DataType::F32 works as expected. As for the order of tensor allocation and operator configuration, I tested both orders with DataType::F32 and neither seems to cause any problems. Could you explain in more detail why allocate() should come after configure()?

morgolock commented 7 months ago

Hi @abcdrm, it looks like a problem when building the OpenCL kernel with the FP16 data type. If you build the debug version of the library you may get more details; you can use scons debug=1 to do this.

Have you tried any other OpenCL code using FP16 on your device?

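As for the order of allocation, it matters because of the padding in the tensors: configure() can change a tensor's padding requirements, so the buffer should only be allocated once those are known. For more details please see the documentation: https://arm-software.github.io/ComputeLibrary/latest/architecture.xhtml#architecture_images_tensors_padding_and_border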

Hope this helps