ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

Dead Code Elimination incorrectly optimises away dependent code #115

Open Dantali0n opened 4 years ago

Dantali0n commented 4 years ago

Hello, I am writing an FFT algorithm in OpenCL and have found a pretty nasty bug in the ROCm OpenCL implementation. The problem revolves around the l2 variable in the following kernel:

void kernel fft(global double *real, global double *imag, ulong size, ulong power) {
    double c1 = -1.0;
    double c2 = 0.0;
    long l2 = 1;

    for (uint l = 0; l < power; l++) {
        uint l1 = l2;
        l2 <<= 1;
        double u1 = 1.0;
        double u2 = 0.0;

        for (uint j = 0; j < l1; j++) {
            for (uint i = j; i < size; i += l2) {
                uint i1 = i + l1;
                double t1 = u1 * real[i1] - u2 * imag[i1];
                double t2 = u1 * imag[i1] + u2 * real[i1];

                real[i1] = real[i] - t1;
                imag[i1] = imag[i] - t2;
                real[i] += t1;
                imag[i] += t2;
            }
            double z = ((u1 * c1) - (u2 * c2));
            u2 = ((u1 * c2) + (u2 * c1));
            u1 = z;
        }

        double onecm = 1.0 - c1;
        double onecp = 1.0 + c1;
        c2 = sqrt(onecm / 2.0);
        c1 = sqrt(onecp / 2.0);

        c2 = -c2;   
    }
}

This kernel is launched using a simple global range of 1. So no parallelism at all, single CU, single SE, single wavefront. However, the above kernel produces incorrect results.
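To make the role of l2 concrete, the kernel's index arithmetic can be replayed on the host. The C++ below is only a sketch: the helper name `butterfly_pairs` is hypothetical and not part of the project. It advances l1 and l2 the same way the kernel does and lists the (i, i1) pairs that one stage touches. If the update of l2 were dropped or reordered, both the stride and the partner offset would be wrong.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not from the project): replays the kernel's index
// arithmetic for outer-loop iteration `stage` and returns every (i, i1)
// pair the innermost loop would read and write.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
butterfly_pairs(std::uint64_t size, std::uint32_t stage) {
    std::int64_t l2 = 1;   // same role as `long l2` in the kernel
    std::uint32_t l1 = 0;  // same role as `uint l1` in the kernel

    // Advance l1/l2 exactly as the kernel does at the top of each iteration.
    for (std::uint32_t l = 0; l <= stage; ++l) {
        l1 = static_cast<std::uint32_t>(l2);
        l2 <<= 1;
    }

    std::vector<std::pair<std::uint32_t, std::uint32_t>> pairs;
    for (std::uint32_t j = 0; j < l1; ++j) {
        for (std::uint32_t i = j; i < size;
             i += static_cast<std::uint32_t>(l2)) {
            pairs.emplace_back(i, i + l1);  // i1 = i + l1
        }
    }
    return pairs;
}
```

For size 8 every stage should yield size / 2 = 4 pairs, with the partner offset l1 doubling from 1 to 2 to 4 across the three stages.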

I know for sure this is an optimization bug as forcefully printing l2 during execution makes the kernel produce correct results. Furthermore, adding -cl-opt-disable to the build program options also resolves the issue!

...
for (uint l = 0; l < power; l++) {
    uint l1 = l2;
    l2 <<= 1;
    printf("l2: %u\n", l2);
    double u1 = 1.0;
    double u2 = 0.0;
...

Once again, this cannot be due to concurrency issues as the kernel is launched with

this->cl_queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(1), cl::NullRange);

Setting -WB, -simplifycfg-sink-common=0, as mentioned in the DarkTable issue, does not resolve the issue. Any optimization level above -O0 produces incorrect results.

qishilu commented 4 years ago

Does it work if you replace double with float?

Dantali0n commented 3 years ago

It has been over three months. Any update on this?

b-sumner commented 3 years ago

Sorry for the delay. I wasn't expecting compiler concerns to be reported here. Can you provide the sources for the kernel and a standalone app which drives the kernel and checks that the result is as expected?

Dantali0n commented 3 years ago

> Sorry for the delay. I wasn't expecting compiler concerns to be reported here. Can you provide the sources for the kernel and a standalone app which drives the kernel and checks that the result is as expected?

This project provides the ard-ocl target, whose source can be found in the oclfft folder. Several test cases for ard-ocl are included in the tests folder, which uses Boost as the unit test framework. The FFT function shown in a previous comment on this issue is used, but it produces incorrect results when compared against FFTW. The kernel is launched sequentially, i.e. with a global range of 1. When the kernel code is run on the CPU instead of through ROCm and OpenCL, the results are correct.

FFTW, boost and cmake are required to run the standalone app.
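A check of this kind can also be sketched without FFTW or Boost. The C++ below is a minimal sketch, not the project's test code. `fft_butterflies` is a direct host port of the kernel's loops. `bit_reverse_permute`, `naive_dft` and `max_error_vs_dft` are hypothetical helpers. It assumes the input is bit-reversal permuted before the butterflies run, a step that is not part of the kernel shown in this issue, and it compares against a naive O(n^2) DFT instead of FFTW.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host port of the kernel's butterfly loops, keeping its variable names.
void fft_butterflies(std::vector<double>& real, std::vector<double>& imag,
                     std::uint64_t power) {
    const std::uint64_t size = real.size();
    double c1 = -1.0, c2 = 0.0;
    std::int64_t l2 = 1;
    for (std::uint32_t l = 0; l < power; ++l) {
        const std::uint32_t l1 = static_cast<std::uint32_t>(l2);
        l2 <<= 1;
        double u1 = 1.0, u2 = 0.0;
        for (std::uint32_t j = 0; j < l1; ++j) {
            for (std::uint32_t i = j; i < size;
                 i += static_cast<std::uint32_t>(l2)) {
                const std::uint32_t i1 = i + l1;
                const double t1 = u1 * real[i1] - u2 * imag[i1];
                const double t2 = u1 * imag[i1] + u2 * real[i1];
                real[i1] = real[i] - t1;
                imag[i1] = imag[i] - t2;
                real[i] += t1;
                imag[i] += t2;
            }
            const double z = u1 * c1 - u2 * c2;
            u2 = u1 * c2 + u2 * c1;
            u1 = z;
        }
        const double onecm = 1.0 - c1;
        const double onecp = 1.0 + c1;
        c2 = -std::sqrt(onecm / 2.0);  // kernel: c2 = sqrt(...); c2 = -c2;
        c1 = std::sqrt(onecp / 2.0);
    }
}

// Assumed pre-processing step (not shown in the issue): bit-reversal permutation.
void bit_reverse_permute(std::vector<double>& real, std::vector<double>& imag,
                         std::uint64_t power) {
    for (std::size_t i = 0; i < real.size(); ++i) {
        std::size_t r = 0;
        for (std::uint64_t b = 0; b < power; ++b) {
            if (i & (std::size_t{1} << b)) r |= std::size_t{1} << (power - 1 - b);
        }
        if (i < r) {  // swap each pair only once
            std::swap(real[i], real[r]);
            std::swap(imag[i], imag[r]);
        }
    }
}

// Reference: unnormalized forward DFT, X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n).
std::vector<std::complex<double>> naive_dft(const std::vector<double>& real,
                                            const std::vector<double>& imag) {
    const std::size_t n = real.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            out[k] += std::complex<double>(real[t], imag[t]) * std::polar(1.0, angle);
        }
    }
    return out;
}

// Largest absolute deviation between the ported kernel and the naive DFT.
double max_error_vs_dft(std::vector<double> real, std::vector<double> imag,
                        std::uint64_t power) {
    const auto expected = naive_dft(real, imag);
    bit_reverse_permute(real, imag, power);
    fft_butterflies(real, imag, power);
    double worst = 0.0;
    for (std::size_t k = 0; k < expected.size(); ++k) {
        worst = std::max(worst, std::abs(expected[k] - std::complex<double>(real[k], imag[k])));
    }
    return worst;
}
```

Running the same input through the OpenCL kernel and comparing against `naive_dft` in the same way would give a pass/fail check that does not depend on FFTW.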

perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b.tar.gz

Dantali0n commented 3 years ago

@b-sumner Hello, it has been another three months. On the 1st of September 2020 I provided the isolated app with test cases that compare FFTW against the aforementioned kernel. Could you please try it and confirm the optimization bug? Please note that the kernel works with -O0 and does not with -O1 and above, which indicates it is an optimization bug.

Dantali0n commented 3 years ago

Can someone else please look at this @vsytch @JasonTTang ???

gandryey commented 3 years ago

This is a compiler issue, not runtime. Could you report your problem here https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues ? It might help to get more attention.

Dantali0n commented 3 years ago

> This is a compiler issue, not runtime. Could you report your problem here https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues ? It might help to get more attention.

I will try. Honestly, I have given up all hope of ever getting this fixed.

b-sumner commented 3 years ago

I downloaded the archive and installed everything needed to build, but the build fails because kernel.sh is not found, although cmake expects it: CMakeLists.txt: COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/kernel.sh

Dantali0n commented 3 years ago

Ah, I see. Yes, it has been quite a while since I made this example for the issue. I have fixed the compilation issues now.

perf-engineering-project-ard-seq.zip

b-sumner commented 3 years ago

Well, the build went further, but...

[ 87%] Building CXX object tests/CMakeFiles/testaocl.dir/__/oclfft/ard-ocl/src/ard-ocl.cxx.o
...
In file included from /perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b/oclfft/ard-ocl/src/ard-ocl.cxx:1:0:
/perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b/oclfft/ard-ocl/include/ard-ocl.hpp:12:10: fatal error: CL/cl2.hpp: No such file or directory
 #include <CL/cl2.hpp>
          ^~~~~~~~~~~~
compilation terminated.
tests/CMakeFiles/testaocl.dir/build.make:85: recipe for target 'tests/CMakeFiles/testaocl.dir/__/oclfft/ard-ocl/src/ard-ocl.cxx.o' failed

Unlike several other compile commands, the one for this file did not include the "-isystem /path/to/opencl/headers" option.

Is this a cmake issue? I have cmake version 3.18.1