Compiler incorrectly optimises away dependent code

Dantali0n commented 3 years ago

Hello, I am writing an FFT algorithm in OpenCL and have found a pretty nasty bug in the ROCm OpenCL implementation. The problem revolves around the l2 variable in the following kernel:

void kernel fft(global double *real, global double *imag, ulong size, ulong power) {
    double c1 = -1.0;
    double c2 = 0.0;
    long l2 = 1;

    for (uint l = 0; l < power; l++) {
        uint l1 = l2;
        l2 <<= 1;
        double u1 = 1.0;
        double u2 = 0.0;

        for (uint j = 0; j < l1; j++) {
            for (uint i = j; i < size; i += l2) {
                uint i1 = i + l1;
                double t1 = u1 * real[i1] - u2 * imag[i1];
                double t2 = u1 * imag[i1] + u2 * real[i1];

                real[i1] = real[i] - t1;
                imag[i1] = imag[i] - t2;
                real[i] += t1;
                imag[i] += t2;
            }
            double z = ((u1 * c1) - (u2 * c2));
            u2 = ((u1 * c2) + (u2 * c1));
            u1 = z;
        }

        double onecm = 1.0 - c1;
        double onecp = 1.0 + c1;
        c2 = sqrt(onecm / 2.0);
        c1 = sqrt(onecp / 2.0);

        c2 = -c2;   
    }
}

This kernel is launched with a global range of 1, so there is no parallelism at all: a single CU, a single SE, a single wavefront. Nevertheless, the kernel produces incorrect results.

I know for sure this is an optimization bug as forcefully printing l2 during execution makes the kernel produce correct results. Furthermore, adding -cl-opt-disable to the build program options also resolves the issue!

...
for (uint l = 0; l < power; l++) {
    uint l1 = l2;
    l2 <<= 1;
    printf("l2: %u\n", l2);
    double u1 = 1.0;
    double u2 = 0.0;
...
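For comparison with the printf output, the values that l1 and l2 should take at each outer-loop iteration can be listed on the host. This is only a sketch in plain C; the helper name expected_strides is hypothetical and not part of the attached project.

```c
#include <stdint.h>

/* Hypothetical helper: records the l1 and l2 values the kernel's outer
   loop should see at each stage, using the same widths and update order
   as the kernel (uint l1, long l2). Both output arrays must hold at
   least `power` elements. */
void expected_strides(uint64_t power, uint32_t *l1_out, int64_t *l2_out) {
    int64_t l2 = 1;
    for (uint32_t l = 0; l < power; l++) {
        uint32_t l1 = (uint32_t)l2;  /* value of l2 before the shift */
        l2 <<= 1;                    /* doubles on every stage */
        l1_out[l] = l1;
        l2_out[l] = l2;
    }
}
```

For power = 3 this gives l1 = 1, 2, 4 and l2 = 2, 4, 8, so a correct run should print l2 values that double on every stage.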

Once again, this cannot be due to concurrency issues, as the kernel is launched with:

this->cl_queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(1), cl::NullRange);

Setting -WB, -simplifycfg-sink-common=0, as mentioned in the DarkTable issue, does not resolve the problem. Setting the optimization level to anything above -O0 produces incorrect results.

Please also see: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/issues/115

I have attached a standalone project with an ard-ocl target; its source can be found in the oclfft folder. The tests folder contains several test cases for ard-ocl, which use Boost as the unit test framework. The FFT function shown earlier in this issue is used, but it produces incorrect results when compared against FFTW. The kernel is launched sequentially, i.e. with a global range of 1. When the same kernel code is run on the CPU instead of through ROCm and OpenCL, the results are correct.

This standalone project makes it possible to isolate the optimization bug and to check whether the output is correct: perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b.tar.gz

FFTW, Boost and CMake are required to run the standalone app.

Dantali0n commented 3 years ago

I noticed that the isolated project showcasing the bug has an error; here is an updated version:

perf-engineering-project-ard-seq.zip

lamb-j commented 1 year ago

This was reportedly fixed here: https://reviews.llvm.org/D82603

Report back if you're still having issues with this though!

Dantali0n commented 1 year ago

> This was reportedly fixed here: https://reviews.llvm.org/D82603
>
> Report back if you're still having issues with this though!

Could you briefly explain how this patch fixes this particular loop optimization? I think that would be most interesting.

ROCm / ROCm-CompilerSupport

Compiler incorrectly optimises away dependent code #36