boostorg / compute

A C++ GPU Computing Library for OpenCL
http://boostorg.github.io/compute/
Boost Software License 1.0

Sort algorithms fail running on AMD Radeon RX Vega 56 #811

Closed rosenrodt closed 5 years ago

rosenrodt commented 5 years ago

Similar to the issue reported in #795, I am facing all kinds of sort() failures with AMD RX Vega GPUs.

I ran the tests with the command ctest --output-on-failure. Here is the summary:

Test result for driver Adrenalin 18.5.2 & 18.9.1

The following tests FAILED:
    54 - algorithm.radix_sort (Failed)
    55 - algorithm.radix_sort_by_key (Failed)
    75 - algorithm.sort_by_key (Failed)
    77 - algorithm.stable_sort (Failed)
    143 - misc.amd_cpp_kernel_language (Failed)
    148 - example.amd_cpp_kernel (Exit code 0xc0000409)

Test result for driver Adrenalin 18.12.2

The following tests FAILED:
    41 - algorithm.insertion_sort (Failed) *
    45 - algorithm.merge_sort_gpu (Failed) *
    54 - algorithm.radix_sort (Failed)
    55 - algorithm.radix_sort_by_key (Failed)
    74 - algorithm.sort (Failed) *
    75 - algorithm.sort_by_key (Failed)
    77 - algorithm.stable_sort (Failed)
    143 - misc.amd_cpp_kernel_language (Failed)
    148 - example.amd_cpp_kernel (Exit code 0xc0000409)

Items marked with * failed on the new driver but not on the older drivers.

Curiously, with the latest driver it gets even worse.
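For reference, the regressions can be derived mechanically from the two summaries. The following is a minimal sketch using only the C++ standard library; the helper `newly_failing` and the variable names are hypothetical and exist only for illustration, while the test names are copied from the ctest output above.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the tests that fail in `newer` but not in
// `older`, in lexicographic order.
std::vector<std::string> newly_failing(const std::set<std::string>& older,
                                       const std::set<std::string>& newer) {
    std::vector<std::string> out;
    std::set_difference(newer.begin(), newer.end(),
                        older.begin(), older.end(),
                        std::back_inserter(out));
    return out;
}

// Failures reported for Adrenalin 18.5.2 and 18.9.1.
const std::set<std::string> failed_older = {
    "algorithm.radix_sort", "algorithm.radix_sort_by_key",
    "algorithm.sort_by_key", "algorithm.stable_sort",
    "misc.amd_cpp_kernel_language", "example.amd_cpp_kernel"};

// Failures reported for Adrenalin 18.12.2.
const std::set<std::string> failed_newer = {
    "algorithm.insertion_sort", "algorithm.merge_sort_gpu",
    "algorithm.radix_sort", "algorithm.radix_sort_by_key",
    "algorithm.sort", "algorithm.sort_by_key", "algorithm.stable_sort",
    "misc.amd_cpp_kernel_language", "example.amd_cpp_kernel"};
```

Calling `newly_failing(failed_older, failed_newer)` yields the three tests that regressed with 18.12.2: algorithm.insertion_sort, algorithm.merge_sort_gpu and algorithm.sort.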

Note 1. For complete failure reports look here.
Note 2. As I recall, AMD Radeon HD 6770 passes every test (maybe except for the amd_cpp_kernel tests)

rosenrodt commented 5 years ago

Back with a quick update: Adrenalin 18.12.2 doesn't seem like a quality driver, so let's just ignore it. I believe this issue is caused by pointer aliasing inside the radix sort kernels. I will post a PR as soon as this has been confirmed.

jszuppe commented 5 years ago

Which pointers in the loop are aliasing each other?

rosenrodt commented 5 years ago

I figured out that it's not pointer aliasing in the scan() kernel I checked (see the work-in-progress PR #812). Rather, a variable dependency is not being detected by the AMD OpenCL compiler.

On the other hand, I am seeing a really interesting bug with the AMD driver. The comparison operators for char and uchar types are not working, so all the errors I get from the tests are emitted by is_sorted() when testing on char types. Fortunately, the equality checks for sorted char arrays all pass.
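To make the distinction between the two kinds of checks concrete, here is a minimal host-side sketch in standard C++. It is not the Boost.Compute test code and does not touch the device: both helper names are hypothetical, and the `device_result` argument stands in for data copied back from the GPU. The point is only that an ordering check depends on `operator<` for the element type, while an equality check against a host-sorted reference depends on `operator==`, so a broken comparison can fail the first while the second still passes.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: ordering check. Relies on operator< for T,
// similar in spirit to what an is_sorted() check does.
template <typename T>
bool host_is_sorted(const std::vector<T>& values) {
    return std::is_sorted(values.begin(), values.end());
}

// Hypothetical helper: equality check. Sorts a copy of the input on the
// host and compares it element by element with the device output, so the
// final comparison only needs operator==.
template <typename T>
bool matches_host_reference(std::vector<T> input,
                            const std::vector<T>& device_result) {
    std::sort(input.begin(), input.end());
    return input == device_result;
}
```

For example, with input `{3, -5, 127, 0, -128}` of type `signed char` and a device result of `{-128, -5, 0, 3, 127}`, both checks succeed on the host. If only the ordering check fails when run on the device, the sort output is likely correct and the comparison itself is the suspect.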

jszuppe commented 5 years ago

Can you check Adrenalin 19.1.1?

jszuppe commented 5 years ago

It would also be great to open an issue with AMD.

Btw, I guess we should disable/remove the AMD C++ tests. They're probably not supported in new OpenCL drivers.

rosenrodt commented 5 years ago

> Can you check Adrenalin 19.1.1?

No luck with 19.1.1 either.

jszuppe commented 5 years ago

Fixed in #812. @rosenrodt, are you planning to open a bug for the AMD driver?

rosenrodt commented 5 years ago

@jszuppe I am, but I am not sure where to open a support ticket. Any suggestions?

jszuppe commented 5 years ago

I think https://community.amd.com/community/devgurus/opencl is the best place for that. However, they may ask you to make a small, independent program for bug reproduction.

rosenrodt commented 5 years ago

I'll post on the forum sometime this week

rosenrodt commented 5 years ago

Back with some updates :)

I opened a ticket to report the driver bug that appears on Adrenalin 18.12.2 and onwards https://community.amd.com/message/2897171. The mod can't repro the issue on Hawaii GPUs (no surprise, as it only happens on recent GPUs) so he is passing it to the relevant teams

As for the memory fence workaround, I will post it as a separate bug report on the AMD forum.

rosenrodt commented 5 years ago

> I opened a ticket to report the driver bug that appears on Adrenalin 18.12.2 and onwards https://community.amd.com/message/2897171. The mod can't repro the issue on Hawaii GPUs (no surprise, as it only happens on recent GPUs) so he is passing it to the relevant teams

Though not directly confirmed by AMD staff yet, the issue of comparing char types using boost::compute::is_sorted() seems to be resolved as of Adrenalin 19.3.2. Both the standalone minimal test sample and Boost Compute master branch now work as expected.

So I consider this issue resolved.