ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License
699 stars 81 forks source link

AMD SI Cards weird results #254

Closed skn123 closed 5 years ago

skn123 commented 5 years ago

A bug with AMD SI cards was reported in this thread https://community.amd.com/message/2869393#comment-2869393

A solution was also proposed on that thread. If I were to implement that solution within vexcl how do I go about doing that? Essentially, with every kernel, a null kernel would need to passed. It can safely be ignored with an #ifdef declaration for other cards. I can show the results of an example that illustrate this case:

naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ ./complex_simple

  1. Hainan (AMD Accelerated Parallel Processing)

X Y = (0,16) (16,0) = (-5.62355e+303,7.1998e-304)i X Y = (1,15) (15,1) = (1.76736e+186,-2.529e-186)i X Y = (2,14) (14,2) = (-5.43986e-256,-7.4239e-199)i X Y = (3,13) (13,3) = (-1.15801e-125,-1.42689e-184)i X Y = (4,12) (12,4) = (-1.5799e-103,-2.30511e+235)i X Y = (5,11) (11,5) = (2.83923e+103,-2.783e+307)i X Y = (6,10) (10,6) = (-4.95723e+305,1.35145e+188)i X Y = (7,9) (9,7) = (-2.73677e-48,-0.00275755)i X Y = (8,8) (8,8) = (-2.26843e-106,-1.45955e-201)i X Y = (9,7) (7,9) = (1.79762e+106,2.97045e+201)i X Y = (10,6) (6,10) = (-nan,2.14326e-308)i X Y = (11,5) (5,11) = (-8.49166e-200,5.26253e+199)i X Y = (12,4) (4,12) = (-3.88897e+306,2.78449e+188)i X Y = (13,3) (3,13) = (1.69952e+184,1.77589e-234)i X Y = (14,2) (2,14) = (-1.94762e-104,-4.35661e+232)i X Y = (15,1) (1,15) = (-1.82884e-128,1.00333e-232)i X / Y = (0,16) / (16,0) = (0,256) X / Y = (1,15) / (15,1) = (0,226) X / Y = (2,14) / (14,2) = (0,200) X / Y = (3,13) / (13,3) = (0,178) X / Y = (4,12) / (12,4) = (0,160) X / Y = (5,11) / (11,5) = (0,146) X / Y = (6,10) / (10,6) = (0,136) X / Y = (7,9) / (9,7) = (0,130) X / Y = (8,8) / (8,8) = (0,128) X / Y = (9,7) / (7,9) = (0,130) X / Y = (10,6) / (6,10) = (0,136) X / Y = (11,5) / (5,11) = (0,146) X / Y = (12,4) / (4,12) = (0,160) X / Y = (13,3) / (3,13) = (0,178) X / Y = (14,2) / (2,14) = (0,200) X / Y = (15,1) / (1,15) = (0,226)

If I run it the second time naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ ./complex_simple

  1. Hainan (AMD Accelerated Parallel Processing)

X Y = (0,16) (16,0) = (0,1)i X Y = (1,15) (15,1) = (0.132743,0.99115)i X Y = (2,14) (14,2) = (0.28,0.96)i X Y = (3,13) (13,3) = (0.438202,0.898876)i X Y = (4,12) (12,4) = (0.6,0.8)i X Y = (5,11) (11,5) = (0.753425,0.657534)i X Y = (6,10) (10,6) = (0.882353,0.470588)i X Y = (7,9) (9,7) = (0.969231,0.246154)i X Y = (8,8) (8,8) = (1,0)i X Y = (9,7) (7,9) = (0.969231,-0.246154)i X Y = (10,6) (6,10) = (0.882353,-0.470588)i X Y = (11,5) (5,11) = (0.753425,-0.657534)i X Y = (12,4) (4,12) = (0.6,-0.8)i X Y = (13,3) (3,13) = (0.438202,-0.898876)i X Y = (14,2) (2,14) = (0.28,-0.96)i X Y = (15,1) (1,15) = (0.132743,-0.99115)i X / Y = (0,16) / (16,0) = (0,256) X / Y = (1,15) / (15,1) = (0,226) X / Y = (2,14) / (14,2) = (0,200) X / Y = (3,13) / (13,3) = (0,178) X / Y = (4,12) / (12,4) = (0,160) X / Y = (5,11) / (11,5) = (0,146) X / Y = (6,10) / (10,6) = (0,136) X / Y = (7,9) / (9,7) = (0,130) X / Y = (8,8) / (8,8) = (0,128) X / Y = (9,7) / (7,9) = (0,130) X / Y = (10,6) / (6,10) = (0,136) X / Y = (11,5) / (5,11) = (0,146) X / Y = (12,4) / (4,12) = (0,160) X / Y = (13,3) / (3,13) = (0,178) X / Y = (14,2) / (2,14) = (0,200) X / Y = (15,1) / (1,15) = (0,226)

The results look fine now. The result of clinfo is naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ clinfo Number of platforms: 1 Platform Profile: FULL_PROFILE Platform Version: OpenCL 2.1 AMD-APP (2639.3) Platform Name: AMD Accelerated Parallel Processing Platform Vendor: Advanced Micro Devices, Inc. Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing Number of devices: 1 Device Type: CL_DEVICE_TYPE_GPU Vendor ID: 1002h Board name: AMD Radeon (TM) R5 M330 Device Topology: PCI[ B#1, D#0, F#0 ] Max compute units: 5 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 256 Preferred vector width char: 4 Preferred vector width short: 2 Preferred vector width int: 1 Preferred vector width long: 1 Preferred vector width float: 1 Preferred vector width double: 1 Native vector width char: 4 Native vector width short: 2 Native vector width int: 1 Native vector width long: 1 Native vector width float: 1 Native vector width double: 1 Max clock frequency: 750Mhz Address bits: 64 Max memory allocation: 1596905472 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 16384 Max image 2D height: 16384 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 2048 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: Read/Write Cache line size: 64 Cache size: 16384 Global memory size: 2146349056 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Max pipe arguments: 0 Max pipe active reservations: 0 Max pipe packet size: 0 Max global variable size: 0 Max global variable preferred total size: 0 Max read/write image args: 0 Max on device events: 0 Queue on device max size: 0 Max on device queues: 0 Queue on device preferred size: 0 SVM capabilities:
Coarse grain buffer: No Fine grain buffer: No Fine grain system: No Atomics: No Preferred platform atomic alignment: 0 Preferred global atomic alignment: 0 Preferred local atomic alignment: 0 Kernel Preferred work group size multiple: 64 Error correction support: 0 Unified memory for Host and Device: 0 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities:
Execute OpenCL kernels: Yes Execute native function: No Queue on Host properties:
Out-of-Order: No Profiling : Yes Queue on Device properties:
Out-of-Order: No Profiling : No Platform ID: 0x7f5a050f49f0 Name: Hainan Vendor: Advanced Micro Devices, Inc. Device OpenCL C version: OpenCL C 1.2 Driver version: 2639.3 Profile: FULL_PROFILE Version: OpenCL 1.2 AMD-APP (2639.3) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

ddemidov commented 5 years ago

Can you please clarify what exactly is the workaround here? Is it enough to just declare a dummy kernel before or after the source of the real one? Or should the dummy kernel be called after the real one?

skn123 commented 5 years ago

The workaround is given here https://github.com/rdemaria/sixtracklib_gsoc18/blob/master/studies/study0/hello_workaround.cpp Lines 96-98 queue.enqueueNDRangeKernel( null, cl::NullRange, 1, cl::NullRange); // null kernel seems to solve the issue I think this is done after the main ones.

ddemidov commented 5 years ago

Ok, you can try to add the workaround after line 130 here:

https://github.com/ddemidov/vexcl/blob/c0f9213fb09008077789d841e63ba99a51f7b912/vexcl/backend/opencl/kernel.hpp#L128-L132

If this works for you, you'll probably need to guard it with a runtime condition (because you can have several cards in a vexcl context, and only some of those might need the workaround). You can use the command queue object to check if the workaround is necessary, something along these lines (you can use get_device() function to get the compute device associated with the queue):

https://github.com/ddemidov/vexcl/blob/c0f9213fb09008077789d841e63ba99a51f7b912/tests/fft.cpp#L117

skn123 commented 5 years ago

I tried this

/// Enqueue the kernel to the specified command queue.
        void operator()(const cl::CommandQueue &q) {
            q.enqueueNDRangeKernel(K, cl::NullRange, g_size, w_size);
            std::vector<cl::Platform> platformList;

                // Pick platform
                cl::Platform::get(&platformList);
            // NULL
            static const char sourceNull[] =
                "#if defined(cl_khr_fp64)\n"
                "#  pragma OPENCL EXTENSION cl_khr_fp64: enable\n"
                "#elif defined(cl_amd_fp64)\n"
                "#  pragma OPENCL EXTENSION cl_amd_fp64: enable\n"
                "#else\n"
                "#  error double precision is not supported\n"
                "#endif\n"
                "kernel void null(\n"
                "       )\n"
                "{\n"
                "}\n"
                ;

                // Pick first platform
                cl_context_properties cprops[] = {
                    CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
                cl::Context context(CL_DEVICE_TYPE_GPU, cprops);

                // Query the set of devices attched to the context
                std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

                // Create and program from source
                cl::Program::Sources sources(1, std::make_pair(sourceNull, 0));
                cl::Program program(context, sources);

                // Build program
                program.build(devices);

                // Create kernel object
                cl::Kernel kNull(program, "null");
            q.enqueueNDRangeKernel(kNull, cl::NullRange, g_size, w_size);
            argpos = 0;
        }

This is the only way I can insert that null kernel here given the limitations of the current API. If I do this, then during running the program, I am getting this:

naths@naths-HP-Laptop-15-bs1xx:~/build/vexcl/examples$ ./complex_simple 
1. Hainan (AMD Accelerated Parallel Processing)

terminate called after throwing an instance of 'cl::Error'
  what():  clEnqueueNDRangeKernel
Aborted (core dumped)

Am I doing something wrong here?

ddemidov commented 5 years ago

Something like 01550e2 should be enough to see if the workaround actually helps. Can you check that? Then we can think about limiting it to AMD SI GPUs.

skn123 commented 5 years ago

and bingo !

  1. Hainan (AMD Accelerated Parallel Processing)

X Y = (0,16) (16,0) = (0,256)i X Y = (1,15) (15,1) = (0,226)i X Y = (2,14) (14,2) = (0,200)i X Y = (3,13) (13,3) = (0,178)i X Y = (4,12) (12,4) = (0,160)i X Y = (5,11) (11,5) = (0,146)i X Y = (6,10) (10,6) = (0,136)i X Y = (7,9) (9,7) = (0,130)i X Y = (8,8) (8,8) = (0,128)i X Y = (9,7) (7,9) = (0,130)i X Y = (10,6) (6,10) = (0,136)i X Y = (11,5) (5,11) = (0,146)i X Y = (12,4) (4,12) = (0,160)i X Y = (13,3) (3,13) = (0,178)i X Y = (14,2) (2,14) = (0,200)i X Y = (15,1) (1,15) = (0,226)i X / Y = (0,16) / (16,0) = (0,1) X / Y = (1,15) / (15,1) = (0.132743,0.99115) X / Y = (2,14) / (14,2) = (0.28,0.96) X / Y = (3,13) / (13,3) = (0.438202,0.898876) X / Y = (4,12) / (12,4) = (0.6,0.8) X / Y = (5,11) / (11,5) = (0.753425,0.657534) X / Y = (6,10) / (10,6) = (0.882353,0.470588) X / Y = (7,9) / (9,7) = (0.969231,0.246154) X / Y = (8,8) / (8,8) = (1,0) X / Y = (9,7) / (7,9) = (0.969231,-0.246154) X / Y = (10,6) / (6,10) = (0.882353,-0.470588) X / Y = (11,5) / (5,11) = (0.753425,-0.657534) X / Y = (12,4) / (4,12) = (0.6,-0.8) X / Y = (13,3) / (3,13) = (0.438202,-0.898876) X / Y = (14,2) / (2,14) = (0.28,-0.96) X / Y = (15,1) / (1,15) = (0.132743,-0.99115)

ddemidov commented 5 years ago

So, how do you check if the workaround is required? Is 'Hainan' for a GPU name enough?

skn123 commented 5 years ago

actually yes. I would suggest it being within an ifdef statement. By default it is set to false in cmake and only if the user wants it gets enabled. Very localized. Hainan is one such name. The actual name as given by ubuntu is 01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Sun XT [Radeon HD 8670A/8670M/8690M / R5 M330] (rev 83)

ddemidov commented 5 years ago

Would you be able to implement it?

skn123 commented 5 years ago

I can implement the CMake part and can send you a PR. Another topic. Can such a workaround be implemented in ViennaCl also; given that we have a lot of examples in vexcl also?

skn123 commented 5 years ago

Unfortunately, the WIP does not show up in the forked code. I can only see it when I pull it from Master and do a reset. I think the changes are very minimal

IFDEF AMD_SI_WORKAROUND

....

ENDIF

and in cmake a cache variable that prompts the user to check it if his card is a SI GPU (and a reference to this issue) would suffice,

ddemidov commented 5 years ago

Please see if #255 is working for you. Re ViennaCL, you would have to open a separate issue there. I am not sure what is the best way to implement similar workaround in ViennaCL.

skn123 commented 5 years ago

I have posted a comment there. If you must go with naming then there are other SI card names also https://wiki.gentoo.org/wiki/AMDGPU that should be included. Else you can drop it and direct the user to that web page to check if his/her card is a Southern Island card. Otherwise the workaround is fine.

skn123 commented 5 years ago

There is another comment that I have. Suppose if one wants to enqueue multiple OpenCL kernels, then the original code is the correct way to do it. The workaround will fail there as we would be creating a Null kernel after every code. I think a new function would need to be defined (and the API for the current code to be changed) to call the Null kernel only when all other kernels have been passed to the device. Of course, the #VEXCL_AMD_SI_WORKAROUND will be present everywhere.

ddemidov commented 5 years ago

I don't see how the workaround can alter outcome of any computation. It just puts an empty kernel after each normal one into the command queue. Also, you can just disable the workaround and submit an empty kernel yourself at any time you like.

skn123 commented 5 years ago

Precisely. We would need something in the API to enable/disable placing the empty kernel. Currently it is enabled by default.

ddemidov commented 5 years ago

you either

  1. enable the workaround, which means the empty kernel is submitted after each operation, or
  2. you disable the workaround, create an empty kernel yourself (using vexcl custom kernel API), and submit it whenever you fill you need to.
skn123 commented 5 years ago

I am talking about 2. Currently, the only way I can enable the workaround is via an #ifdef. A cleaner way to implement this would be to add an auxiliary function to add the empty kernel (again protected by the #ifdef). Then whatever you say holds true and it does not affect anyone who do not need the workaround.

ddemidov commented 5 years ago

Currently adding an empty kernel is as simple as

vex::backend::kernel dummy(ctx.queue(0), "kernel void dummy() {}", "dummy");

and then you can submit it with

dummy(ctx.queue(0));

I would say it's easy enough to not invent a special API for this.

skn123 commented 5 years ago

Great... that should help :) I think this can be closed then