OpenCL errors for larger matrices w/ NVIDIA implementation

jeffhammond commented 7 years ago

The OpenCL transpose fails validation for matrices of order 1296 or greater with the NVIDIA OpenCL implementation. This appears to be NVIDIA-specific, because the Intel OpenCL implementation works for much larger matrices.

There may be a device property that I can query to detect this problem in advance. CL_DEVICE_ADDRESS_BITS exists, but if the problem were 32-bit indexing, it should not manifest at order 1296 (where each matrix is only 12.8 MiB).
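The size estimate above can be checked with a little arithmetic. The following is a minimal sketch, assuming 8-byte double-precision elements; the helper names `matrix_bytes`, `byte_offsets_fit_32bit`, and `fits_in_allocation` are hypothetical and are not part of the PRK code.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one order x order matrix of doubles (assumes 8-byte elements).
std::uint64_t matrix_bytes(std::uint64_t order) {
    assert(order <= 1000000000ULL);  // keeps order * order * 8 inside 64 bits
    return order * order * sizeof(double);
}

// True if every byte offset into the matrix is representable in 32 bits,
// i.e. the matrix occupies at most 2^32 bytes.
bool byte_offsets_fit_32bit(std::uint64_t order) {
    return matrix_bytes(order) <= (std::uint64_t{1} << 32);
}

// True if one matrix fits in a single allocation of max_alloc bytes,
// where max_alloc would come from CL_DEVICE_MAX_MEM_ALLOC_SIZE.
bool fits_in_allocation(std::uint64_t order, std::uint64_t max_alloc) {
    return matrix_bytes(order) <= max_alloc;
}
```

For order 1296, `matrix_bytes` gives 13436928 bytes (about 12.8 MiB). Under these assumptions, 32-bit byte offsets would only become a concern above order 23170, so neither address width nor allocation size explains a failure at 1296.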

jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1295
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platform: NVIDIA CUDA
Available OpenCL platform: Intel(R) OpenCL
Matrix order          = 1295
Number of iterations  = 10
Solution validates
Rate (MB/s): 12611.9 Avg time (s): 0.00106378

jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1296
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platform: NVIDIA CUDA
Available OpenCL platform: Intel(R) OpenCL
Matrix order          = 1296
Number of iterations  = 10
ERROR: Aggregate squared error 1896 exceeds threshold 1e-08
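One way to narrow down a failure like this is to compare the device result against a host-side reference. The following is a minimal sketch of that idea, not the validation code PRK actually uses; the function names `transpose_reference` and `aggregate_squared_error` are hypothetical, and row-major storage is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive host-side out-of-place transpose of an n x n row-major matrix: B = A^T.
std::vector<double> transpose_reference(const std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    std::vector<double> b(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            b[j * n + i] = a[i * n + j];
        }
    }
    return b;
}

// Sum of squared element-wise differences between a result and its reference.
double aggregate_squared_error(const std::vector<double>& got,
                               const std::vector<double>& expected) {
    assert(got.size() == expected.size());
    double err = 0.0;
    for (std::size_t k = 0; k < got.size(); ++k) {
        const double d = got[k] - expected[k];
        err += d * d;
    }
    return err;
}
```

Copying the device buffer back to the host and passing it as `got`, with the output of `transpose_reference` as `expected`, would show whether the error is spread across the matrix or confined to a few elements.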

jeffhammond commented 7 years ago

I see the same failure in the CUDA version, https://github.com/jeffhammond/PRK/blob/9fdcc953e8a962a9d13508e3a3a092c07c05fd45/Cxx11/transpose-cuda.cu, so it is presumably a problem in the lower-level NVIDIA implementation rather than in OpenCL itself.

jeffhammond commented 7 years ago

With CUDA 8.0, I don't see these issues any more, at least with OpenCL.

jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1296
./transpose-opencl: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./transpose-opencl)
./transpose-opencl: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./transpose-opencl)
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platforms: 
CL_PLATFORM_NAME=NVIDIA CUDA, CL_PLATFORM_VENDOR=NVIDIA Corporation (DEFAULT)
   CL_DEVICE_NAME=GeForce GTX 960
   CL_DEVICE_VENDOR=NVIDIA Corporation
   CL_DEVICE_AVAILABLE=1
   CL_DEVICE_TYPE=GPU
   CL_DEVICE_MAX_COMPUTE_UNITS=8
   CL_DEVICE_GLOBAL_MEM_SIZE=2090270720
   CL_DEVICE_MAX_CLOCK_FREQUENCY=1228
   CL_DEVICE_MAX_MEM_ALLOC_SIZE=522567680
   CL_DEVICE_LOCAL_MEM_SIZE=49152
   CL_DEVICE_EXTENSIONS contains cl_khr_fp64

CL_PLATFORM_NAME=Intel(R) OpenCL, CL_PLATFORM_VENDOR=Intel(R) Corporation
   CL_DEVICE_NAME=Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
   CL_DEVICE_VENDOR=Intel(R) Corporation
   CL_DEVICE_AVAILABLE=1
   CL_DEVICE_TYPE=CPU
   CL_DEVICE_MAX_COMPUTE_UNITS=16
   CL_DEVICE_GLOBAL_MEM_SIZE=16645246976
   CL_DEVICE_MAX_CLOCK_FREQUENCY=3000
   CL_DEVICE_MAX_MEM_ALLOC_SIZE=4161311744
   CL_DEVICE_LOCAL_MEM_SIZE=32768
   CL_DEVICE_EXTENSIONS contains cl_khr_fp64

Matrix order          = 1296
Number of iterations  = 10
CPU Precision         = 64-bit
Solution validates
Rate (MB/s): 15035.8 Avg time (s): 0.00178733
GPU Precision         = 64-bit
Solution validates
Rate (MB/s): 20127.7 Avg time (s): 0.00133517
