Open jeffhammond opened 7 years ago
I see the same thing in https://github.com/jeffhammond/PRK/blob/9fdcc953e8a962a9d13508e3a3a092c07c05fd45/Cxx11/transpose-cuda.cu so it is presumably a problem with the low-level implementation.
With CUDA 8.0, I don't see these issues any more, at least with OpenCL.
jrhammon@klondike:~/Work/PRK/github-official/Cxx11$ ./transpose-opencl 10 1296
./transpose-opencl: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./transpose-opencl)
./transpose-opencl: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./transpose-opencl)
Parallel Research Kernels version 2.16
C++11/OpenCL Matrix transpose: B = A^T
Available OpenCL platforms:
CL_PLATFORM_NAME=NVIDIA CUDA, CL_PLATFORM_VENDOR=NVIDIA Corporation (DEFAULT)
CL_DEVICE_NAME=GeForce GTX 960
CL_DEVICE_VENDOR=NVIDIA Corporation
CL_DEVICE_AVAILABLE=1
CL_DEVICE_TYPE=GPU
CL_DEVICE_MAX_COMPUTE_UNITS=8
CL_DEVICE_GLOBAL_MEM_SIZE=2090270720
CL_DEVICE_MAX_CLOCK_FREQUENCY=1228
CL_DEVICE_MAX_MEM_ALLOC_SIZE=522567680
CL_DEVICE_LOCAL_MEM_SIZE=49152
CL_DEVICE_EXTENSIONS contains cl_khr_fp64
CL_PLATFORM_NAME=Intel(R) OpenCL, CL_PLATFORM_VENDOR=Intel(R) Corporation
CL_DEVICE_NAME=Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
CL_DEVICE_VENDOR=Intel(R) Corporation
CL_DEVICE_AVAILABLE=1
CL_DEVICE_TYPE=CPU
CL_DEVICE_MAX_COMPUTE_UNITS=16
CL_DEVICE_GLOBAL_MEM_SIZE=16645246976
CL_DEVICE_MAX_CLOCK_FREQUENCY=3000
CL_DEVICE_MAX_MEM_ALLOC_SIZE=4161311744
CL_DEVICE_LOCAL_MEM_SIZE=32768
CL_DEVICE_EXTENSIONS contains cl_khr_fp64
Matrix order = 1296
Number of iterations = 10
CPU Precision = 64-bit
Solution validates
Rate (MB/s): 15035.8 Avg time (s): 0.00178733
GPU Precision = 64-bit
Solution validates
Rate (MB/s): 20127.7 Avg time (s): 0.00133517
OpenCL transpose breaks for matrices of order 1296 or greater with the NVIDIA OpenCL implementation. This is NVIDIA-specific, because the Intel OpenCL implementation works fine for much larger matrices.
There may be a device property that I can query to know in advance that this problem will appear.
CL_DEVICE_ADDRESS_BITS exists, but if the problem were 32-bit indexing, it should not manifest at order 1296 (where each matrix is only 12.8 MiB).
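
For reference, the size arithmetic behind that estimate can be sketched as follows. This is only a sanity check, not code from PRK: the helper names are hypothetical, it assumes an 8-byte double (the run reports 64-bit precision), and the allocation limit is the CL_DEVICE_MAX_MEM_ALLOC_SIZE value reported for the GeForce GTX 960 in the log above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes in one order x order matrix of doubles.
// Assumes sizeof(double) == 8, matching the 64-bit precision in the log.
constexpr std::uint64_t matrix_bytes(std::uint64_t order) {
    return order * order * sizeof(double);
}

// Hypothetical helper: true if every byte of the matrix can be reached
// with an unsigned 32-bit byte offset.
constexpr bool fits_32bit_offsets(std::uint64_t order) {
    return matrix_bytes(order) <= 0xFFFFFFFFull;
}

// CL_DEVICE_MAX_MEM_ALLOC_SIZE as reported for the GeForce GTX 960.
constexpr std::uint64_t gtx960_max_alloc = 522567680ull;

// Runtime restatement of the numbers quoted in this issue.
inline void check_order_1296() {
    assert(matrix_bytes(1296) == 13436928ull);      // about 12.8 MiB
    assert(fits_32bit_offsets(1296));               // far below 4 GiB
    assert(matrix_bytes(1296) < gtx960_max_alloc);  // far below the per-allocation limit
}
```

By this estimate, neither 32-bit offsets nor the per-allocation limit accounts for a failure at order 1296, so the cause is presumably something else.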