DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance for distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.
Describe the bug
It seems DGEQRF (PTG at least) is broken for CUDA runs.
To Reproduce
Steps to reproduce the behavior:
1. Check out current master.
2. Compile with CUDA enabled (e.g. on leconte, use the modules hwloc cuda gcc openmpi gdb ninja cmake intel-mkl python) and let cmake detect everything.
3. Run ./tests/testing_dgeqrf -N 4096 -t 1024 -x -g 1
4. See error.
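The steps above could be scripted roughly as follows. This is only a sketch, not something taken from the DPLASMA documentation: the `step` helper and the `REPRO_RUN` variable are hypothetical, the `out/build/Debug` directory is read off the shell prompt in the log, the Debug build type is inferred from that path, the submodule update assumes PaRSEC is checked out as a git submodule, and `module load` assumes an environment-modules setup like the one on leconte. By default the script only prints the commands; set `REPRO_RUN=1` to actually execute them from the top of the source tree.

```shell
#!/bin/sh
# Sketch of the reproduction steps. Dry run by default: commands are
# printed, and only executed when REPRO_RUN=1 (hypothetical variable).
set -e

# Hypothetical helper: print a command, then run it if REPRO_RUN=1.
step() {
    printf '+ %s\n' "$*"
    if [ "${REPRO_RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# 1. Check out current master (submodule update is an assumption).
step git checkout master
step git pull --ff-only
step git submodule update --init --recursive

# 2. Load the toolchain (site-specific; needs the `module` shell function)
#    and let cmake detect everything, so no CUDA flags are passed here.
step module load hwloc cuda gcc openmpi gdb ninja cmake intel-mkl python
step cmake -G Ninja -S . -B out/build/Debug -DCMAKE_BUILD_TYPE=Debug
step cmake --build out/build/Debug

# 3. Run the failing test with one GPU (-g 1), as in the report.
step cd out/build/Debug
step ./tests/testing_dgeqrf -N 4096 -t 1024 -x -g 1
```

Keeping the script in dry-run mode by default makes it safe to paste into a report or adapt to another machine before anything is built.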
Expected behavior
The test should run to completion. Instead, the CUDA driver complains of misaligned memory accesses and bails out:
~/dplasma/out/build/Debug $ ./tests/testing_dgeqrf -N 4096 -t 1024 -x -g 1
W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: The binding defined by --parsec_bind has been ignored!
This option requires a build with HWLOC with bitmap support.
#+++++ cores detected : 80
#+++++ nodes x cores + gpu : 1 x 80 + 1 (80+1)
#+++++ thread mode : THREAD_SERIALIZED
#+++++ P x Q : 1 x 1 (1/1)
#+++++ M x N x K|NRHS : 4096 x 4096 x 1
#+++++ MB x NB , IB : 1024 x 1024 , 32
#+++++ KP x KQ : 4 x 1
W@00000 /home/herault/dplasma/parsec/parsec/mca/device/cuda/device_cuda_module.c:2012 (progress_stream) cudaEventQuery an illegal memory access was encountered
W@00000 Critical issue related to the GPU discovered. Giving up