testing_dgeqrf fails in CUDA runs

Describe the bug

It seems DGEQRF (PTG at least) is broken for CUDA runs

To Reproduce

Steps to reproduce the behavior:

Checkout current master
Compile with CUDA enabled (e.g. use modules hwloc cuda gcc openmpi gdb ninja cmake intel-mkl python on leconte) and let cmake detect everything
Run ./tests/testing_dgeqrf -N 4096 -t 1024 -x -g 1
See error

Expected behavior

The CUDA driver complains of misaligned memory accesses and bails out

~/dplasma/out/build/Debug $ ./tests/testing_dgeqrf -N 4096 -t 1024 -x -g 1
W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: The binding defined by --parsec_bind has been ignored!
    This option requires a build with HWLOC with bitmap support.
#+++++ cores detected       : 80
#+++++ nodes x cores + gpu  : 1 x 80 + 1 (80+1)
#+++++ thread mode          : THREAD_SERIALIZED
#+++++ P x Q                : 1 x 1 (1/1)
#+++++ M x N x K|NRHS       : 4096 x 4096 x 1
#+++++ MB x NB , IB         : 1024 x 1024 , 32
#+++++ KP x KQ              : 4 x 1
W@00000 /home/herault/dplasma/parsec/parsec/mca/device/cuda/device_cuda_module.c:2012 (progress_stream) cudaEventQuery an illegal memory access was encountered
W@00000 Critical issue related to the GPU discovered. Giving up

ICLDisco / dplasma

testing_dgeqrf fails in CUDA runs #70

Describe the bug

To Reproduce

Expected behavior