waredjeb closed this issue 5 years ago.
@waredjeb thanks for your report. I am not sure if it is relevant to the issue, but just in case, can you please provide the OS, compiler, CUDA version, and CMake options used for alpaka?
Edit: using the develop branch is correct.
@sbastrakov thanks for the quick reply.
I'm on CentOS 7.6 with CUDA 10.1. I'm compiling with nvcc and gcc, gcc version 8.3.1.
Compilation flags:
CXXFLAGS="-m64 -std=c++14 -g -O2 -DALPAKA_DEBUG=0 -I$CUDA_ROOT/include -I$ALPAKA_ROOT/include
HOST_FLAGS="-fopenmp -pthread -fPIC -ftemplate-depth-512 -Wall -Wextra -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-local-typedefs -Wno-attributes -Wno-reorder -Wno-sign-compare"
NVCC_FLAGS="-ccbin $CXX -w -lineinfo --expt-extended-lambda --expt-relaxed-constexpr --generate-code arch=compute_50,code=sm_50 --use_fast_math --ftz=false --cudart shared"
I am not sure this way enables the CUDA backend of alpaka; we would normally enable it via the CMake configuration. If this is indeed the issue (which I am not sure about), including <alpaka/standalone/GpuCudaRt.hpp> before the other alpaka includes might help, as in the sketch below.
cc @psychocoderHPC @BenjaminW3 you probably have a better idea than I do.
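A minimal sketch of that include order, assuming a build that bypasses alpaka's CMake configuration; the standalone header name comes from the suggestion above, while using <alpaka/alpaka.hpp> as the main header is an assumption about the rest of the code:

```cpp
// Sketch only: the standalone header selects the CUDA runtime back-end and
// must come before any other alpaka include.
#include <alpaka/standalone/GpuCudaRt.hpp>
#include <alpaka/alpaka.hpp>
```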
The error description reads: "This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration."
From this description my guess would be that the GPU selected via alpaka::pltf::getDevByIdx<Pltf>(0u) does not support the sm_50 given on the command line.
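One way to verify this guess independently of alpaka is to query the compute capability of device 0 with the plain CUDA runtime; this is a hypothetical check, not part of the original report:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop{};
    if(cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
    {
        std::printf("failed to query device 0\n");
        return 1;
    }
    // A Tesla T4 reports 7.5 here; a binary that contains only sm_50
    // machine code (and no compatible PTX) cannot run on such a device.
    std::printf("device 0 compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```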
@waredjeb Can you run CMake again with -DALPAKA_DEBUG=2? This should add extended traces to the output (and make everything slower) so that we can see which GPU was selected.
@BenjaminW3 here is the output:
TEST MULTIPLY
Product of type 2 x 2 * 2 x 2
[+] getDevByIdx
[+] getDevCount
[-] getDevCount
[+] printDeviceProperties
name: Tesla T4
totalGlobalMem: 15079 MiB
sharedMemPerBlock: 48 KiB
regsPerBlock: 65536
warpSize: 32
memPitch: 2147483647 B
maxThreadsPerBlock: 1024
maxThreadsDim[3]: (1024, 1024, 64)
maxGridSize[3]: (2147483647, 65535, 65535)
clockRate: 1590000 kHz
totalConstMem: 64 KiB
major: 7
minor: 5
textureAlignment: 512
texturePitchAlignment: 32
multiProcessorCount: 40
kernelExecTimeoutEnabled: 0
integrated: 0
canMapHostMemory: 1
computeMode: 0
maxTexture1D: 131072
maxTexture1DLinear: 134217728
maxTexture2D[2]: 131072x65536
maxTexture2DLinear[3]: 131072x65000x2097120
maxTexture2DGather[2]: 32768x32768
maxTexture3D[3]: 16384x16384x16384
maxTextureCubemap: 32768
maxTexture1DLayered[2]: 32768x2048
maxTexture2DLayered[3]: 32768x32768x2048
maxTextureCubemapLayered[2]: 32768x2046
maxSurface1D: 32768
maxSurface2D[2]: 131072x65536
maxSurface3D[3]: 16384x16384x16384
maxSurface1DLayered[2]: 32768x2048
maxSurface2DLayered[3]: 32768x32768x2048
maxSurfaceCubemap: 32768
maxSurfaceCubemapLayered[2]: 32768x2046
surfaceAlignment: 512
concurrentKernels: 1
ECCEnabled: 1
pciBusID: 2
pciDeviceID: 0
pciDomainID: 0
tccDriver: 0
asyncEngineCount: 3
unifiedAddressing: 1
memoryClockRate: 5001000 kHz
memoryBusWidth: 256 b
l2CacheSize: 4194304 B
maxThreadsPerMultiProcessor: 1024
[-] printDeviceProperties
[-] getDevByIdx
[+] QueueCudaRtAsyncImpl
[-] QueueCudaRtAsyncImpl
[+] getDevByIdx
[+] getDevCount
[-] getDevCount
[-] getDevByIdx
[+] alloc
[+] BufCpuImpl
BufCpuImpl e: (1) ptr: 0x28e2780 pitch: 32
[-] BufCpuImpl
[-] alloc
[+] alloc
[+] BufCpuImpl
BufCpuImpl e: (1) ptr: 0x28e2800 pitch: 32
[-] BufCpuImpl
[-] alloc
[+] alloc
[+] BufCpuImpl
BufCpuImpl e: (1) ptr: 0x28e2880 pitch: 32
[-] BufCpuImpl
[-] alloc
Matrix 2x2
Matrix(0,0) = 0.814927
Matrix(0,1) = 1.884702
Matrix(1,0) = 1.981513
Matrix(1,1) = 1.664239
Matrix 2x2
Matrix(0,0) = 1.058731
Matrix(0,1) = 1.997586
Matrix(1,0) = 0.550919
Matrix(1,1) = 0.873779
Matrix 2x2
Matrix(0,0) = 0.000000
Matrix(0,1) = 0.000000
Matrix(1,0) = 0.000000
Matrix(1,1) = 0.000000
[+] alloc
alloc ew: 1 ewb: 32 ptr: 0x7fd5d7000000
[+] BufCudaRt
[-] BufCudaRt
[-] alloc
[+] alloc
alloc ew: 1 ewb: 32 ptr: 0x7fd5d7000200
[+] BufCudaRt
[-] BufCudaRt
[-] alloc
[+] alloc
alloc ew: 1 ewb: 32 ptr: 0x7fd5d7000400
[+] BufCudaRt
[-] BufCudaRt
[-] alloc
[+] createTaskCopy
[-] createTaskCopy
[+] enqueue
printDebug ddev: 0 ew: 1 ewb: 32 dw: 1 dptr: 0x7fd5d7000000 sdev: 0 sw: 1 sptr: 0x28e2780
[-] enqueue
[+] createTaskCopy
[-] createTaskCopy
[+] enqueue
printDebug ddev: 0 ew: 1 ewb: 32 dw: 1 dptr: 0x7fd5d7000200 sdev: 0 sw: 1 sptr: 0x28e2800
[-] enqueue
[+] createTaskCopy
[-] createTaskCopy
[+] enqueue
printDebug ddev: 0 ew: 1 ewb: 32 dw: 1 dptr: 0x7fd5d7000400 sdev: 0 sw: 1 sptr: 0x28e2880
[-] enqueue
createTaskKernel gridBlockExtent: (1), blockThreadExtent: (1)
[+] enqueue
enqueue gridDim: 1 1 1 blockDim: 1 1 1
enqueue BlockSharedMemDynSizeBytes: 0 B
enqueue binaryVersion: 0 constSizeBytes: 0 B localSizeBytes: 0 B maxThreadsPerBlock: 0 numRegs: 0 ptxVersion: 0 sharedSizeBytes: 0 B
/data/user/wredjeb/cupla/alpaka/include/alpaka/kernel/TaskKernelGpuCudaRt.hpp(375) 'cudaSetDevice( queue.m_spQueueImpl->m_dev.m_iDevice)' A previous CUDA call (not this one) set the error : 'cudaErrorInvalidDeviceFunction': 'invalid device function'!
Illegal instruction
Where does the sm_50 on the command line come from? Are you explicitly setting this when calling CMake? Could you try to use -DALPAKA_CUDA_ARCH=75?
-DALPAKA_CUDA_ARCH=75 worked!
Thanks!
Thanks, I will close this now.
Keep in mind that each GPU requires a different architecture, so if you change the system this runs on, you have to change the CUDA device architecture as well. You can also compile for multiple architectures by using -DALPAKA_CUDA_ARCH="50;70;75". This allows the compiled code to run on all of those architectures, but it will also increase the compilation time.
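As an illustration only, a configure step along those lines could look like the following; the build directory and any further project options are placeholders rather than values taken from this thread:

```
cmake .. \
  -DALPAKA_CUDA_ARCH="50;70;75" \
  -DALPAKA_DEBUG=0
```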
Hello, I'm trying to port a simple piece of code to GPU-CUDARt. It is a simple matrix multiplication using the Eigen library. I can compile it, but I get the following error during execution:
terminate called after throwing an instance of 'std::runtime_error'
what(): /data/user/wredjeb/cupla/alpaka/include/alpaka/mem/buf/cuda/Copy.hpp(861) 'cudaSetDevice( iDstDev)' A previous CUDA call (not this one) set the error : 'cudaErrorNoKernelImageForDevice': 'no kernel image is available for execution on the device'!
Actually the error is raised only when I enqueue the task (alpaka::queue::enqueue(queue, TaskKernelGpuCudaRt);), but in both cases it seems that the kernel doesn't work, since it returns a final matrix containing only zeros. Below are the kernel and the piece of the header that I use in the kernel to fill and print the matrices.
HEADER
Multiplication KERNEL
Function that calls the kernel
MAIN
Error
I'm using the develop branch of alpaka. This is my first approach with alpaka; what am I doing wrong?