Line 132 of Cxx11/nstream-openmp-target.cc (https://github.com/ParRes/Kernels/blob/443540af80438f3de4b2786f08e7673d7e83af6c/Cxx11/nstream-openmp-target.cc#L132) maps nstream_time from the device to the host at the end of the target data region (map(from:nstream_time)). However, nstream_time is never set on the device (only the host modifies it), so the device-to-host copy overwrites the value the host computed, and the reported timing and bandwidth are wrong. Per the OpenMP spec, a variable mapped as from gets storage allocated on the device (left uninitialized), and at the end of the target data region whatever value the device copy holds is copied back to the host. Since nstream_time is never modified on the device, that value is not meaningful.
I'll note that I only noticed this when trying to run on a discrete GPU like a V100.
I think map(from:nstream_time) should just be removed from the target data region.
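To illustrate the proposed fix, here is a minimal sketch of a target data region that maps only the arrays and leaves the host-side timer unmapped. It is not the PRK source: timed_triad is a hypothetical helper, the kernel is a simplified triad, and std::chrono stands in for whatever timer the real code uses. If the compiler has no offload support the pragmas are ignored and the loop runs on the host.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PRK source): runs a simplified triad
// A = B + scalar * C inside a target data region and returns the elapsed
// time in seconds as measured on the host.
double timed_triad(std::vector<double>& A, const std::vector<double>& B,
                   const std::vector<double>& C, double scalar, int iterations)
{
    assert(A.size() == B.size() && A.size() == C.size());
    const std::size_t n = A.size();
    double* a = A.data();
    const double* b = B.data();
    const double* c = C.data();

    double nstream_time = 0.0;  // written by the host only

    // Only the arrays are mapped. nstream_time is deliberately absent:
    // map(from: nstream_time) would copy an uninitialized device value
    // over the host result when the region ends.
    #pragma omp target data map(tofrom: a[0:n]) map(to: b[0:n], c[0:n])
    {
        const auto t0 = std::chrono::steady_clock::now();
        for (int iter = 0; iter < iterations; ++iter) {
            #pragma omp target teams distribute parallel for
            for (std::size_t i = 0; i < n; ++i) {
                a[i] = b[i] + scalar * c[i];
            }
        }
        const auto t1 = std::chrono::steady_clock::now();
        // Host-side assignment; no device copy of this variable exists.
        nstream_time = std::chrono::duration<double>(t1 - t0).count();
    }
    return nstream_time;
}
```

Because nstream_time has no device copy, nothing is transferred for it when the region closes, and the value assigned by the host survives.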
Where does this bug appear?
Running on a system that offloads to an NVIDIA V100 with clang-ykt.
[bertoni@gpu02 Kernels]$ cat common/make.defs
#name of MPI C compiler, e.g. mpiicc, mpicc
MPICC=
#name of C compiler, e.g. icc, xlc, gcc
CC=clang
#name of MPI Fortran compiler, e.g. mpifort, mpif90
MPIF90=
#name of Fortran compiler, e.g. ifort, xlf_r, gfortran
FC=
#name of compile line flag enabling OpenMP, e.g. -openmp, -qopenmp, -fopenmp
OPENMPFLAG=-fopenmp
OFFLOADFLAG= -fopenmp-targets=nvptx64-nvidia-cuda
#default compiler optimization flags
DEFAULT_OPT_FLAGS:=
############################ OPTIONAL #########################
# Fortran 2008 coarrays flag, *including any library*
# ifort: -coarray=distributed, gfortran: -fcoarray(=single) or -fcoarray=lib -lcaf_mpi, crayftn: -h caf
COARRAYFLAG=
#name of C++ compiler (to be used in MPI context for Grappa), e.g. mpigxx, mpiicpc
CXX=clang++
#name of UPC compiler, e.g. gupc, cc, upcc
UPCC=
#name of compile line flag enabling UPC if necessary, e.g. -h upc
UPCFLAG=
#name of MPI C compiler (to be used in Fine-Grain MPI context), e.g. mpicc
FGMPICC=
#name of C compiler (to be used in MPI context of OpenSHMEM), e.g. $(MPICC)
SHMEMCC=
#location where Charm++ is installed, e.g. $(HOME)/charm/mpi-linux-x86_64-ifort-smp-mpicxx
CHARMTOP=
#location where Grappa is installed, e.g. $(GRAPPA_PREFIX) if you've done "source <grappa install dir>/bin/settings.sh"
GRAPPATOP=
#location where Fine-Grain MPI is installed, e.g. $(HOME)/fgmpi-install
FGMPITOP=
#location where OpenCoarrays is installed, e.g. $(HOME)/opencoarrays
OCAS=
#location where Legion is installed, e.g. $(HOME)/legion
LEGIONTOP=
#location where ULFM-enabled MPI is installed
ULFMTOP=
#location where Fenix is installed
FENIXTOP=
Output showing problem
[Kernels]$ cd Cxx11/
[Cxx11]$ clang++ -DPRKVERSION="2.16" nstream-openmp-target.cc -fopenmp -DUSE_OPENMP -fopenmp-targets=nvptx64-nvidia-cuda -o nstream-openmp-target
[Cxx11]$ ./nstream-openmp-target 1 100000
Parallel Research Kernels version 2.16
C++11/OpenMP TARGET STREAM triad: A = B + scalar * C
Number of threads = 88
Number of iterations = 1
Vector length = 100000
Offset = 0
Solution validates
Rate (MB/s): inf Avg time (s): 0
If map(from:nstream_time) is removed, the bandwidth and time make more sense:
[Cxx11]$ clang++ -DPRKVERSION="2.16" nstream-openmp-target.cc -fopenmp -DUSE_OPENMP -fopenmp-targets=nvptx64-nvidia-cuda -o nstream-openmp-target
[Cxx11]$ ./nstream-openmp-target 1 100000
Parallel Research Kernels version 2.16
C++11/OpenMP TARGET STREAM triad: A = B + scalar * C
Number of threads = 88
Number of iterations = 1
Vector length = 100000
Offset = 0
Solution validates
Rate (MB/s): 7411.25 Avg time (s): 0.000431776
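The two outputs are consistent with a rate computed as bytes moved divided by average time. The sketch below assumes four 8-byte accesses per vector element per iteration; that assumption is inferred from the numbers above (it reproduces the 7411.25 MB/s figure), not taken from the source, and rate_mb_per_s is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: converts a total time into a bandwidth in MB/s,
// assuming 4 accesses of sizeof(double) bytes per element per iteration.
double rate_mb_per_s(std::size_t length, double total_time_s, int iterations)
{
    assert(iterations > 0);
    const double avg_time = total_time_s / iterations;
    const double nbytes = 4.0 * static_cast<double>(length) * sizeof(double);
    if (avg_time <= 0.0) {
        // A zeroed timer gives an infinite rate (IEEE 754 division would
        // produce inf here as well; made explicit for clarity).
        return std::numeric_limits<double>::infinity();
    }
    return 1.0e-6 * nbytes / avg_time;
}
```

With length 100000 and one iteration, a time of 0.000431776 s gives roughly 7411.25 MB/s, while a time of 0 (the clobbered value) gives inf, matching the two runs shown.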
What type of issue is this?
This is about the line:
https://github.com/ParRes/Kernels/blob/443540af80438f3de4b2786f08e7673d7e83af6c/Cxx11/nstream-openmp-target.cc#L132
Compiler: clang-ykt (https://github.com/clang-ykt)