Minor bug in nstream-openmp-target causing nstream_time variable to be overwritten

colleeneb commented 5 years ago

What type of issue is this?

[x] Bug in the code or other problem
[ ] Inadequate/incorrect documentation
[ ] Feature request

This is about the line:

https://github.com/ParRes/Kernels/blob/443540af80438f3de4b2786f08e7673d7e83af6c/Cxx11/nstream-openmp-target.cc#L132

That line maps nstream_time from the device to the host at the end of the target data region (map(from:nstream_time)). However, nstream_time is never set on the device; only the host modifies it. Per the OpenMP spec, a variable mapped as from gets storage allocated on the device with an undefined initial value, and at the end of the target data region whatever value is on the device is copied back to the host. Since the device never writes nstream_time, that copy replaces the time measured on the host with a meaningless value, so the reported timing and bandwidth are incorrect.

I only noticed this when running on a discrete GPU such as a V100.

I think map(from:nstream_time) should just be removed from the target data region.
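
To illustrate the proposed fix, here is a minimal sketch of the timing pattern with nstream_time left out of every map clause. It is not the actual PRK source: wtime(), TriadResult, and timed_triad() are hypothetical names, and the initial values are assumptions chosen to make the result easy to check.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical host-side timer (stand-in for the PRK timer): seconds as a double.
static double wtime() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Hypothetical result type for this sketch.
struct TriadResult {
    double seconds;  // host-measured time for the timed iterations
    double a0;       // A[0] after all iterations, for validation
};

// Sketch of the timed triad. Iteration 0 is an untimed warm-up.
TriadResult timed_triad(std::size_t length, int iterations) {
    assert(iterations >= 1);
    std::vector<double> a(length, 0.0), b(length, 2.0), c(length, 2.0);
    double *A = a.data(), *B = b.data(), *C = c.data();
    const double scalar = 3.0;
    double nstream_time = 0.0;

    // nstream_time is deliberately NOT mapped: it is read and written only on
    // the host, so nothing is copied back over it when the region ends.
    #pragma omp target data map(tofrom: A[0:length]) map(to: B[0:length], C[0:length])
    {
        for (int iter = 0; iter <= iterations; ++iter) {
            if (iter == 1) nstream_time = wtime();

            #pragma omp target teams distribute parallel for
            for (std::size_t i = 0; i < length; ++i) {
                A[i] += B[i] + scalar * C[i];
            }
        }
        nstream_time = wtime() - nstream_time;
    }
    return TriadResult{nstream_time, a[0]};
}
```

Calling timed_triad(100000, 1) mirrors the run shown in this report. Adding map(from:nstream_time) to the target data directive would reintroduce the problem on a device with separate memory. With host fallback, or when the pragmas are ignored, the host and "device" copies are the same storage, which is consistent with the bug only showing up on a discrete GPU.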

If this is a bug report, please use the following template. Otherwise, please delete the rest of the template.

Where does this bug appear?

Running on a system that offloads to an NVIDIA V100 with clang-ykt.

Check all that apply:

[ ] MacOS
[x] Linux
[ ] Cray
[ ] GCC
[x] Clang
[ ] Intel compiler
[ ] MPICH and derivatives (MVAPICH2, Intel MPI, Cray MPI, etc.)
[ ] Open-MPI

Operating system

What is the output of uname -a?

Linux gpu02.ftm.alcf.anl.gov 3.10.0-862.14.4.el7.x86_64 #1 SMP Fri Sep 21 09:07:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Compiler

What is the output of ${COMPILER} -v or ${COMPILER} --version?

clang version 9.0.0

(clang-ykt, https://github.com/clang-ykt)

PRK build information

Please attach or inline make.defs.

[bertoni@gpu02 Kernels]$ cat common/make.defs
#name of MPI C compiler, e.g. mpiicc, mpicc
MPICC=

#name of C compiler, e.g. icc, xlc, gcc
CC=clang

#name of MPI Fortran compiler, e.g. mpifort, mpif90
MPIF90=

#name of Fortran compiler, e.g. ifort, xlf_r, gfortran
FC=

#name of compile line flag enabling OpenMP, e.g. -openmp, -qopenmp, -fopenmp
OPENMPFLAG=-fopenmp
OFFLOADFLAG= -fopenmp-targets=nvptx64-nvidia-cuda

#default compiler optimization flags
DEFAULT_OPT_FLAGS:=

############################   OPTIONAL #########################

# Fortran 2008 coarrays flag, *including any library*
# ifort: -coarray=distributed, gfortran: -fcoarray(=single) or -fcoarray=lib -lcaf_mpi, crayftn: -h caf
COARRAYFLAG=

#name of C++ compiler (to be used in MPI context for Grappa), e.g. mpigxx, mpiicpc
CXX=clang++

#name of UPC compiler, e.g. gupc, cc, upcc
UPCC=

#name of compile line flag enabling UPC if necessary, e.g. -h upc
UPCFLAG=

#name of MPI C compiler (to be used in Fine-Grain MPI context), e.g. mpicc
FGMPICC=

#name of C compiler (to be used in MPI context of OpenSHMEM), e.g. $(MPICC)
SHMEMCC=

#location where Charm++ is installed, e.g. $(HOME)/charm/mpi-linux-x86_64-ifort-smp-mpicxx
CHARMTOP=

#location where Grappa is installed, e.g. $(GRAPPA_PREFIX) if you've done "source <grappa install dir>/bin/settings.sh"
GRAPPATOP=

#location where Fine-Grain MPI is installed, e.g. $(HOME)/fgmpi-install
FGMPITOP=

#location where OpenCoarrays is installed, e.g. $(HOME)/opencoarrays
OCAS=

#location where Legion is installed, e.g. $(HOME)/legion
LEGIONTOP=

#location where ULFM-enabled MPI is installed
ULFMTOP=

#location where Fenix is installed
FENIXTOP=

Output showing problem

[Kernels]$ cd Cxx11/
[Cxx11]$ clang++ -DPRKVERSION="2.16" nstream-openmp-target.cc -fopenmp -DUSE_OPENMP -fopenmp-targets=nvptx64-nvidia-cuda -o nstream-openmp-target
[Cxx11]$  ./nstream-openmp-target 1 100000
Parallel Research Kernels version 2.16
C++11/OpenMP TARGET STREAM triad: A = B + scalar * C
Number of threads    = 88
Number of iterations = 1
Vector length        = 100000
Offset               = 0
Solution validates
Rate (MB/s): inf Avg time (s): 0

If map(from:nstream_time) is removed, the bandwidth and time make more sense:

[Cxx11]$ clang++ -DPRKVERSION="2.16" nstream-openmp-target.cc -fopenmp -DUSE_OPENMP -fopenmp-targets=nvptx64-nvidia-cuda -o nstream-openmp-target
[Cxx11]$  ./nstream-openmp-target 1 100000
Parallel Research Kernels version 2.16
C++11/OpenMP TARGET STREAM triad: A = B + scalar * C
Number of threads    = 88
Number of iterations = 1
Vector length        = 100000
Offset               = 0
Solution validates
Rate (MB/s): 7411.25 Avg time (s): 0.000431776
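
As a sanity check on the two outputs, the reported rate can be reproduced from the average time. The sketch below assumes the rate is computed as 4 array accesses of sizeof(double) bytes per element divided by the average time. That formula is an assumption, although it matches the 7411.25 MB/s above to within rounding. nstream_rate_mbs() is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Assumed formula: bytes moved = 4 * length * sizeof(double); rate in MB/s.
double nstream_rate_mbs(std::size_t length, double avgtime) {
    assert(avgtime >= 0.0);
    const double nbytes = 4.0 * static_cast<double>(length) * sizeof(double);
    // With IEEE-754 doubles, avgtime == 0 gives +infinity, which prints as "inf".
    return 1.0e-6 * nbytes / avgtime;
}
```

For length = 100000 and avgtime = 0.000431776 this gives about 7411.25, and an overwritten time of 0 gives inf, consistent with "Rate (MB/s): inf Avg time (s): 0" in the failing run.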

jeffhammond commented 5 years ago

Please try https://github.com/jeffhammond/PRK master branch, which should fix this issue in both C++ and Fortran codes.

colleeneb commented 5 years ago

That worked, thanks!
