cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

Running on 1 or 2 A100 GPUs gives memory out of bound error #70

Closed: hummingtree closed this 1 year ago

hummingtree commented 3 years ago

Compiling and running Cholla (Makefile and input deck attached) on 1 or 2 A100 GPUs gives the out-of-bounds memory error below (error message obtained with compute-sanitizer), and the program exits normally after very few (3) steps. Note that runs on 4 or 8 A100 GPUs, and on 1, 2, 4, or 8 V100 GPUs, are fine.

Are we doing something wrong here?

Cholla std output

Parameter values:  nx = 640, ny = 640, nz = 640, tout = 0.200000, init = Riemann, boundaries = 3 3 3 3 3 3
Output directory:  ./

Creating Log File: run_output.log

nproc_x 2 nproc_y 1 nproc_z 1
Allocating MPI communication buffers (nx = 8192000, ny = 4198400, nz = 4250880).
Local number of grid cells: 320 640 640 137728512
Setting initial conditions...
Initial conditions set.
Setting boundary conditions...
Boundary conditions set.
Dimensions of each cell: dx = 0.001563 dy = 0.001563 dz = 0.001563
Ratio of specific heats gamma = 1.400000
Nstep = 0  Timestep = 0.000000  Simulation time = 0.000000
Writing initial conditions to file...

Saving Snapshot: 0
Starting calculations.
n_step: 1   sim time:  0.0005282   sim timestep: 5.2822e-04  timestep time =  3599.491 ms   total time =   24.3431 s

n_step: 2   sim time:  0.1000000   sim timestep: 9.9472e-02  timestep time =   713.766 ms   total time =   25.0569 s

Saving Snapshot: 1
n_step: 3   sim time:  0.2000000   sim timestep: 1.0000e-01  timestep time =  1575.741 ms   total time =   41.1236 s

compute-sanitizer error

========= Invalid __global__ read of size 8 bytes
=========     at 0xe00 in Update_Conserved_Variables_3D_half(double*, double*, double*, double*, double*, int, int, int, int, double, double, double, double, double, int, double)
=========     by thread (233,0,0) in block (1335,0,0)
=========     Address 0x7f6bdd68ed48 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x25428a]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x115ab]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x618c0]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
=========     Host Frame: [0x2458f]
=========                in /cholla/cholla/cholla
=========     Host Frame: [0x151d2]
=========                in /cholla/cholla/cholla
=========     Host Frame: [0x5362]
=========                in /cholla/cholla/cholla
=========     Host Frame:__libc_start_main [0x270b3]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x62ae]
=========                in /cholla/cholla/cholla

Makefile

DIRS := src src/gravity src/particles src/cosmology src/cooling

CFILES := $(foreach DIR,$(DIRS),$(wildcard $(DIR)/*.c))
CPPFILES := $(foreach DIR,$(DIRS),$(wildcard $(DIR)/*.cpp))
GPUFILES := $(foreach DIR,$(DIRS),$(wildcard $(DIR)/*.cu))

OBJS := $(subst .c,.o,$(CFILES)) $(subst .cpp,.o,$(CPPFILES)) $(subst .cu,.o,$(GPUFILES))
CUOBJS := $(subst .cu,.o,$(GPUFILES))

#To use GPUs, CUDA must be turned on here
#Optional error checking can also be enabled
DFLAGS += -DCUDA #-DCUDA_ERROR_CHECK

#To use MPI, DFLAGS must also include -DMPI_CHOLLA
DFLAGS += -DMPI_CHOLLA -DBLOCK

#Set the MPI Processes grid [nproc_x, nproc_y, nproc_z]
#DFLAGS += -DSET_MPI_GRID

#Limit the number of steps
#DFLAGS += -DN_STEPS_LIMIT=26

# Single or double precision
#DFLAGS += -DPRECISION=1
DFLAGS += -DPRECISION=2

#Set output preferences
DFLAGS += -DOUTPUT
#DFLAGS += -DBINARY
DFLAGS += -DHDF5
DFLAGS += -DSLICES
#DFLAGS += -DPROJECTION
#DFLAGS += -DROTATED_PROJECTION

#Output all data every N_OUTPUT_COMPLETE snapshots ( These are Restart Files )
#DFLAGS += -DN_OUTPUT_COMPLETE=10

# Reconstruction
#DFLAGS += -DPCM
#DFLAGS += -DPLMP
#DFLAGS += -DPLMC
DFLAGS += -DPPMP
#DFLAGS += -DPPMC

# Riemann Solver
#DFLAGS += -DEXACT
#DFLAGS += -DROE
DFLAGS += -DHLLC

# Integrator
#DFLAGS += -DCTU
DFLAGS += -DVL

# Use Dual Energy Formalism
#DFLAGS += -DDE

# Evolve additional scalars
#DFLAGS += -DSCALAR

# Apply a minimum value to Conserved values
#DFLAGS += -DDENSITY_FLOOR
#DFLAGS += -DTEMPERATURE_FLOOR

# Average Slow cell when the cell delta_t is very small
#DFLAGS += -DAVERAGE_SLOW_CELLS

# Allocate GPU memory every timestep
#DFLAGS += -DDYNAMIC_GPU_ALLOC

# Set the cooling function
#DFLAGS += -DCOOLING_GPU 
#DFLAGS += -DCLOUDY_COOL

# Use Tiled Initial Conditions for Scaling Tests
#DFLAGS += -DTILED_INITIAL_CONDITIONS

# Print Initial Statistics
#DFLAGS += -DPRINT_INITIAL_STATS

# Print some timing stats
#DFLAGS += -DCPU_TIME

# Include FFT gravity
#DFLAGS += -DGRAVITY
#DFLAGS += -DPFFT
#DFLAGS += -DCUFFT
#DFLAGS += -DCOUPLE_GRAVITATIONAL_WORK
#DFLAGS += -DCOUPLE_DELTA_E_KINETIC
#DFLAGS += -DOUTPUT_POTENTIAL
#DFLAGS += -DGRAVITY_5_POINTS_GRADIENT

# Include Gravity From Particles PM
#DFLAGS += -DPARTICLES
#DFLAGS += -DPARTICLES_CPU
#DFLAGS += -DPARTICLES_GPU
#DFLAGS += -DONLY_PARTICLES
#DFLAGS += -DSINGLE_PARTICLE_MASS
#DFLAGS += -DPARTICLE_IDS

# Turn OpenMP on for CPU calculations
#DFLAGS += -DPARALLEL_OMP
#OMP_NUM_THREADS ?= 16
#DFLAGS += -DN_OMP_THREADS=$(OMP_NUM_THREADS)
#DFLAGS += -DPRINT_OMP_DOMAIN

# Cosmological simulation
#DFLAGS += -DCOSMOLOGY

# Use Grackle for cooling in cosmological simulations
#DFLAGS += -DCOOLING_GRACKLE

CC ?= cc
CXX ?= CC
CFLAGS += -g -Ofast
CXXFLAGS += -g -Ofast -std=c++14
CFLAGS += $(DFLAGS) -Isrc
CXXFLAGS += $(DFLAGS) -Isrc
GPUFLAGS += $(DFLAGS) -Isrc

ifeq ($(findstring -DPFFT,$(DFLAGS)),-DPFFT)
    CXXFLAGS += -I$(FFTW_ROOT)/include -I$(PFFT_ROOT)/include
    GPUFLAGS += -I$(FFTW_ROOT)/include -I$(PFFT_ROOT)/include
    LIBS += -L$(FFTW_ROOT)/lib -L$(PFFT_ROOT)/lib -lpfft -lfftw3_mpi -lfftw3
endif

ifeq ($(findstring -DCUFFT,$(DFLAGS)),-DCUFFT)
    LIBS += -lcufft
endif

ifeq ($(findstring -DHDF5,$(DFLAGS)),-DHDF5)
    CFLAGS += -I$(HDF5INCLUDE)
    CXXFLAGS += -I$(HDF5INCLUDE)
    GPUFLAGS += -I$(HDF5INCLUDE)
    LIBS += -L$(HDF5DIR) -lhdf5
endif

ifeq ($(findstring -DMPI_CHOLLA,$(DFLAGS)),-DMPI_CHOLLA)
    CC = mpicc
    CXX = mpicxx
    GPUFLAGS += -I$(MPI_HOME)/include
endif

ifeq ($(findstring -DCUDA,$(DFLAGS)),-DCUDA)
    GPUCXX := nvcc
    GPUFLAGS += --expt-extended-lambda -g -O3 -arch sm_80 -fmad=false
    LD := $(CXX)
    LDFLAGS := $(CXXFLAGS)
    LIBS += -L$(CUDA_DIR)/lib64 -lcudart
endif

ifeq ($(findstring -DCOOLING_GRACKLE,$(DFLAGS)),-DCOOLING_GRACKLE)
    DFLAGS += -DCONFIG_BFLOAT_8
    DFLAGS += -DOUTPUT_TEMPERATURE
    DFLAGS += -DOUTPUT_CHEMISTRY
    #DFLAGS += -DOUTPUT_ELECTRONS
    #DFLAGS += -DOUTPUT_FULL_IONIZATION
    #DFLAGS += -DOUTPUT_METALS
    DFLAGS += -DSCALAR
    DFLAGS += -DN_OMP_THREADS_GRACKLE=12
    CXXFLAGS += -I/ccs/proj/ast149/code/grackle/include
    LIBS += -L/ccs/proj/ast149/code/grackle/lib -lgrackle
endif

ifeq ($(findstring -DPARALLEL_OMP,$(DFLAGS)),-DPARALLEL_OMP)
    CXXFLAGS += -fopenmp
    LDFLAGS += -fopenmp
endif

.SUFFIXES: .c .cpp .cu .o

EXEC := cholla$(SUFFIX)

$(EXEC): $(OBJS) src/gpuCode.o
    $(LD) $(LDFLAGS) $(OBJS) src/gpuCode.o -o $(EXEC) $(LIBS)

%.o:    %.c
        $(CC) $(CFLAGS) -c $< -o $@

%.o:    %.cpp
        $(CXX) $(CXXFLAGS) -c $< -o $@

%.o:    %.cu
        $(GPUCXX) $(GPUFLAGS) --device-c -c $< -o $@

src/gpuCode.o:  $(CUOBJS)
        $(GPUCXX) -dlink $(GPUFLAGS) $(CUOBJS) -o src/gpuCode.o

.PHONY : clean

clean:
     rm -f $(OBJS) src/gpuCode.o $(EXEC)

Input deck

#
# Parameter File for 3D Sod Shock tube
#

################################################
# number of grid cells in the x dimension
nx=640
# number of grid cells in the y dimension
ny=640
# number of grid cells in the z dimension
nz=640
# final output time
tout=0.2
# time interval for output
outstep=0.1
# name of initial conditions
init=Riemann
#init=Read_Grid
# domain properties
xmin=0.0
ymin=0.0
zmin=0.0
xlen=1.0
ylen=1.0
zlen=1.0
# type of boundary conditions
xl_bcnd=3
xu_bcnd=3
yl_bcnd=3
yu_bcnd=3
zl_bcnd=3
zu_bcnd=3
# path to output directory
outdir=./

#################################################
# Parameters for 1D Riemann problems
# density of left state
rho_l=1.0
# velocity of left state
v_l=0
# pressure of left state
P_l=1.0
# density of right state
rho_r=0.1
# velocity of right state
v_r=0
# pressure of right state
P_r=0.1
# location of initial discontinuity
diaph=0.5
# value of gamma
gamma=1.4

evaneschneider commented 3 years ago

I am surprised it runs correctly on a single V100. That is a very large grid for a single GPU; there is probably not enough memory to run it all at once, so it is getting split up, and my best guess is that something is going wrong with the free-memory calculation for the A100s. What is the global memory size for the A100s and V100s you are running on?
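
For scale, here is a back-of-the-envelope estimate using only numbers from the log above: the local cell count per rank (137728512, which includes ghost cells) and the 5 conserved fields implied by the attached Makefile (-DDE and -DSCALAR are disabled). This is a standalone sketch, not Cholla code:

#include <cstdio>

int main() {
  const long long n_cells  = 137728512LL;  // local cells per rank, from the log (incl. ghost zones)
  const int       n_fields = 5;            // density, 3 momenta, energy (no -DDE / -DSCALAR)
  const double    bytes    = 8.0;          // -DPRECISION=2, i.e. double precision
  const double    gib      = 1024.0 * 1024.0 * 1024.0;
  // One copy of the conserved variables; the integrator keeps several device
  // buffers of comparable size, so the real footprint is a multiple of this.
  std::printf("one copy of the conserved variables: %.1f GiB\n",
              n_cells * n_fields * bytes / gib);
  return 0;
}

That is about 5 GiB for a single copy per rank, and the integrator needs several buffers of that size, so whether the volume gets split depends directly on how much global memory the device reports.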

evaneschneider commented 3 years ago

There is a function in "subgrid_routines_3D.cu" that checks for the amount of free memory available on the device and then determines the size of the sub-blocks to split the volume into for cases where the problem size is too big to fit in the GPU global memory. To my knowledge, this has not been tested on an A100, and it's possible it behaves differently than it did on the V100. If the sub-block size is too big, it could cause the kind of error you are seeing.
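
For readers unfamiliar with that routine, here is a minimal sketch of the idea being described. It is not the actual code in subgrid_routines_3D.cu; the buffer-count argument and the 0.9 safety margin are assumptions for illustration only:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: how many sub-volumes must the local grid be cut into so that the
// device buffers for one sub-volume fit in the memory currently free?
int Estimate_Subblock_Count(long long n_cells, int n_fields, int n_buffers) {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  double needed = (double)n_cells * n_fields * n_buffers * sizeof(double);
  int n_sub = (int)(needed / (0.9 * (double)free_bytes)) + 1;  // 0.9 = assumed safety margin
  std::printf("free %.1f GB, estimated need %.1f GB -> %d sub-block(s)\n",
              free_bytes / 1e9, needed / 1e9, n_sub);
  return n_sub;
}

If a calculation of this kind under-counts the buffers, or if the free-memory value is larger than what is actually usable, the sub-blocks come out too big and a kernel like Update_Conserved_Variables_3D_half can read past the end of its arrays, which would match the sanitizer report above.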

hummingtree commented 3 years ago

@evaneschneider In our runs the A100s have 80 GB of memory and the V100s have 16 GB.

Do you mean the function cudaMemGetInfo? I do not think it behaves differently between V100 and A100.
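
One quick way to check would be to print what each process actually sees at startup. A standalone sketch that uses only cudaGetDevice and cudaMemGetInfo:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int dev = 0;
  size_t free_bytes = 0, total_bytes = 0;
  cudaGetDevice(&dev);
  cudaMemGetInfo(&free_bytes, &total_bytes);
  // If free_bytes is already much smaller than total_bytes at this point
  // (e.g. several MPI ranks sharing one device), any sub-block sizing based
  // on this value will be off.
  std::printf("device %d: free %.2f GB / total %.2f GB\n",
              dev, free_bytes / 1e9, total_bytes / 1e9);
  return 0;
}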

evaneschneider commented 3 years ago

Yeah, that's the function. My guess is that if the code runs properly on 4 or 8 A100s but not on 1 or 2, then on 4 GPUs the grid no longer needs to be split up and everything works fine, but on 1 or 2 it does need to be split, and perhaps something is going wrong there. Assuming you are running the main branch, you could try checking this by printing out the "sub grid dimensions" in the file VL_3D_cuda.cu (that is, uncomment line 51).

alwinm commented 1 year ago

I marked this with a label so that future users know it might be a useful thread to look at. Closing as "not planned" for now; feel free to re-open if needed.