E3SM-Project / ACME-ECP

E3SM MMF for DoE ECP project

Managed Memory / MPS slowdown on Summit #94

Open mrnorman opened 5 years ago

mrnorman commented 5 years ago

We're having an issue where the GPU code slows down significantly when using more than 3 MPI tasks per GPU. The issue can be reproduced with the much smaller miniWeather code, so I'm putting the reproducer here. We have NVIDIA looking into it; I just wanted it documented here. The workaround for now is to thread the physics and run the CRM code from the master thread on the GPU.

########################################################
## Set up the environment (not on the login node, please)
########################################################
bsub -alloc_flags gpumps -W 120 -nnodes 1 -P stf006 -Is /bin/bash
cd /gpfs/alpine/somewhere-in-here
module load pgi/19.4 parallel-netcdf
export PNETCDF_PATH=$OLCF_PARALLEL_NETCDF_ROOT
git clone git@github.com:mrnorman/miniWeather.git
cd miniWeather/fortran

########################################################
## Test Managed Memory and MPS
########################################################
git checkout managed-mps-summit  # Use the managed memory version
make openacc
jsrun --nrs 1 --cpu_per_rs 21 --gpu_per_rs 1 --rs_per_host 1 --tasks_per_rs $NTASKS ./miniWeather_mpi_openacc
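The jsrun line above leaves $NTASKS to be set by hand. A minimal sketch of sweeping it from 1 to 7 follows. The DRY_RUN variable is a hypothetical addition, not part of the original reproducer: it defaults to printing each command so the loop can be previewed outside an allocation. The jsrun flags are copied unchanged from the command above.

```shell
# Sketch: sweep the number of MPI tasks sharing the single GPU from 1 to 7.
# DRY_RUN is a hypothetical switch (not in the original reproducer):
#   DRY_RUN=1 (default here) only prints each command,
#   DRY_RUN=0 actually launches it and must be used inside the batch job.
DRY_RUN=${DRY_RUN:-1}
for NTASKS in 1 2 3 4 5 6 7; do
  cmd="jsrun --nrs 1 --cpu_per_rs 21 --gpu_per_rs 1 --rs_per_host 1 --tasks_per_rs $NTASKS ./miniWeather_mpi_openacc"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd
  fi
done
```

Running it with DRY_RUN=0 inside the interactive job should repeat the same launch once per task count, which is presumably how the tables in this issue were gathered.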

########################################################
## Managed memory results (reproducible run-to-run)
########################################################
$NTASKS runtime
1       17.99697270000000
2       57.40020790000000
3       19.74362180000000
4       65.50862780000000
5       32.54771990000000
6       61.52739090000000
7       67.68837590000000

########################################################
## Test Explicit Memory and MPS
########################################################
git checkout explicit-mps-summit
make openacc
jsrun --nrs 1 --cpu_per_rs 21 --gpu_per_rs 1 --rs_per_host 1 --tasks_per_rs $NTASKS ./miniWeather_mpi_openacc

########################################################
## Explicit memory results (reproducible run-to-run)
########################################################
$NTASKS runtime
1       18.50592800000000
2       18.45240590000000
3       18.25672410000000
4       19.26276300000000
5       20.46120090000000
6       21.49306790000000
7       21.78880780000000
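
To make the comparison between the two tables easier to read, the managed/explicit runtime ratio can be computed per task count. This is only a sketch: the numbers are copied from the two result tables above, and the variable names (data, ratios) are illustrative.

```shell
# Columns: NTASKS, managed-memory runtime, explicit-memory runtime
# (values copied from the two result tables above).
data="1 17.99697270 18.50592800
2 57.40020790 18.45240590
3 19.74362180 18.25672410
4 65.50862780 19.26276300
5 32.54771990 20.46120090
6 61.52739090 21.49306790
7 67.68837590 21.78880780"

# Print NTASKS and the managed/explicit runtime ratio, two decimals.
ratios=$(printf '%s\n' "$data" | awk '{ printf "%d %.2f\n", $1, $2 / $3 }')
echo "$ratios"
```

A ratio near 1 means managed memory costs little at that task count; in these numbers the ratio is close to 1 at 1 and 3 tasks and roughly 3 at 2, 4, 6, and 7 tasks.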