cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

multi-GPU simulations broken for dust build #266

Closed: helenarichie closed this issue 1 year ago

helenarichie commented 1 year ago

After commit a94a4d2d on dev, I am no longer able to run a simulation with the dust build on more than one GPU. I tried compiling both the dust and hydro builds with the code checked out to this commit and got the following error:

[her45@ppc-n0 cholla]$ make TYPE=hydro -j
builds/prereq.sh build crc
mpicc -Ofast -DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH='"a94a4d2d0556407b94021a6dbb2e18b9915c8e45"' -DMACRO_FLAGS='"-DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH='"a94a4d2d0556407b94021a6dbb2e18b9915c8e45"'"' -Isrc -c src/mpi/MPI_Comm_node.c -o src/mpi/MPI_Comm_node.o
mpicxx -Ofast -std=c++17 -DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH='"a94a4d2d0556407b94021a6dbb2e18b9915c8e45"' -DMACRO_FLAGS='"-DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH='"a94a4d2d0556407b94021a6dbb2e18b9915c8e45"'"' -Isrc -I/ihome/crc/install/power9/hdf5/1.12.0/build-gcc-10.1.0/include -I/ihome/crc/install/power9/cuda/11.1.0/include -c src/analysis/analysis.cpp -o src/analysis/analysis.o
src/mpi/MPI_Comm_node.c:2:12: fatal error: ../mpi/MPI_Comm_node.h: No such file or directory
    2 |   #include "../mpi/MPI_Comm_node.h"
      |            ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:206: src/mpi/MPI_Comm_node.o] Error 1
make: *** Waiting for unfinished jobs....

The next commit after this one (4a92255) does compile for both the dust and hydro builds, but when I actually run the simulation with the command mpirun -np 2 ./cholla.dust.crc cloud-wind.txt, I get the following error:

[her45@ppc-n0 2023-03-10]$ mpirun -np 2 ./cholla.dust.crc cloud-wind.txt 
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              ppc-n0
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   ppc-n0
  Local device: mlx5_0
--------------------------------------------------------------------------
Git Commit Hash = 4a92255f98962ac2be1bf7eeeb5111042d096437
Macro Flags     = -DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH=4a92255f98962ac2be1bf7eeeb5111042d096437
Parameter values:  nx = 1024, ny = 256, nz = 256, tout = 60000.000000, init = Clouds, boundaries = 4 3 3 3 3 3
Output directory:  ./hdf5/

Creating Log File: run_output.log 

  File exists, appending values: run_output.log 

nproc_x 2 nproc_y 1 nproc_z 1
Allocating MPI communication buffers on GPU (nx = 1572864, ny = 3194880, nz = 3294720).
Cloud positions: 0.080000 0.020000 0.020000
Local number of grid cells: 512 256 256 36241920
Setting initial conditions...
Cloud positions: 0.080000 0.020000 0.020000
Initial conditions set.
Setting boundary conditions...
Boundary conditions set.
Dimensions of each cell: dx = 0.000156 dy = 0.000156 dz = 0.000156
Ratio of specific heats gamma = 1.666667
Nstep = 0  Simulation time = 0.000000
Writing initial conditions to file...

Saving Snapshot: 0 
[ppc-n0.crc.pitt.edu:119982] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[ppc-n0.crc.pitt.edu:119982] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ppc-n0.crc.pitt.edu:119982] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
Starting calculations.
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60223,1],1]
  Exit code:    188

Note that I have no problem running a smaller version of this simulation that fits on one GPU with this version of Cholla. I can run it, seemingly without any issues, on one GPU using just the executable and the mpirun command. I can also get the larger simulation to "run" on one GPU with mpirun (although nothing actually runs, because the GPU is overloaded). The problem occurs only when I ask mpirun for 2 GPUs.

evaneschneider commented 1 year ago

Can you confirm whether a multi-GPU hydro-only sim runs? If the hydro tests are still passing, it seems like this must be an issue isolated to the dust branch, since the hydro tests include a 4-GPU run.

bcaddy commented 1 year ago

@helenarichie and I (mostly Helena) are working on fixing this. Currently, tests, including the MPI ones, run fine on C-3PO, so it might be an issue with the CRC PPC nodes or our config for them.

bcaddy commented 1 year ago

This appears to be limited to the dust build.

helenarichie commented 1 year ago

I ran the sod256 simulation with the hydro build and it seemed to work fine. The dust simulations I'm running use the wind boundary and cloud initial conditions, so it could be a problem with the dust model or with either of those; I'm working on checking them now.

helenarichie commented 1 year ago

After a little more investigation, we're still not sure what's going on. Line 37 of cuda_boundaries.cu is still throwing the error even after adding a CudaCheckError(); call immediately before that line and immediately after the Wind_boundary_kernel launch on line 539 of cuda_boundaries.cu (the kind of bracketing sketched below).
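
For context, this is roughly the kind of bracketing being described: a check-and-synchronize call on either side of the kernel launch, so that an asynchronous illegal access is reported at the launch site instead of at some later, unrelated call. This is only a minimal, self-contained sketch; the macro and kernel below are stand-ins for Cholla's actual CudaCheckError() and Wind_boundary_kernel, not the project's code.

// Minimal sketch; the macro and kernel names are placeholders.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check for a pending error, then synchronize so faults raised inside a
// previously launched kernel (e.g. an illegal memory access) surface here.
#define CheckErrorSketch()                                                \
  do {                                                                    \
    cudaError_t e = cudaGetLastError();                                   \
    if (e == cudaSuccess) e = cudaDeviceSynchronize();                    \
    if (e != cudaSuccess) {                                               \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,    \
              cudaGetErrorString(e));                                     \
      exit(EXIT_FAILURE);                                                 \
    }                                                                     \
  } while (0)

// Stand-in for a boundary kernel: writes one value per cell.
__global__ void Boundary_kernel_sketch(double *buf, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 0.0;
}

int main()
{
  const int n = 1 << 20;
  double *d_buf;
  cudaMalloc(&d_buf, n * sizeof(double));

  CheckErrorSketch();  // before the launch: catch any earlier pending error
  Boundary_kernel_sketch<<<(n + 255) / 256, 256>>>(d_buf, n);
  CheckErrorSketch();  // after the launch: synchronize and surface kernel faults

  cudaFree(d_buf);
  return 0;
}

The cudaDeviceSynchronize() is what lets the second check catch faults from the kernel body itself; without it, cudaGetLastError() only reports launch-configuration problems.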

I also tried turning off the custom boundary conditions altogether in the input file and got the same error, so maybe it's not the wind boundary after all?

alwinm commented 1 year ago

The local grid is somewhat large: 512 x 256 x 256.

If you reduce it to 256 x 128 x 128, will it work? You can either halve nx, ny, and nz, or run mpirun with 16 processes instead of 2 (8x more GPUs). Rough cell counts are below.
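
For a rough sense of scale (the 4-cell ghost layer per side is inferred from the cell totals printed in the logs above, so this is just arithmetic on those numbers):

(512 + 8) x (256 + 8) x (256 + 8) = 36,241,920 cells per GPU with 2 ranks
(256 + 8) x (128 + 8) x (128 + 8) = 4,882,944 cells per GPU with 16 ranks (a 4 x 2 x 2 decomposition)

so going from 2 to 16 ranks cuts the per-GPU cell count by about a factor of 7.4 (a bit less than 8, because the ghost layers don't shrink).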

alwinm commented 1 year ago

On ppc, I also find that I need to do this:

export OMPI_MCA_oob="^ud"
export OMPI_MCA_btl="^openib"

to avoid the Open MPI warnings/errors. But it's confusing if this is isolated to the dust build.

helenarichie commented 1 year ago

I tried running the same simulation with 4 processes instead of 2 and got the same error:

[her45@ppc-n0 2023-03-14]$ mpirun -np 4 ./cholla.dust.crc cloud-wind.txt 
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              ppc-n0
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   ppc-n0
  Local device: mlx5_0
--------------------------------------------------------------------------
Git Commit Hash = 5fa7087e41ba88a863d8e49d7bf2f4e7975c787e
Macro Flags     = -DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH=5fa7087e41ba88a863d8e49d7bf2f4e7975c787e
Parameter values:  nx = 1024, ny = 256, nz = 256, tout = 60000.000000, init = Clouds, boundaries = 4 3 3 3 3 3
Output directory:  ./hdf5/

Creating Log File: run_output.log 

  File exists, appending values: run_output.log 

nproc_x 2 nproc_y 2 nproc_z 1
Allocating MPI communication buffers on GPU (nx = 786432, ny = 3194880, nz = 1697280).
Cloud positions: 0.080000 0.020000 0.020000
Cloud positions: 0.080000 0.020000 0.020000
Local number of grid cells: 512 128 256 18670080
Cloud positions: 0.080000 0.020000 0.020000
Setting initial conditions...
Cloud positions: 0.080000 0.020000 0.020000
Initial conditions set.
Setting boundary conditions...
Boundary conditions set.
Dimensions of each cell: dx = 0.000156 dy = 0.000156 dz = 0.000156
Ratio of specific heats gamma = 1.666667
Nstep = 0  Simulation time = 0.000000
Writing initial conditions to file...

Saving Snapshot: 0 
[ppc-n0.crc.pitt.edu:279583] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[ppc-n0.crc.pitt.edu:279583] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ppc-n0.crc.pitt.edu:279583] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
Starting calculations.
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31627,1],2]
  Exit code:    188

helenarichie commented 1 year ago

I also ran a 1024x256x256 hydro-only simulation and did not run into this issue. Note that that simulation has a slightly smaller memory footprint than the same-size dust simulation, because of the extra scalar field in the dust build (rough numbers below).
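
As a back-of-the-envelope comparison for the 36,241,920-cell local grid of the 2-rank run (assuming the conserved array holds 5 hydro fields plus the single dust scalar, all in double precision; the field count is an assumption based on the build flags, and actual GPU usage is higher because the integrator also holds additional arrays and the MPI buffers):

36,241,920 cells x 5 fields x 8 bytes ≈ 1.45 GB per copy of the conserved array (hydro-only)
36,241,920 cells x 6 fields x 8 bytes ≈ 1.74 GB per copy of the conserved array (hydro + dust scalar)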

evaneschneider commented 1 year ago

Did you also try with the OpenMPI export commands Alwin posted above?

helenarichie commented 1 year ago

Yes, those just get rid of this part of the message:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              ppc-n0
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   ppc-n0
  Local device: mlx5_0
--------------------------------------------------------------------------

Everything else is the same.

bcaddy commented 1 year ago

This should be fixed by PR #269, and @helenarichie has tested the fix.