Closed helenarichie closed 1 year ago
Can you confirm whether a multi-GPU, hydro-only sim runs? If the hydro tests are still passing, it seems like this must be an issue isolated to the dust branch, since the hydro tests include a 4-GPU run.
@helenarichie and I (mostly Helena) are in progress on fixing this. Currently, tests (including MPI ones) run fine on C-3PO, so it might be an issue with the CRC PPC nodes or with our config for them.
This appears to be limited to the dust build.
I ran the sod256 simulation with the hydro build and it seemed to work fine. The dust simulations I'm running use the wind boundary and cloud initial conditions, so the problem could be in the dust model or in either of those; I'm checking them now.
After a little more investigation, we're still not sure what's going on. Line 37 of cuda_boundaries.cu (mentioned above) is still throwing the error, even after adding a `CudaCheckError();` call right before that line and right after the `Wind_boundary_kernel` launch on line 539 of cuda_boundaries.cu.
I also tried turning off the custom boundary conditions altogether in the input file and got the same error, so maybe it's not the wind boundary after all?
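For reference, the check-before-and-after pattern described above can be sketched generically like this (a minimal error-check macro of my own, not Cholla's actual `CudaCheckError` implementation; the launch configuration is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Catches both immediate launch errors and asynchronous execution errors
// (like cudaErrorIllegalAddress) by synchronizing before checking.
#define CHECK_CUDA()                                                    \
  do {                                                                  \
    cudaError_t err = cudaGetLastError();                               \
    if (err == cudaSuccess) err = cudaDeviceSynchronize();              \
    if (err != cudaSuccess) {                                           \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,  \
              cudaGetErrorString(err));                                 \
    }                                                                   \
  } while (0)

// Usage around a kernel launch (grid/block sizes are placeholders):
//   CHECK_CUDA();  // confirm no earlier error is still pending
//   Wind_boundary_kernel<<<grid, block>>>(/* ... */);
//   CHECK_CUDA();  // an illegal access inside the kernel surfaces here
```

One caveat worth keeping in mind: CUDA errors are reported asynchronously, so without a synchronizing check after every launch, the line number in the error message (line 37 here) may only be where the error was *detected*, not where the bad access actually happened.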
The local grid is somewhat large: 512 x 256 x 256.
If you reduce it to 256 x 128 x 128, does it work? You can either halve nx, ny, and nz, or mpirun with 16 processes instead of 2 (8x more GPUs).
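Either option shrinks each GPU's share by the same factor, since per-GPU memory scales with the local cell count. A quick sketch of the arithmetic (the field count and precision below are illustrative assumptions, not Cholla's exact memory layout):

```python
def local_bytes(nx, ny, nz, n_fields=6, bytes_per_val=8):
    # per-GPU bytes for one conserved-variable array (ghost cells ignored)
    return nx * ny * nz * n_fields * bytes_per_val

big = local_bytes(512, 256, 256)    # current local grid
small = local_bytes(256, 128, 128)  # after halving each dimension
print(big // small)  # 8 -> an 8x smaller footprint per GPU
```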
On ppc, I also find that I need to do this:

```
export OMPI_MCA_oob="^ud"
export OMPI_MCA_btl="^openib"
```

to avoid the OpenMPI warnings/errors. But it's confusing that this would be isolated to the dust build.
I tried running the same simulation with 4 processes instead of 2 and got the same error:

```
[her45@ppc-n0 2023-03-14]$ mpirun -np 4 ./cholla.dust.crc cloud-wind.txt
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: ppc-n0
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: ppc-n0
Local device: mlx5_0
--------------------------------------------------------------------------
Git Commit Hash = 5fa7087e41ba88a863d8e49d7bf2f4e7975c787e
Macro Flags = -DCUDA -DMPI_CHOLLA -DBLOCK -DPRECISION=2 -DPPMC -DHLLC -DAVERAGE_SLOW_CELLS -DTEMPERATURE_FLOOR -DVL -DSCALAR -DDUST -DCOOLING_GPU -DSLICES -DPROJECTION -DOUTPUT -DHDF5 -DMPI_GPU -DGIT_HASH=5fa7087e41ba88a863d8e49d7bf2f4e7975c787e
Parameter values: nx = 1024, ny = 256, nz = 256, tout = 60000.000000, init = Clouds, boundaries = 4 3 3 3 3 3
Output directory: ./hdf5/
Creating Log File: run_output.log
File exists, appending values: run_output.log
nproc_x 2 nproc_y 2 nproc_z 1
Allocating MPI communication buffers on GPU (nx = 786432, ny = 3194880, nz = 1697280).
Cloud positions: 0.080000 0.020000 0.020000
Cloud positions: 0.080000 0.020000 0.020000
Local number of grid cells: 512 128 256 18670080
Cloud positions: 0.080000 0.020000 0.020000
Setting initial conditions...
Cloud positions: 0.080000 0.020000 0.020000
Initial conditions set.
Setting boundary conditions...
Boundary conditions set.
Dimensions of each cell: dx = 0.000156 dy = 0.000156 dz = 0.000156
Ratio of specific heats gamma = 1.666667
Nstep = 0 Simulation time = 0.000000
Writing initial conditions to file...
Saving Snapshot: 0
[ppc-n0.crc.pitt.edu:279583] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[ppc-n0.crc.pitt.edu:279583] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ppc-n0.crc.pitt.edu:279583] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
Starting calculations.
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
CUDA ERROR AT LINE 37 OF FILE 'src/grid/cuda_boundaries.cu': cudaErrorIllegalAddress an illegal memory access was encountered
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[31627,1],2]
Exit code: 188
```
I also ran a 1024x256x256 hydro-only simulation and did not run into this issue. Note that that simulation uses slightly less memory than the same-size dust simulation, because the dust build carries an extra scalar field.
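To put a rough number on that difference: one extra passive scalar on top of the hydro conserved variables is a fixed fractional overhead per cell (the 5-field hydro count below is an assumption, not Cholla's exact layout):

```python
def field_bytes(n_cells, n_fields, bytes_per_val=8):
    # bytes for one copy of the conserved-variable array
    return n_cells * n_fields * bytes_per_val

n_cells = (1024 * 256 * 256) // 2   # cells per GPU with 2 ranks, ghosts ignored
hydro = field_bytes(n_cells, 5)     # assumed: density, 3 momenta, energy
dust = field_bytes(n_cells, 6)      # plus one passive scalar for dust
print(dust / hydro)  # 1.2 -> ~20% more memory per array copy
```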
Did you also try with the OpenMPI export commands Alwin posted above?
Yes, those just get rid of this part of the message:
```
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: ppc-n0
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: ppc-n0
Local device: mlx5_0
--------------------------------------------------------------------------
```
Everything else is the same.
This should be fixed by PR #269, and @helenarichie has tested the fix.
After commit a94a4d2d on dev, I am no longer able to run a simulation with the dust build on more than one GPU. I tried compiling both the dust and hydro builds checked out to this commit and got the following error:
The next commit after that (4a92255) does compile for both the dust and hydro builds, but when I actually run the simulation using the command

```
mpirun -np 2 ./cholla.dust.crc cloud-wind.txt
```

I get the following error. Note that I have no problem running a smaller version of this simulation that fits on one GPU with this version of Cholla: I'm able to run it seemingly without any issues on one GPU, both with just the executable and with the `mpirun` command. I'm also able to get the larger simulation to "run" on one GPU with `mpirun` (although nothing actually runs because the GPU is overloaded). The problem occurs only when I specify that I want to use 2 GPUs with `mpirun`.