denisbertini opened 1 year ago
Thanks for your interest in the code.
Could you share the content of your run-file.sh script? In particular, are you calling an MPI runner, such as mpirun, srun, or jsrun, from inside run-file.sh?
For example, submitting with the above Slurm command gives at initialization:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.05) initialized
PICSAR (1903ecfff51a)
WarpX (23.05)
__        __             __  __
\ \      / /_ _ _ __ _ __\ \/ /
 \ \ /\ / / _` | '__| '_ \\  /
  \ V  V / (_| | |  | |_) /  \
   \_/\_/ \__,_|_|  | .__/_/\_\
                     |_|
Level 0: dt = 1.530214125e-17 ; dx = 5.580357143e-09 ; dz = 8.081896552e-09
Grids Summary:
Level 0 29 grids 9977856 cells 100 % of domain
smallest grid: 2688 x 128 biggest grid: 2688 x 128
Should it not be "HIP initialized with 4 devices", with 1 GPU device per MPI rank?
run-file.sh:
#!/bin/bash
#export CONT=/lustre/rz/dbertini/containers/prod/gpu/rlx8_rocm-5.4.3_warpx.sif
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
srun --export=ALL -- $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
OK, thanks.
Based on the output that you sent, I think that WarpX is actually using 4 GPUs. I think that the message "HIP initialized with 1 device." is to be understood per MPI rank.
@WeiqunZhang Could you confirm that this is the case?
Unfortunately, on the node where the job is running, I can see that only one GPU is in use, as shown by the rocm-smi utility.
Here is, for example, a snapshot of the rocm-smi output while the job is running:
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 71.0c 198.0W 1502Mhz 1200Mhz 0% auto 290.0W 84% 99%
1 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
2 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
3 39.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
4 59.0c 40.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 40.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
One can see that only the GPU at index 0 is used; the other requested GPUs are idle... any idea?
"HIP initialized with 1 device." means only one device is being used by all the processes. This is most likely a job script issue.
Yes, but it is still not clear to me what could possibly be wrong in my job script...
It depends on how Slurm is configured on that system. Maybe try to change --cpus-per-task 1. Figure out how many CPUs you have on a node and divide that by 4. The issue might be that all your processes were using CPUs that were close to one GPU, and that GPU was mapped to all 4 processes. There is nothing we can do in the C++ code if the GPUs are not visible to us.
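For illustration only, an explicit one-GPU-per-rank request could be attempted with Slurm options along these lines; whether the binding behaves as intended depends on the site's Slurm and gres configuration, so treat this as an assumption to verify rather than a known-good recipe:

# hypothetical variant of the submission: one GPU and 24 of the 96 cores per task
sbatch --nodes 1 --ntasks-per-node 4 --cpus-per-task 24 \
       --gpus-per-task=1 --gpu-bind=closest \
       --partition gpu --time 0-8:00:00 ./run-file.sh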
I have 96 processors on one machine, so I changed to --cpus-per-task 24, but it still uses only one GPU; it does not help.
Maybe instead of --gres=gpu:4, you can try --gpus-per-task=1 and --gpu-bind=verbose,single:1.
With your change I got the following error:
gpu-bind: usable_gres=0x8; bit_alloc=0xF; local_inx=4; global_list=3; local_list=3
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
gpu-bind: usable_gres=0x4; bit_alloc=0xF; local_inx=4; global_list=2; local_list=2
gpu-bind: usable_gres=0x2; bit_alloc=0xF; local_inx=4; global_list=1; local_list=1
amrex::Abort::1::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::2::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::3::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
So it seems that, from the WarpX or AMReX perspective, there is only one GPU device on the node?
Resubmitting with --ntasks-per-node 1 works fine; only one GPU is visible to WarpX.
The error message means processes 1, 2 and 3 see zero GPUs as reported by hipGetDeviceCount. Only process 0 sees a GPU.
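As an illustration of what that check amounts to, a minimal MPI + HIP sketch that prints the device count seen by each rank could look like this; in the failing configuration, ranks 1-3 would report 0 devices:

// check_devices.cpp -- print the HIP device count visible to each MPI rank
// (illustrative sketch; build e.g. with hipcc and an MPI compiler wrapper)
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main (int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndevices = 0;
    hipError_t err = hipGetDeviceCount(&ndevices);   // the same query AMReX makes
    if (err != hipSuccess) { ndevices = 0; }

    std::printf("rank %d sees %d HIP device(s)\n", rank, ndevices);

    MPI_Finalize();
    return 0;
}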
You can also run rocm-smi instead of WarpX. I suspect you will see the same behavior, that is, only one GPU in total is available.
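As an additional illustrative check, one could print per rank which GPUs each task is allowed to see; this sketch assumes ROCR_VISIBLE_DEVICES is the variable Slurm sets for AMD GPUs on this site, which depends on the gres configuration:

srun --export=ALL bash -c \
  'echo "rank ${SLURM_PROCID}: ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES:-unset}"'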
That is a good idea!
So rocm-smi sees all 8 devices:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 45.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
1 43.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
2 43.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
3 44.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
4 41.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 39.0c 38.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
and it was launched using the same Slurm command as for WarpX.
Did you run it under srun?
Yes, the same command.
Additionally, I ran another GPU-based PIC code, namely PIConGPU, and it seems to see all devices and use all GPUs on the machine without a problem.
So you ran srun --export=ALL -- $CONT rocm-smi $WDIR/scripts/inputs/warpx_opmd_deck ?
No, without the input deck file, just:
srun --export=ALL -- $CONT rocm-smi
What is /cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif?
This is a Singularity container that contains the whole software stack needed to run WarpX, including ROCm.
Could you add the following lines after line 256 (device_id = my_rank % gpu_device_count;) of build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp and recompile?
amrex::AllPrint() << "Proc. " << ParallelDescriptor::MyProc()
                  << ": nprocspernode = " << ParallelDescriptor::NProcsPerNode()
                  << ", my_rank = " << my_rank << ", device count = "
                  << gpu_device_count << "\n";
Hopefully this can give us more information.
Reinstalling with v23.06 and the modifications to the AMReX code that you asked for gives:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 1: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 0, device count = 4
HIP initialized with 1 device.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)
Strange... my_rank is always 0?
That my_rank is the rank in a subcommunicator of type MPI_COMM_TYPE_SHARED. The issue is that there is only one process per "node", probably because of the container. That is, in this configuration the CPUs and their memory are not shared, whereas the GPUs are shared.
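For context, that node-local rank is essentially what MPI reports after splitting MPI_COMM_WORLD with MPI_COMM_TYPE_SHARED; the following sketch (illustrative only, not the AMReX source) shows why every process reports rank 0 when each of them ends up alone in its shared-memory subcommunicator:

// shared_rank.cpp -- illustrate the node-local (shared-memory) rank
#include <mpi.h>
#include <cstdio>

int main (int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Split into subcommunicators of processes that can share memory.
    // If the container isolates each process, every subcommunicator has
    // size 1 and shared_rank is 0 for everyone.
    MPI_Comm shared;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shared);

    int shared_rank = 0, shared_size = 0;
    MPI_Comm_rank(shared, &shared_rank);
    MPI_Comm_size(shared, &shared_size);

    std::printf("world rank %d: shared rank %d of %d\n",
                world_rank, shared_rank, shared_size);

    MPI_Comm_free(&shared);
    MPI_Finalize();
    return 0;
}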
You can try to map only one GPU to each MPI task in the Slurm job script, maybe with more explicit GPU mapping. Or you can modify your AMReX source code. You can make the following change, which should work for your specific case.
diff --git a/Src/Base/AMReX_GpuDevice.cpp b/Src/Base/AMReX_GpuDevice.cpp
index d709531440..cfd7a39e5c 100644
--- a/Src/Base/AMReX_GpuDevice.cpp
+++ b/Src/Base/AMReX_GpuDevice.cpp
@@ -253,6 +253,7 @@ Device::Initialize ()
     // ranks to GPUs, assuming that socket awareness has already
     // been handled.
 
+    my_rank = ParallelDescriptor::MyProc();
     device_id = my_rank % gpu_device_count;
 
     // If we detect more ranks than visible GPUs, warn the user
We will try to fix this in the next release.
But gpu_device_count is also not correct; it should be 8 and not 4 in my case...
I think that's because of --ntasks-per-node 4 --cpus-per-task 1 --gres=gpu:4.
Ah, that is correct, thanks!
Your patch seems to work:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 1: nprocspernode = 1, my_rank = 1, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 3, device count = 4
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 2, device count = 4
HIP initialized with 4 devices.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)
But it is quite a change in the AMReX logic!
And checking with rocm-smi indeed shows the proper GPU usage:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 63.0c 100.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 92%
1 61.0c 326.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
2 72.0c 249.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
3 62.0c 252.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
4 43.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 42.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 42.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 40.0c 37.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
But I see that the GPU usage varies between 0-99%. Is this correct?
Is there a way with WarpX to measure the usage efficiency when running on GPUs?
ROCm has a profiling tool. You can try that.
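For example, ROCm ships the rocprof command-line profiler, which can collect per-kernel statistics for a run. Assuming rocprof is available inside the container and following the same launch pattern as above, a run might look roughly like this illustrative sketch (paths and flags to adapt):

# collect per-kernel statistics for a run (sketch; adapt paths and flags)
srun --export=ALL -- $CONT rocprof --stats warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck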
@denisbertini Just curious about the Singularity container, which I have used before.
Without the patch above that @WeiqunZhang suggests, doesn't one usually start them as:
$ srun -n <NUMBER_OF_RANKS> singularity exec <PATH/TO/MY/IMAGE.sif> </PATH/TO/BINARY/WITHIN/CONTAINER>
https://docs.sylabs.io/guides/3.3/user-guide/mpi.html
So in your case:
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
srun --export=ALL singularity exec $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
?
Yes, this is actually exactly the way I do it.
The other script I sent before is just another variant.
In this case, WarpX is already installed within the container, and all warpx_1d/2d/3d executables can then be executed directly.
Awesome. One hint for parallel sims: you can also try out our dynamic load balancing capabilities.
With the AMReX block size, you can aim to create 4-12 blocks per GPU so the algorithm can move them around based on the cost function you pick. (Of course, your problem needs to be large enough not to underutilize the GPUs with too little work.) Generally, the Knapsack distribution works well, and you can use CPU and GPU timers or heuristics for cost estimates; see the sketch below.
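For illustration, the corresponding WarpX input-deck settings could look roughly like the following sketch; the parameter names are given from memory of the WarpX documentation and should be checked against the current docs before use:

# grid decomposition: aim for several blocks per GPU
amr.max_grid_size = 128              # upper bound on block size (cells per direction)
amr.blocking_factor = 32             # block sizes are multiples of this

# dynamic load balancing (parameter names assumed from the WarpX docs)
algo.load_balance_intervals = 100        # rebalance every 100 steps
algo.load_balance_with_sfc = 0           # 0: knapsack distribution, 1: space-filling curve
algo.load_balance_costs_update = heuristic   # or: timers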
Is your original issue addressed? We could continue in new issues if this is all set now :)
This is now set; waiting for the next release to eventually solve it.
@ax3l BTW, I created different Singularity definition files which could eventually allow any user to run WarpX on any HPC system that provides Singularity/Apptainer as container technology. Definition files for CPUs and GPUs (AMD + ROCm) are provided.
Maybe interesting for WarpX users?
@WeiqunZhang just checking, are you working on a related AMReX PR that we should link & track? :)
@denisbertini awesome :star_struck: moved to #3994 to keep things organized :)
Yes, I have some thoughts on how to handle this.
@denisbertini Could you please give https://github.com/AMReX-Codes/amrex/pull/3382 a try and let us know if it works for you?
I am trying to use more than one GPU for WarpX on our AMD machine, without success. The setting is the following:
sbatch --reservation gpu_tests --nodes 1 --ntasks-per-node 4 --cpus-per-task 1 --gres=gpu:4 --mem-per-gpu 48000 --no-requeue --job-name warpx --mail-type ALL --mail-user d.bertini@gsi.de --partition gpu --time 0-8:00:00 -D ./ -o %j.out.log -e %j.err.log --nodelist=lxbk1099 ./run-file.sh
From the resource selection used (--gres=gpu:4), I would expect 4 GPUs to be used. Instead, only one is used. Is there anything else one should be aware of when running on a multi-GPU machine? If yes, could you give me an example I can test on our system?