denisbertini opened 1 year ago
Thanks for your interest in the code.
Could you share the content of your run-file.sh script? In particular, are you calling an MPI runner, such as mpirun, srun, or jsrun, from inside run-file.sh?
For example, submitting with the above Slurm command gives at initialization:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.05) initialized
PICSAR (1903ecfff51a)
WarpX (23.05)
__        __             __  __
\ \      / /_ _ _ __ _ __\ \/ /
 \ \ /\ / / _` | '__| '_ \\  /
  \ V  V / (_| | |  | |_) /  \
   \_/\_/ \__,_|_|  | .__/_/\_\
                     |_|
Level 0: dt = 1.530214125e-17 ; dx = 5.580357143e-09 ; dz = 8.081896552e-09
Grids Summary:
Level 0 29 grids 9977856 cells 100 % of domain
smallest grid: 2688 x 128 biggest grid: 2688 x 128
Should it not be "HIP initialized with 4 devices", with 1 GPU device per MPI rank?
run-file.sh:
#!/bin/bash
#export CONT=/lustre/rz/dbertini/containers/prod/gpu/rlx8_rocm-5.4.3_warpx.sif
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
srun --export=ALL -- $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
OK, thanks.
Based on the output that you sent, I think that WarpX is actually using 4 GPUs. I think that the message "HIP initialized with 1 device." is to be understood per MPI rank.
@WeiqunZhang Could you confirm that this is the case?
Unfortunately, on the node where the job is running, I can see that only one GPU is in use, as shown by the rocm-smi utility.
Here is, for example, a snapshot of the rocm-smi output while the job is running:
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 71.0c 198.0W 1502Mhz 1200Mhz 0% auto 290.0W 84% 99%
1 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
2 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
3 39.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
4 59.0c 40.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 40.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
One can see that only the GPU at index 0 is used; the other requested GPUs are idle... any idea?
"HIP initialized with 1 device." means only one device is being used by all the processes. This is most likely a job script issue.
Yes, but it is still not clear to me what could possibly be wrong in my job script...
It depends on how Slurm is configured on that system. Maybe try to change --cpus-per-task 1. Figure out how many CPUs you have on a node and divide that by 4. The issue might be that all your processes were using CPUs that were close to one GPU, and that GPU was mapped to all 4 processes. There is nothing we can do in the C++ code if the GPUs are not visible to us.
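For illustration only, an explicit one-GPU-per-rank request could be attempted with Slurm options along these lines; whether the binding behaves as intended depends on the site's Slurm and gres configuration, so treat this as an assumption to verify rather than a known-good recipe:

# hypothetical variant of the submission: one GPU and 24 of the 96 cores per task
sbatch --nodes 1 --ntasks-per-node 4 --cpus-per-task 24 \
       --gpus-per-task=1 --gpu-bind=closest \
       --partition gpu --time 0-8:00:00 ./run-file.sh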
I have 96 processors on one machine, so I changed to --cpus-per-task 24, but it still uses only one GPU; it does not help.
Maybe instead of --gres=gpu:4, you can try --gpus-per-task=1 and --gpu-bind=verbose,single:1.
With your change I got the following error:
gpu-bind: usable_gres=0x8; bit_alloc=0xF; local_inx=4; global_list=3; local_list=3
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
gpu-bind: usable_gres=0x4; bit_alloc=0xF; local_inx=4; global_list=2; local_list=2
gpu-bind: usable_gres=0x2; bit_alloc=0xF; local_inx=4; global_list=1; local_list=1
amrex::Abort::1::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::2::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::3::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
So it seems that, from the WarpX or AMReX perspective, there is only one GPU device on the node?
Resubmitting with --ntasks-per-node 1 works fine; only one GPU is visible to WarpX.
The error message means processes 1, 2 and 3 see zero GPUs as reported by hipGetDeviceCount. Only process 0 sees a GPU.
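As an illustration of what that check amounts to, a minimal MPI + HIP sketch that prints the device count seen by each rank could look like this; in the failing configuration, ranks 1-3 would report 0 devices:

// check_devices.cpp -- print the HIP device count visible to each MPI rank
// (illustrative sketch; build e.g. with hipcc and an MPI compiler wrapper)
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main (int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndevices = 0;
    hipError_t err = hipGetDeviceCount(&ndevices);   // the same query AMReX makes
    if (err != hipSuccess) { ndevices = 0; }

    std::printf("rank %d sees %d HIP device(s)\n", rank, ndevices);

    MPI_Finalize();
    return 0;
}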
You can also run rocm-smi instead of WarpX. I suspect you will see the same behavior, that is, only one GPU in total is available.
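As an additional illustrative check, one could print per rank which GPUs each task is allowed to see; this sketch assumes ROCR_VISIBLE_DEVICES is the variable Slurm sets for AMD GPUs on this site, which depends on the gres configuration:

srun --export=ALL bash -c \
  'echo "rank ${SLURM_PROCID}: ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES:-unset}"'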
That is a good idea!
So rocm-smi sees all 8 devices:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 45.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
1 43.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
2 43.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
3 44.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
4 41.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 39.0c 38.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
and it was launched using the same Slurm command as for WarpX.
Did you run it under srun?
Yes, the same command.
Additionally, I ran another GPU-based PIC code, namely PIConGPU, and it seems to see all devices and use all GPUs on the machine without a problem.
So you ran srun --export=ALL -- $CONT rocm-smi $WDIR/scripts/inputs/warpx_opmd_deck ?
No, without the input deck file, just:
srun --export=ALL -- $CONT rocm-smi
What is /cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif?
This is a Singularity container that contains the whole software stack needed to run WarpX, including ROCm.
Could you add the following lines after line 256 (device_id = my_rank % gpu_device_count;) of build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp and recompile?
amrex::AllPrint() << "Proc. " << ParallelDescriptor::MyProc()
                  << ": nprocspernode = " << ParallelDescriptor::NProcsPerNode()
                  << ", my_rank = " << my_rank << ", device count = "
                  << gpu_device_count << "\n";
Hopefully this can give us more information.
Reinstalling with v23.06 and the modifications to the AMReX code that you asked for gives:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 1: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 0, device count = 4
HIP initialized with 1 device.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)
Strange... my_rank is always 0?
That my_rank is the rank in a subcommunicator of type MPI_COMM_TYPE_SHARED. The issue is that there is only one process per "node", probably because of the container. That is, in this configuration the CPUs and their memory are not shared, whereas the GPUs are shared.
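For context, that node-local rank is essentially what MPI reports after splitting MPI_COMM_WORLD with MPI_COMM_TYPE_SHARED; the following sketch (illustrative only, not the AMReX source) shows why every process reports rank 0 when each of them ends up alone in its shared-memory subcommunicator:

// shared_rank.cpp -- illustrate the node-local (shared-memory) rank
#include <mpi.h>
#include <cstdio>

int main (int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Split into subcommunicators of processes that can share memory.
    // If the container isolates each process, every subcommunicator has
    // size 1 and shared_rank is 0 for everyone.
    MPI_Comm shared;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shared);

    int shared_rank = 0, shared_size = 0;
    MPI_Comm_rank(shared, &shared_rank);
    MPI_Comm_size(shared, &shared_size);

    std::printf("world rank %d: shared rank %d of %d\n",
                world_rank, shared_rank, shared_size);

    MPI_Comm_free(&shared);
    MPI_Finalize();
    return 0;
}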
You can try to map only one GPU to each MPI task in the Slurm job script, maybe with more explicit GPU mapping. Or you can modify your AMReX source code. You can make the following change, which should work for your specific case.
diff --git a/Src/Base/AMReX_GpuDevice.cpp b/Src/Base/AMReX_GpuDevice.cpp
index d709531440..cfd7a39e5c 100644
--- a/Src/Base/AMReX_GpuDevice.cpp
+++ b/Src/Base/AMReX_GpuDevice.cpp
@@ -253,6 +253,7 @@ Device::Initialize ()
     // ranks to GPUs, assuming that socket awareness has already
     // been handled.
 
+    my_rank = ParallelDescriptor::MyProc();
     device_id = my_rank % gpu_device_count;
 
     // If we detect more ranks than visible GPUs, warn the user
We will try to fix this in the next release.
But gpu_device_count is also not correct; it should be 8 and not 4 in my case...
I think that's because of --ntasks-per-node 4 --cpus-per-task 1 --gres=gpu:4.
Ah, that is correct, thanks!
Your patch seems to work:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 1: nprocspernode = 1, my_rank = 1, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 3, device count = 4
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 2, device count = 4
HIP initialized with 4 devices.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)
But it is quite a change in the AMReX logic!
And checking with rocm-smi indeed shows the proper GPU usage:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 63.0c 100.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 92%
1 61.0c 326.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
2 72.0c 249.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
3 62.0c 252.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
4 43.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 42.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 42.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 40.0c 37.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
But I see that the GPU usage varies between 0-99%. Is this correct?
Is there a way with WarpX to measure the usage efficiency when running on GPUs?
ROCm has a profiling tool. You can try that.
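For example, ROCm ships the rocprof command-line profiler, which can collect per-kernel statistics for a run. Assuming rocprof is available inside the container and following the same launch pattern as above, a run might look roughly like this illustrative sketch (paths and flags to adapt):

# collect per-kernel statistics for a run (sketch; adapt paths and flags)
srun --export=ALL -- $CONT rocprof --stats warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck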
@denisbertini Just curious about the Singularity container, which I have used before.
Without the patch above that @WeiqunZhang suggests, doesn't one usually start them as:
$ srun -n <NUMBER_OF_RANKS> singularity exec <PATH/TO/MY/IMAGE.sif> </PATH/TO/BINARY/WITHIN/CONTAINER>
https://docs.sylabs.io/guides/3.3/user-guide/mpi.html
So in your case:
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
srun --export=ALL singularity exec $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
?
Yes, this is actually exactly the way I do it.
The other script I sent before is just another variant.
In this case, WarpX is already installed within the container, and all warpx_1d/2d/3d executables can then be executed directly.
Awesome. One hint for parallel sims: you can also try out our dynamic load balancing capabilities.
With the AMReX block size, you can aim to create 4-12 blocks per GPU so the algorithm can move them around based on the cost function you pick. (Of course, your problem needs to be large enough not to underutilize the GPUs with too little work.) Generally, the Knapsack distribution works well, and you can use CPU and GPU timers or heuristics for cost estimates; see the sketch below.
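For illustration, the corresponding WarpX input-deck settings could look roughly like the following sketch; the parameter names are given from memory of the WarpX documentation and should be checked against the current docs before use:

# grid decomposition: aim for several blocks per GPU
amr.max_grid_size = 128              # upper bound on block size (cells per direction)
amr.blocking_factor = 32             # block sizes are multiples of this

# dynamic load balancing (parameter names assumed from the WarpX docs)
algo.load_balance_intervals = 100        # rebalance every 100 steps
algo.load_balance_with_sfc = 0           # 0: knapsack distribution, 1: space-filling curve
algo.load_balance_costs_update = heuristic   # or: timers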
Is your original issue addressed? We could continue in new issues if this is all set now :)
This is now set; waiting for the next release to eventually solve it.
@ax3l BTW, I created different Singularity definition files which could eventually allow any user to run WarpX on any HPC system that provides Singularity/Apptainer as container technology. Definition files for CPUs and GPUs (AMD + ROCm) are provided.
Maybe interesting for WarpX users?
@WeiqunZhang just checking, are you working on a related AMReX PR that we should link & track? :)
@denisbertini awesome :star_struck: moved to #3994 to keep things organized :)
Yes, I have some thoughts on how to handle this.
@denisbertini Could you please give https://github.com/AMReX-Codes/amrex/pull/3382 a try and let us know if it works for you?
I am trying to use more than one GPU for WarpX on our AMD machine, without success. The setting is the following:
sbatch --reservation gpu_tests --nodes 1 --ntasks-per-node 4 --cpus-per-task 1 --gres=gpu:4 --mem-per-gpu 48000 --no-requeue --job-name warpx --mail-type ALL --mail-user d.bertini@gsi.de --partition gpu --time 0-8:00:00 -D ./ -o %j.out.log -e %j.err.log --nodelist=lxbk1099 ./run-file.sh
From the resource selection used (--gres=gpu:4), I would expect 4 GPUs to be used. Instead, only one is used. Is there anything else one should be aware of when running on a multi-GPU machine? If yes, could you give me an example I can test on our system?