ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Out-of-memory in multiple GPU mode, ROCm 4.3.1 on AMD MI 100 GPUs #3958

Closed denisbertini closed 1 year ago

denisbertini commented 2 years ago

Hi, I am able to run PIConGPU (dev branch) on our AMD MI 100 GPU cluster, but only in single-GPU mode. As soon as I try to run the code in multi-GPU mode with more MPI tasks, the PIConGPU process is killed by the OS and the slurm scheduler reports Out Of Memory errors:

slurmstepd: error: Detected 74 oom-kill event(s) in step 36403717.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: lxbk1122: task 0: Out Of Memory

The out-of-memory failure always happens just after program initialisation:

PIConGPU: 0.7.0-dev
  Build-Type: Release

Third party:
  OS:         Linux-3.10.0-1160.31.1.el7.x86_64
  arch:       x86_64
  CXX:        Clang (13.0.0)
  CMake:      3.20.5
  Boost:      1.75.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (4.0.3)
  PNGwriter:  0.7.0
  openPMD:    0.14.3
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00229 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.0247974)
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 23592960
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 360000 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 360000 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0 keV and corresponding Debye length 0 m.
   The grid has 0 cells per average Debye length
initialization time:  6sec 281msec = 6.281 sec

This is the GPU mapping I used to submit PIConGPU:

#SBATCH -J pog_1
#SBATCH -o /lustre/rz/dbertini/gpu/data/lwfa_002/pog_1_%j.out
#SBATCH -e /lustre/rz/dbertini/gpu/data/lwfa_002/pog_1_%j.err
#SBATCH -D /lustre/rz/dbertini/gpu/data/lwfa_002/
#SBATCH --partition gpu 
#SBATCH --gres=gpu:8  # number of GPUs per node
#SBATCH -t 7-00:00:00
#SBATCH --nodes=1         # nb of nodes    
#SBATCH --ntasks=8        # nb of MPI tasks
#SBATCH --cpus-per-task=4 #CPU core per MPI processes
#SBATCH --gpu-bind=closest
#SBATCH --mem=64G

and the PIConGPU options corresponding to this mapping are:

/lustre/rz/dbertini/gpu/data/lwfa_002/input/bin/picongpu -d 2 4 1 \
    -g 192 2048 240 \
    -s 4000 \
    -m --windowMovePoint 0.9 \
    --e_png.period 100 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX \
    --e_energyHistogram.period 100 --e_energyHistogram.binCount 1024 \
    --e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000 --e_energyHistogram.filter all \
    --e_phaseSpace.period 100 --e_phaseSpace.space y --e_phaseSpace.momentum py \
    --e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0 --e_phaseSpace.filter all \
    --e_macroParticlesCount.period 100 \
    --openPMD.period 100 --openPMD.file simData --openPMD.ext bp \
    --checkpoint.backend openPMD --checkpoint.period 100 \
    --versionOnce | tee output

Something seems to be wrong in the definition of this mapping. Any idea what could be wrong here?

sbastrakov commented 2 years ago

Hello @denisbertini .

So PIConGPU seems to start and then crash soon after initialization begins. However, particles are generated and the Debye length check involves a kernel, so it's not literally the first memory allocation or kernel launch that fails. Could you try increasing the reserved memory size here.

This error at this point sounds strangely familiar, but I couldn't find it right away.

denisbertini commented 2 years ago

I should change the line

constexpr size_t reservedGpuMemorySize = 350 * 1024 * 1024;

but to what value?

sbastrakov commented 2 years ago

This is the size that PIConGPU leaves free on each GPU. For the sake of testing, please try a very small grid size in your .cfg file and leave e.g. 1 GB per GPU, i.e. 1024 * 1024 * 1024.
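For reference, a minimal sketch of that change in memory.param (assuming the 1 GB suggested above; the exact default and file location can differ between versions):

// leave roughly 1 GiB per GPU unallocated by PIConGPU
constexpr size_t reservedGpuMemorySize = size_t(1024) * 1024 * 1024;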

sbastrakov commented 2 years ago

Note that you need to rebuild after changing that file, the same as for any other .param file.

sbastrakov commented 2 years ago

Also, your #SBATCH --mem=64G appears way too low. As far as I can see, it is for the whole node with 8 GPUs. Normally with PIConGPU, the allocated host memory should be at least the same size as the combined memory of all used GPUs on a node. In case there is relatively little host memory on the system, could you also try using fewer GPUs to check whether this is the problem.

sbastrakov commented 2 years ago

Now that I think of it, the 64 GB of host memory requested may be causing this issue.

We always allocate all GPU memory except reservedGpuMemorySize, regardless of the actual simulation size. And the host generally needs at least the same amount of memory per GPU, so that the host-device buffers can exist. So with 8 MI 100 GPUs one needs 8 x 32 GB of host memory, I guess? Or use fewer GPUs to match the host memory size.
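As a hedged illustration (assuming 32 GB per MI 100 and 8 GPUs per node; check the actual values for your nodes), the batch request could look like:

#SBATCH --gres=gpu:8
#SBATCH --ntasks=8
#SBATCH --mem=256G   # roughly 8 x 32 GB; alternatively --mem=0 requests all memory of the node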

bussmann commented 2 years ago

I may remember an early issue on AMD where it looked like only 50% of GPU memory could be allocated. Was it this one you remembered, @sbastrakov?

sbastrakov commented 2 years ago

Ah yes, that one! So now there are two independent things to investigate: the allocated host memory size and the device one.

denisbertini commented 2 years ago

So

#SBATCH -o pog_%j.out
#SBATCH -e pog_%j.err

You see that I added the allocation via `--gres=gpu:n_gpus` there and commented out some other options that I do not use for the GPU mapping definition.
Using such a definition I ran a job on our cluster (the `4.cfg` config is used):

JobName : lwfa_015 singularity
Submit : 2022-01-11 14:29:39 2022-01-11 14:29:41
Start : 2022-01-11 14:29:40 2022-01-11 14:29:41
End : Unknown Unknown
UserCPU : 00:00:00 00:00:00
TotalCPU : 00:00:00 00:00:00
JobID : 36453737 36453737.0
JobIDRaw : 36453737 36453737.0
JobName : lwfa_015 singularity
Partition : gpu
NTasks : 4
AllocCPUS : 64 4
Elapsed : 00:16:37 00:16:36
State : RUNNING RUNNING
ExitCode : 0:0 0:0
AveCPUFreq : 0
ReqCPUFreqMin : Unknown Unknown
ReqCPUFreqMax : Unknown Unknown
ReqCPUFreqGov : Unknown Unknown
ReqMem : 128Gn 128Gn
ConsumedEnergy : 0
AllocGRES : gpu:4 gpu:4
ReqGRES : gpu:0 gpu:0
ReqTRES : billing=4+
AllocTRES : billing=1+ cpu=4,gre+
TotalReqMem : 512 GB 512 GB


Here you see that the `4 GPUs` are allocated and that there are indeed 4 tasks and 64 allocated CPUs.
My questions:
- Does this output look correct to you?
- Should the options I commented out be added back?
- Is my assumption of 1 MPI task per GPU device correct?
- To run on multiple nodes, should I just increase the number of tasks now? If yes, I cannot use the `--gres` option anymore... is such an option relevant for `picongpu`?

denisbertini commented 2 years ago

Another question: how does one define the number of macroparticles per cell in PIConGPU, and what is the default?

denisbertini commented 2 years ago

Another problem: trying to run with the standard 4.cfg definition file, I got a crash in OpenMPI:

Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:20216] *** An error occurred in MPI_Isend
[lxbk1122:20216] *** reported by process [4080535318,3]
[lxbk1122:20216] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
[lxbk1122:20216] *** MPI_ERR_OTHER: known error not in list
[lxbk1122:20216] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1122:20216] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 36455560.0 ON lxbk1122 CANCELLED AT 2022-01-11T15:13:17 ***
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:20217] *** An error occurred in MPI_Isend
[lxbk1122:20217] *** reported by process [4080535318,1]
[lxbk1122:20217] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15

Any idea what the problem with the transport is?

denisbertini commented 2 years ago

For OpenMPI I used the recommended options:

# setup openMPI
export PMIX_MCA_gds=^ds21
export OMPI_MCA_io=^ompio
export OMPI_MCA_mpi_leave_pinned=0
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000

denisbertini commented 2 years ago

Forget about my last MPI noise, the proper openib settings just needed to be added!

denisbertini commented 2 years ago

Well, when I increase the grid size, I still get the OpenMPI crash:

Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:24195] *** An error occurred in MPI_Isend
[lxbk1122:24195] *** reported by process [3365471779,2]
[lxbk1122:24195] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
[lxbk1122:24195] *** MPI_ERR_OTHER: known error not in list
[lxbk1122:24195] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1122:24195] ***    and potentially your MPI job)

Is there a way to overcome this MPI limit?

sbastrakov commented 2 years ago

Thanks for the detailed description. Let me reply to your points separately.

> I changed reservedGpuMemorySize to 1G and rebuilt. Still the picongpu process runs out of memory.

So I think reservedGpuMemorySize is now cleared of suspicion and can be reverted to our default value.

> host memory to 128G instead of 64G and now it seems to work for 2 MPI processes with a relatively small grid size

Okay, so the host memory size may be the issue. As I mentioned above, the grid size should not matter for our memory allocation, since we try to take all but reservedGpuMemorySize for any simulation. I wanted to try a small size first just to not have to worry about this.

> picongpu/etc/spock-ornl since the hardware setup is more or less the same and the script uses sbatch as a scheduler. I modified the tpl file though in order to launch picongpu within a singularity container. And it works fine for me now.

Makes sense.

> (BTW we could add this to your list of setup examples in picongpu/etc. It could help users who want to use docker or singularity to run picongpu.)

We have some docs about docker here. But sure, the singularity use case should be documented as well, and perhaps some general info about using PIConGPU with containers can be added too.

> You see that I added the allocation via --gres=gpu:n_gpus there and commented out some other options that I do not use for the GPU mapping definition.

In my experience, SLURM can be configured differently on different systems, e.g. in which subset of its redundant set of variables is used. This is one of the reasons we try to isolate it in .tpl files. Of course, it comes at the price that the first user on a system has to figure it out and set it up.

> Is my assumption of 1 MPI task per GPU device correct?

Correct, this is the only mode we support (or, to be more precise, we run 1 MPI process per what is exposed as a GPU for the job). Your general output seems okay to me. I am no expert and generally, again, the right subset of SLURM variables needs to be figured out for each system.

> To run on multiple nodes, should I just increase the number of tasks now? If yes, I cannot use the --gres option anymore... is such an option relevant for picongpu?

Yes, increase the number of tasks; the number of nodes will be calculated from it by the contents of the .tpl file. I am not sure what the problem with --gres is, it specifies requirements per node. Again, what you have in the .tpl should already manage it properly, and if there is a problem we can adjust for it.

Could you also attach your current .tpl version so that we are on the same page?

sbastrakov commented 2 years ago

> Another question: how does one define the number of macroparticles per cell in PIConGPU, and what is the default?

It is set by the user. The naming and location are admittedly not obvious and can be improved. To use our LWFA example: when initializing a species, constructs like this are usually used. The second template parameter, in that case startPosition::Random2ppc, defines both the number of macroparticles per cell and how they are initially distributed inside a cell. We try to name this type accordingly, but it is merely an alias. It is defined in particle.param, for the LWFA example here. By changing numParticlesPerCell there you can control the ppc. There is also a doc page on macroparticle sampling here.
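As a rough sketch of the relevant particle.param fragment (names taken from the comment above, e.g. Random2ppc and numParticlesPerCell; the exact code differs between versions):

namespace startPosition
{
    struct RandomParameter2ppc
    {
        /** number of macroparticles per cell at initialization */
        static constexpr uint32_t numParticlesPerCell = 2u;
    };
    // alias used as the second template parameter when initializing the species
    using Random2ppc = RandomImpl<RandomParameter2ppc>;
} // namespace startPosition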

denisbertini commented 2 years ago

I just modified the famous "system dependent" SLURM variables again and now I am able to run PIConGPU with the full 8 GPUs on one node. It seems to run well if the grid size is adjusted so as not to trigger the OpenMPI crash I mentioned above. I also attach my current template .tpl: virgo.tpl.txt

denisbertini commented 2 years ago

BTW feel free to correct/change things in the template

sbastrakov commented 2 years ago

Could you also attach the .cfg file that triggers that OpenMPI error? So that we can figure out whether PIConGPU should be sending this amount of data at all.

sbastrakov commented 2 years ago

Just as a simple, but quick, thing to try: we normally only use export OMPI_MCA_io=^ompio and no other OpenMPI settings in profiles. Does the issue persist in that case?

denisbertini commented 2 years ago

The .cfg that triggers the crash in OpenMPI is the following:

# Copyright 2013-2021 Axel Huebl, Rene Widera, Felix Schmitt, Franz Poeschel
#
# This file is part of PIConGPU.
#
# PIConGPU is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PIConGPU is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PIConGPU.
# If not, see <http://www.gnu.org/licenses/>.
#

##
## This configuration file is used by PIConGPU's TBG tool to create a
## batch script for PIConGPU runs. For a detailed description of PIConGPU
## configuration files including all available variables, see
##
##                      docs/TBG_macros.cfg
##

#################################
## Section: Required Variables ##
#################################

TBG_wallTime="2:00:00"

TBG_devices_x=2
TBG_devices_y=4
TBG_devices_z=1

#TBG_gridSize="192 2048 160"
TBG_gridSize="192 1024 160"
TBG_steps="4000"

# leave TBG_movingWindow empty to disable moving window
TBG_movingWindow="-m --windowMovePoint 0.9"

#################################
## Section: Optional Variables ##
#################################

# png image output (rough electron density and laser preview)
TBG_pngYX="--e_png.period 100                     \
           --e_png.axis yx --e_png.slicePoint 0.5 \
           --e_png.folder pngElectronsYX"

# energy histogram (electrons, [keV])
TBG_e_histogram="--e_energyHistogram.period 100    \
                 --e_energyHistogram.binCount 1024 \
                 --e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000 \
                 --e_energyHistogram.filter all"

# longitudinal phase space (electrons, [m_e c])
TBG_e_PSypy="--e_phaseSpace.period 100                         \
             --e_phaseSpace.space y --e_phaseSpace.momentum py \
             --e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0    \
             --e_phaseSpace.filter all"

TBG_openPMD="--openPMD.period 100   \
            --openPMD.file simData \
            --openPMD.ext bp \
            --checkpoint.backend openPMD \
            --checkpoint.period 100
            --checkpoint.restart.backend openPMD"

# macro particle counter (electrons, debug information for memory)
TBG_e_macroCount="--e_macroParticlesCount.period 100"

TBG_plugins="!TBG_pngYX                    \
             !TBG_e_histogram              \
             !TBG_e_PSypy                  \
             !TBG_e_macroCount             \
             !TBG_openPMD"

#################################
## Section: Program Parameters ##
#################################

TBG_deviceDist="!TBG_devices_x !TBG_devices_y !TBG_devices_z"

TBG_programParams="-d !TBG_deviceDist \
                   -g !TBG_gridSize   \
                   -s !TBG_steps      \
                   !TBG_movingWindow  \
                   !TBG_plugins       \
                   --versionOnce"

# TOTAL number of devices
TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))"

"$TBG_cfgPath"/submitAction.sh

I just commented out the grid size generating the crash and reduced the Y dimension by a factor of 2 to overcome the OpenMPI limitation.

denisbertini commented 2 years ago

The output of the simulation using the modified 8.cfg seems to be ok:

Running program...
PIConGPU: 0.7.0-dev
  Build-Type: Release

Third party:
  OS:         Linux-3.10.0-1160.31.1.el7.x86_64
  arch:       x86_64
  CXX:        Clang (13.0.0)
  CMake:      3.20.5
  Boost:      1.75.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (4.0.3)
  PNGwriter:  0.7.0
  openPMD:    0.14.3
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00229 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.0247974)
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 7864320
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 117120 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 117120 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0 keV and corresponding Debye length 0 m.
   The grid has 0 cells per average Debye length
initialization time:  4sec 332msec = 4.332 sec
  0 % =        0 | time elapsed:            17sec 449msec | avg time per step:   0msec
  5 % =      200 | time elapsed:            31sec 758msec | avg time per step:  14msec
 10 % =      400 | time elapsed:            58sec 329msec | avg time per step:  13msec
 15 % =      600 | time elapsed:       1min 26sec 395msec | avg time per step:  16msec
 20 % =      800 | time elapsed:       1min 56sec 448msec | avg time per step:  19msec
 25 % =     1000 | time elapsed:       2min 26sec 663msec | avg time per step:  28msec
 30 % =     1200 | time elapsed:       2min 59sec 763msec | avg time per step:  23msec
 35 % =     1400 | time elapsed:       3min 29sec 455msec | avg time per step:  25msec
 40 % =     1600 | time elapsed:       4min  1sec 669msec | avg time per step:  25msec
 45 % =     1800 | time elapsed:       4min 33sec 885msec | avg time per step:  27msec
 50 % =     2000 | time elapsed:       5min  6sec 221msec | avg time per step:  33msec
 55 % =     2200 | time elapsed:       5min 38sec 681msec | avg time per step:  29msec
 60 % =     2400 | time elapsed:       6min 12sec 680msec | avg time per step:  28msec
 65 % =     2600 | time elapsed:       6min 44sec 568msec | avg time per step:  26msec
 70 % =     2800 | time elapsed:       7min 14sec 657msec | avg time per step:  22msec
 75 % =     3000 | time elapsed:       7min 44sec 405msec | avg time per step:  28msec
 80 % =     3200 | time elapsed:       8min 12sec 640msec | avg time per step:  22msec
 85 % =     3400 | time elapsed:       8min 41sec 251msec | avg time per step:  21msec
 90 % =     3600 | time elapsed:       9min 17sec 513msec | avg time per step:  21msec
 95 % =     3800 | time elapsed:       9min 46sec 544msec | avg time per step:  21msec
100 % =     4000 | time elapsed:      10min 14sec 462msec | avg time per step:  20msec
calculation  simulation time: 10min 26sec 649msec = 626.649 sec
full simulation time: 10min 31sec 765msec = 631.765 sec

Since I do not have any performance reference: is a wall time of 10 min 31 sec OK for such a simulation?

sbastrakov commented 2 years ago

Thanks, I will now do some back-of-the-envelope estimates of how the communication should happen.

For performance, I do not have a reference in my head for LWFA. You can try running our benchmark setup, for which we have an idea of how it should perform.

denisbertini commented 2 years ago

OK, thanks! Do you also know whether there is some documentation related to this LWFA example?

sbastrakov commented 2 years ago

If you mean specifically for LWFA, there is only a small doc page here. In case there are some physics questions, my colleagues could help (I am a computer scientist).

denisbertini commented 2 years ago

OK thanks a lot !

denisbertini commented 2 years ago

BTW this is exactly the hardware setup we have, just with double the GPUs/RAM/cores. I was trying to use it to find out the optimal options for SLURM, but if you can help, I would appreciate it!

sbastrakov commented 2 years ago

So for that .cfg file there is no way PIConGPU should be attempting to send a message of size 1223060560. It could be the result of some error in PIConGPU, an issue in OpenMPI, or a misreported issue. With all 3, it's weird that we never saw it before. To investigate further, you could rebuild PIConGPU in debug mode as described here, with 127 for PIC_VERBOSE and PMACC_VERBOSE, run, and attach stdout and stderr. Then we may be able to see the message sizes PIConGPU requested to send.
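A hedged sketch of such a rebuild (assuming pic-build forwards extra CMake flags via -c; the verbosity flag names are taken from the comment above, so double-check them against the debug docs):

# rebuild with maximum PIConGPU/PMacc log verbosity
pic-build -c "-DCMAKE_BUILD_TYPE=Debug -DPIC_VERBOSE=127 -DPMACC_VERBOSE=127"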

sbastrakov commented 2 years ago

> BTW this is exactly the hardware setup we have, just with double the GPUs/RAM/cores. I was trying to use it to find out the optimal options for SLURM, but if you can help, I would appreciate it!

Ideally, your system documentation or admins should list the recommended ways of submitting jobs. We normally start from there when setting up PIConGPU on a new system, and then adjust / make support tickets when something does not work (of course, depending on the IT infrastructure and workforce). In case there is none, the linked docs of a similar system are a good start. I think generally, if one has some working configuration that allows running jobs, it is then most reasonable to make sure MPI and all needed dependencies (openPMD API etc.) work fine. The .tpl file can be refined later as well.

denisbertini commented 2 years ago

Sure, and I think I will discover more things along the way.

denisbertini commented 2 years ago

BTW, what is the procedure to run the benchmark tests? Like any other example, i.e. using pic-create and pic-build?

sbastrakov commented 2 years ago

Yes, it is just another example, but made specifically for performance measurements
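For reference, the usual workflow looks roughly like this (the benchmark directory name is a placeholder; pick the actual one shipped with your PIConGPU source, and reuse your existing .tpl):

# create an input set from a shipped example, build it, then submit via tbg
pic-create $PIC_EXAMPLES/<BenchmarkExample> $WORKDIR/picInputs/benchmark
cd $WORKDIR/picInputs/benchmark
pic-build
tbg -s sbatch -c etc/picongpu/1.cfg -t etc/picongpu/virgo-gsi/virgo.tpl $SCRATCH/benchmark_001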

denisbertini commented 2 years ago

Ok thanks!

denisbertini commented 2 years ago

@sbastrakov running the benchmarks with config 1.cfg I got the following results:

initialization time:  1sec 312msec = 1.312 sec
  0 % =        0 | time elapsed:                   74msec | avg time per step:   0msec
  5 % =       50 | time elapsed:             1sec 778msec | avg time per step:  34msec
 10 % =      100 | time elapsed:             3sec 543msec | avg time per step:  35msec
 15 % =      150 | time elapsed:             5sec 275msec | avg time per step:  34msec
 20 % =      200 | time elapsed:             6sec 984msec | avg time per step:  34msec
 25 % =      250 | time elapsed:             8sec 666msec | avg time per step:  33msec
 30 % =      300 | time elapsed:            10sec 363msec | avg time per step:  33msec
 35 % =      350 | time elapsed:            12sec   9msec | avg time per step:  32msec
 40 % =      400 | time elapsed:            13sec 615msec | avg time per step:  32msec
 45 % =      450 | time elapsed:            15sec 260msec | avg time per step:  32msec
 50 % =      500 | time elapsed:            17sec 109msec | avg time per step:  36msec
 55 % =      550 | time elapsed:            18sec 870msec | avg time per step:  34msec
 60 % =      600 | time elapsed:            20sec 604msec | avg time per step:  34msec
 65 % =      650 | time elapsed:            22sec 333msec | avg time per step:  34msec
 70 % =      700 | time elapsed:            24sec  45msec | avg time per step:  34msec
 75 % =      750 | time elapsed:            25sec 784msec | avg time per step:  34msec
 80 % =      800 | time elapsed:            27sec 458msec | avg time per step:  33msec
 85 % =      850 | time elapsed:            29sec 152msec | avg time per step:  33msec
 90 % =      900 | time elapsed:            30sec 829msec | avg time per step:  33msec
 95 % =      950 | time elapsed:            32sec 523msec | avg time per step:  33msec
100 % =     1000 | time elapsed:            34sec 203msec | avg time per step:  33msec
calculation  simulation time: 34sec 215msec = 34.215 sec
full simulation time: 35sec 615msec = 35.615 sec

denisbertini commented 2 years ago

The other config, 1_radiation.cfg, uses an unrecognised option:

unrecognised option '--e_radiation.period'

sbastrakov commented 2 years ago

34 seconds is reasonable, that should be about 1 ns per particle update. So performance-wise on 1 GPU all seems good on your side.

1_radiation.cfg requires the radiation plugin to run. The plugin is conditionally enabled if you have a supported openPMD API version with the HDF5 backend. Currently it is the only remaining plugin requiring a specific backend. So could it be that you have the openPMD API (I assume so, since you ran LWFA before; edit: I now see you have it in the previously attached logs) but with the ADIOS backend? Merely for testing purposes, I think the first run is sufficient, however.

denisbertini commented 2 years ago

No, up to now I did not use the ADIOS plugin for the IO. If the first benchmark is enough to fully test, then it is enough for me!

steindev commented 2 years ago

@denisbertini Sorry for the late reply. Do I understand correctly that you still cannot run on more than two GPUs? Could you post the tbg/submit.start from the simulation directory of one of your crashing simulations? Furthermore, where does your system differ from Spock (https://docs.olcf.ornl.gov/systems/spock_quick_start_guide.html#spock-compute-nodes)? (number of cores, number of GPUs, host memory?)

Also, set the reserved memory in memory.param to 2 GiB. We found on Spock that simulations with lower values crash often, while this value works most of the time.

Also, when you experience crashes, try to run without openPMD output, as it requires lots of host memory. Yet I doubt that this is your problem with the LWFA example.

denisbertini commented 2 years ago

I can now run on more than 2 GPUs. My last try was actually running PIConGPU on 16 nodes, each having 8 GPUs, and it works. I already set the reserved memory to 1 GiB in memory.param.

Very interesting fact though: indeed I noticed that when I switch on the openPMD output, the program crashes or even gets stuck and I need to kill it manually. I also have to say that I increased the grid size of the standard 16.cfg. I now have the following setup:

TBG_wallTime="5:00:00"

TBG_devices_x=4
TBG_devices_y=8
TBG_devices_z=4

TBG_gridSize="512 1536 512"

Which of course will dump much more data to output than the standard simulation.

I am not sure the problem is linked to memory. If memory usage were too high, the system would have killed the corresponding processes and SLURM would have returned an OUT_OF_MEMORY error status. This is not the case here. For example, sometimes the program crashes with errors like the following coming from MPI:

picongpu: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.

pointing to a possible race condition. Reducing the number of OpenMP threads reduces the occurrence of this error.

When I run without openPMD output, everything runs stably. The question then is: how can one work without dumping the full field and particle data? Is there a way to avoid the full dumps in PIConGPU, e.g. with other plugins?

steindev commented 2 years ago

In order to assess the problem with openPMD, please do not skip the following questions: Could you post the tbg/submit.start from the simulation directory of one of your crashing simulations? Furthermore, where does your system differ from Spock (https://docs.olcf.ornl.gov/systems/spock_quick_start_guide.html#spock-compute-nodes)? (number of cores, number of GPUs, host memory?)

Simulations without full output are often possible, using e.g. the phase space plugin or tracer and probe particles to study particle and field evolution. See the manual (https://picongpu.readthedocs.io/en/0.6.0/usage/plugins/phaseSpace.html#usage-plugins-phasespace), (https://picongpu.readthedocs.io/en/0.6.0/usage/workflows/tracerParticles.html) and (https://picongpu.readthedocs.io/en/0.6.0/usage/workflows/probeParticles.html), respectively; a sketch of a reduced plugin list is shown below. However, if you want to create checkpoints as saves for long-running simulations, or to restart simulations from an intermediate step with possibly altered parameters from that step on, you will need the capability to write full simulation output.
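As a sketch, dropping the full dumps in the .cfg shown earlier only requires removing !TBG_openPMD from the plugin list (this also drops the checkpoint options, so re-add them whenever restarts are needed):

# reduced output: keep only the lightweight plugins
TBG_plugins="!TBG_pngYX        \
             !TBG_e_histogram  \
             !TBG_e_PSypy      \
             !TBG_e_macroCount"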

denisbertini commented 2 years ago

OK, sorry! Let me try to answer your questions. Differences from Spock are minor:

denisbertini commented 2 years ago

And the submit.start script I am using is the following:

#!/bin/bash
# Copyright 2013-2021 Axel Huebl, Richard Pausch, Rene Widera, Sergei Bastrakov, Klaus Steinger
#
# This file is part of PIConGPU.
#
# PIConGPU is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PIConGPU is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PIConGPU.
# If not, see <http://www.gnu.org/licenses/>.
#

# PIConGPU batch script for spock's SLURM batch system

#SBATCH --partition=gpu
#SBATCH --time=5:00:00
#SBATCH --job-name=lwfa_002
#SBATCH --nodes=16                           # Nb of nodes
#SBATCH --ntasks=128                          # Nb of MPI tasks
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=12           # CPU Cores per MPI process
#SBATCH --mem=0                                   # Requested Total Job Memory / Node  
#SBATCH --mem-per-gpu=64000000
#SBATCH --gpu-bind=closest 
#SBATCH --mail-type=NONE
#SBATCH --mail-user=d.bertini@gsi.de
#SBATCH --chdir=/lustre/rz/dbertini/gpu/data/lwfa_002
#SBATCH -o pog_%j.out
#SBATCH -e pog_%j.err
## calculations will be performed by tbg ##

# settings that can be controlled by environment variables before submit

# number of available/hosted devices per node in the system

# host memory per device

# number of CPU cores to block per GPU
# we have 12 CPU cores per GPU (96cores/8gpus ~ 12cores)
#.TBG_coresPerGPU=16

# Assign one OpenMP thread per available core per GPU (=task)
#export OMP_NUM_THREADS=12
export OMP_NUM_THREADS=1

# required GPUs per node for the current job

# We only start 1 MPI task per device

# use ceil to caculate nodes

## end calculations ##

echo 'Running program...'

cd /lustre/rz/dbertini/gpu/data/lwfa_002

export MODULES_NO_OUTPUT=1
source /lustre/rz/dbertini/gpu/picongpu.profile
if [ $? -ne 0 ] ; then
  echo "Error: PIConGPU environment profile under \"/lustre/rz/dbertini/gpu/picongpu.profile\" not found!"
  exit 1
fi
unset MODULES_NO_OUTPUT

# set user rights to u=rwx;g=r-x;o=---
umask 0027

echo "creating simOutput directory ... with delay"
mkdir simOutput 2> /dev/null
sleep 2
cd simOutput
# Compilers
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/bin/hipcc

# Main environment variables
export PICHOME=/lustre/rz/dbertini/gpu/picongpu
export PICSRC=$PICHOME
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PATH=$PICSRC:$PATH
export PATH=$PICSRC/bin:$PATH
export PATH=$PICSRC/src/tools/bin:$PATH
export PYTHONPATH=$PICSRC/lib/python:$PYTHONPATH

export ADIOS2_DIR=/opt/adios/2.7.1
export openPMD_DIR=/opt/openPMD-api/0.14.3/
export PNGwriter_DIR=/opt/pngwriter/0.7.0/
export ISAAC_DIR=/opt/isaac/1.5.2/

# output data
export SCRATCH=/lustre/rz/dbertini/gpu/data
export WORKDIR=/lustre/rz/dbertini/gpu

### environment
export PATH=/usr/local/bin:$PATH
export PATH=/opt/rocm/bin:$PATH

# ## picongpu dependencies
export LD_LIBRARY_PATH=/opt/adios/1.13.1/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/boost/1.75.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/hdf5/1.10.7/lib:$LD_LIBRARY_PATH

export LD_LIBRARY_PATH=/opt/icet/2.9.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/isaac/1.5.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/janson/2.9.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/libsplash/1.7.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/libpngwriter/0.7.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/openPMD-api/0.14.3/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH

# add  necessary rocm-libs part
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
# add adios 2
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/adios/2.7.1/bin:$PATH
# add  necessary rocm-libs part
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
# add adios 2
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/adios/2.7.1/bin:$PATH

export CPLUS_INCLUDE_PATH=/opt/adios/1.13.1/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/adios/2.7.1/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/boost/1.75.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/hdf5/1.10.7/include:$CPLUS_INCLUDE_PATH

export CPLUS_INCLUDE_PATH=/opt/icet/2.9.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/isaac/1.5.2/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/janson/2.9.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/libsplash/1.7.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/pngwriter/0.7.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/openPMD-api/0.14.3/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/rocm/include:$CPLUS_INCLUDE_PATH

export PATH=/opt/adios/1.13.1/bin:$PATH
export PATH=/opt/hdf5/1.10.7/bin:$PATH
export PATH=/opt/libsplash/1.7.0/bin:$PATH
export PATH=/opt/openPMD-api/0.14.3/bin:$PATH

## add hip+adios
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
export PATH=/opt/adios/2.7.1/bin:$PATH

# setup openMPI
export PMIX_MCA_gds=^ds21
#export OMPI_MCA_io=^ompio

export OMPI_MCA_io=romio321
export ROMIO_HINTS=./my_romio_hints
cat << EOF > ./my_romio_hints
romio_cb_write enable
romio_ds_write enable
cb_buffer_size 16777216
cb_nodes 16
EOF

export OMPI_MCA_mpi_leave_pinned=0
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000

export OPENPMD_BP_BACKEND=ADIOS2

cat > /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh <<EOF
#!/bin/bash

# Compilers
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/bin/hipcc

# Main environment variables
export PICHOME=/lustre/rz/dbertini/gpu/picongpu
export PICSRC=$PICHOME
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PATH=$PICSRC:$PATH
export PATH=$PICSRC/bin:$PATH
export PATH=$PICSRC/src/tools/bin:$PATH
export PYTHONPATH=$PICSRC/lib/python:$PYTHONPATH
export OPENPMD_BP_BACKEND=ADIOS2

export ADIOS2_DIR=/opt/adios/2.7.1
export openPMD_DIR=/opt/openPMD-api/0.14.3/
export PNGwriter_DIR=/opt/pngwriter/0.7.0/
export ISAAC_DIR=/opt/isaac/1.5.2/

# output data
export SCRATCH=/lustre/rz/dbertini/gpu/data
export WORKDIR=/lustre/rz/dbertini/gpu

## add hip+adios
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
export PATH=/opt/adios/2.7.1/bin:$PATH

## MPI
#export OMPI_MCA_io=^ompio
export OMPI_MCA_io=romio321

export OMPI_MCA_mpi_leave_pinned=0

export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000

/lustre/rz/dbertini/gpu/data/lwfa_002/input/bin/picongpu   -d 4 8 4                    -g 512 1536 512                      -s 1000                         -m --windowMovePoint 0.9                     --e_png.period 100 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX                                 --e_png.period 100 --e_png.axis yz --e_png.slicePoint 0.5 --e_png.folder pngElectronsYZ                                 --e_phaseSpace.period 100                                      --e_phaseSpace.space y --e_phaseSpace.momentum py              --e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0                 --e_phaseSpace.filter all                               --e_energyHistogram.period 100                     --e_energyHistogram.binCount 1024                  --e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000                  --e_energyHistogram.filter all                           --openPMD.period 100                --openPMD.file simData              --openPMD.ext bp               --checkpoint.backend openPMD              --checkpoint.period 100              --checkpoint.restart.backend openPMD                               --e_macroParticlesCount.period 100                          --versionOnce

EOF
chmod +x /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh 

#if [ -d /lustre/rz/dbertini/gpu/data/lwfa_002/simOutput ]; then
#    echo " SimOutput does not exist! ... exiting"
#    exit 1 
#fi

if [ $? -eq 0 ] ; then
  # Run PIConGPU from within the singularity container ?
  # srun -K1 -vvvvsingularity exec --bind /cvmfs --rocm $WORKDIR/sifs/picongpu.sif  /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh
  srun -K1  singularity exec --bind /cvmfs --rocm $WORKDIR/sifs/picongpu.sif  /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh 
fi   

#this script was created with call cd /lustre/rz/dbertini/gpu/picInputs/myLWFA; /lustre/rz/dbertini/gpu/picongpu/bin/tbg -s sbatch -c etc/picongpu/16.cfg -t etc/picongpu/virgo-gsi/virgo.tpl /lustre/rz/dbertini/gpu/data/lwfa_002

steindev commented 2 years ago

The problem regarding crashes with openPMD output enabled is probably still related to memory, as writing data requires quite some extra host memory. In general, you need at least twice the amount of memory on the host that your simulation requires on the GPU. That is, your simulation setup should consume no more than 512 GiB / 2 / 8 = 32 GiB per GPU. To be on the safe side, when setting up simulations, make sure you do not require more than 28 GiB per GPU.

Furthermore, you need to configure the ADIOS2 lib used during output in order to stay close to that 'twice the GPU memory' number and not require significantly more. In order to do so, use the following in your simulation's *.cfg (https://gist.github.com/steindev/0ea04341c96ef068a1e78a353763c521) and set \"InitialBufferSize\": \"28GB\" in line 20. In this snippet, infix is not relevant and can be changed, see the docs. Adjust the period to your liking, of course.

There is one more point. If you look closely into the TBG_ADIOS2_CONFIGURATION variable, you see that an operator of type blosc is applied to the dataset. That defines the compressor used during output. Do you have c-blosc (https://github.com/Blosc/c-blosc/tree/v1.21.1) in version 1.21.1 installed? If not, install it, as I believe the standard compressors in ADIOS2 do not like to compress datasets larger than 4 GiB, which we certainly have in PIConGPU. So c-blosc is required, and not using it may be the source of the error you experience.
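The gist itself is not reproduced here; as a hedged sketch of the shape such a configuration takes (the JSON keys follow the openPMD-api ADIOS2 backend layout, and the blosc parameters below are illustrative assumptions), the plugin options would look roughly like:

--openPMD.period 100 --openPMD.file simData --openPMD.ext bp \
--openPMD.json '{"adios2": {"engine": {"parameters": {"InitialBufferSize": "28GB"}}, "dataset": {"operators": [{"type": "blosc", "parameters": {"compressor": "zstd", "clevel": "1"}}]}}}'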

Apart from this, I recommend setting in memory.param

constexpr size_t reservedGpuMemorySize = uint64_t(2147483648); // 2 GiB

as I have still experienced numerous errors on AMD MI100 with a smaller value. (We do know in more detail what the source of this error is, and a bug report has been filed to AMD at least 3/4 of a year ago, but they don't investigate it. We don't know why... :unamused:)

denisbertini commented 2 years ago

Thanks a lot for all the detailed information! I will work through all the improvements you proposed step by step. Is there a reference or link to the AMD MI 100 bug report you are quoting? No, I do not have c-blosc installed.

denisbertini commented 2 years ago

Can you explain the factor 2 in your memory calculation: 512 GiB / 2 / 8 = 32 GiB?

denisbertini commented 2 years ago

Another question: how does one control the memory used by a simulation setup? Using the PIConGPU memory calculator?

sbastrakov commented 2 years ago

The factor 2 is due to openPMD output. I forgot to take it into account in my earlier messages in this issue. The factor may be less than 2 actually, but 2 should definitely be safe.

sbastrakov commented 2 years ago

Controlling memory usage: yes. You need to know, or have an upper estimate of, the number of macroparticles per cell. Knowing this number and the grid size allows estimating the memory usage. It can be done on paper or with our memory calculator.
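A hedged back-of-the-envelope example for the 16.cfg setup above (grid 512 x 1536 x 512 on 4 x 8 x 4 devices; the bytes-per-macroparticle value is a rough assumption, not a PIConGPU constant):

# particles-only estimate, ignoring fields, ghost layers and plugin buffers
cells_per_gpu=$(( (512/4) * (1536/8) * (512/4) ))   # 128 * 192 * 128 = 3145728 cells
ppc=2                                                # macroparticles per cell (see particle.param)
bytes_per_macroparticle=64                           # rough assumption
echo "$(( cells_per_gpu * ppc * bytes_per_macroparticle / 1024 / 1024 )) MiB of particle data per GPU"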

steindev commented 2 years ago

> Is there a reference or link to the AMD MI 100 bug report you are quoting? No, I do not have c-blosc installed.

No, nothing public. It is within a closed workspace that we (AMD/HPE/HZDR/OLCF) share during the CAAR project.

> Another question: how does one control the memory used by a simulation setup? Using the PIConGPU memory calculator?

Adding to @sbastrakov's answer: keep in mind that your initial particle distribution should not already fill the 28 GiB, otherwise there is no space left for clusters of particles, such as the bunch that forms in LWFA.

denisbertini commented 2 years ago

For the memory calculation I will need to know how many macroparticles will be used by the simulation. Where do I get this info? In the .param files?