Closed. denisbertini closed this issue 1 year ago.
Hello @denisbertini.
So PIConGPU seems to be starting and crashing soon after initialization begins. However, particles are generated and the Debye length check involves a kernel, so it's not literally the first memory allocation or kernel launch that fails. Could you try increasing the reserved memory size here?
This error at this point sounds strangely familiar, but I couldn't find it right away.
I should change the line
constexpr size_t reservedGpuMemorySize = 350 * 1024 * 1024;
but to what value?
This is the size that PIConGPU leaves free on each GPU. For the sake of testing, please try a very small grid size in your .cfg file and leave e.g. 1 GB per GPU, i.e. `1024 * 1024 * 1024`.
Note that you need to rebuild after changing that file, same as for any other .param file.
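For concreteness, the suggested change would look like this in memory.param (a sketch; only the edited definition is shown, and the exact surrounding file content depends on the PIConGPU version):

```cpp
#include <cstddef>

// memory.param (sketch): leave 1 GiB free per GPU instead of the default,
// i.e. replace 350 * 1024 * 1024 with:
constexpr std::size_t reservedGpuMemorySize =
    std::size_t(1024) * 1024 * 1024; // 1 GiB
```

Remember to rerun pic-build afterwards, as with any .param change.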
Also, your #SBATCH --mem=64G appears way too low. As far as I can see, it applies to the whole node with 8 GPUs. Normally with PIConGPU, the allocated host memory should be at least the combined memory of all GPUs used on a node. If there is relatively little host memory on the system, could you also try using fewer GPUs to check whether that is the problem.
Now that I think of it, the 64 GB host memory request may be causing this issue.
We always allocate all GPU memory except reservedGpuMemorySize, regardless of the actual simulation size. And the host generally needs at least the same amount of memory per GPU, so that the host-device buffers can exist. So with 8 MI100 GPUs one needs 8 x 32 GB = 256 GB of host memory, I guess? Or use fewer GPUs to match the host memory size.
I vaguely remember an early issue on AMD where it looked like only 50% of the GPU memory could be allocated. Was it this one you remembered, @sbastrakov?
Ah yes, that one! So now there are two independent things to investigate: the allocated host memory size and the device one.
So I changed reservedGpuMemorySize to 1G and rebuilt; the picongpu process still ran out of memory. Then I increased the host memory to 128G instead of 64G, and now it seems to work for 2 MPI processes with a relatively small grid size of 48 96 48.
But I am still confused by the GPU mapping for PIConGPU. I adapted the template I found in picongpu/etc/spock-ornl, since the hardware setup is more or less the same and the script uses sbatch as a scheduler. I modified the .tpl file, though, in order to launch picongpu within a Singularity container. And it works fine for me now.
(BTW, you could add this to your list of setup examples in picongpu/etc. It could help users who want to use Docker or Singularity to run picongpu.)
So now I can run with different configurations, 1.cfg and 2.cfg, and I also tried 4.cfg, all on one node for the moment. My sbatch definition in my TBG template .tpl is the following:
#SBATCH --partition=!TBG_queue
#SBATCH --time=!TBG_wallTime
# Sets batch job's name
#SBATCH --job-name=!TBG_jobName
#SBATCH --nodes=!TBG_nodes # Nb of nodes
#SBATCH --ntasks=!TBG_tasks # Nb of MPI tasks
#SBATCH --gres=gpu:!TBG_tasks
# SBATCH --ntasks-per-node=!TBG_devicesPerNode
# #SBATCH --mincpus=!TBG_mpiTasksPerNode
# #SBATCH --cpus-per-task=!TBG_coresPerGPU # CPU Cores per MPI process
#SBATCH --mem=128G # Requested Total Job Memory / Node
# #SBATCH --mem-per-gpu=!TBG_memPerDevice
# #SBATCH --gpu-bind=closest
#SBATCH --mail-type=!TBG_mailSettings
#SBATCH --mail-user=!TBG_mailAddress
#SBATCH --chdir=!TBG_dstPath
You see that I added the allocation via `--gres=gpu:n_gpus` and commented out some other options that I do not use for the GPU mapping definition.
Using such a definition, I ran this job on our cluster (the `4.cfg` config is used):
JobName : lwfa_015 singularity
Submit : 2022-01-11 14:29:39 2022-01-11 14:29:41
Start : 2022-01-11 14:29:40 2022-01-11 14:29:41
End : Unknown Unknown
UserCPU : 00:00:00 00:00:00
TotalCPU : 00:00:00 00:00:00
JobID : 36453737 36453737.0
JobIDRaw : 36453737 36453737.0
JobName : lwfa_015 singularity
Partition : gpu
NTasks : 4
AllocCPUS : 64 4
Elapsed : 00:16:37 00:16:36
State : RUNNING RUNNING
ExitCode : 0:0 0:0
AveCPUFreq : 0
ReqCPUFreqMin : Unknown Unknown
ReqCPUFreqMax : Unknown Unknown
ReqCPUFreqGov : Unknown Unknown
ReqMem : 128Gn 128Gn
ConsumedEnergy : 0
AllocGRES : gpu:4 gpu:4
ReqGRES : gpu:0 gpu:0
ReqTRES : billing=4+
AllocTRES : billing=1+ cpu=4,gre+
TotalReqMem : 512 GB 512 GB
Here you see that the `4 GPUs` are allocated and that there are indeed 4 tasks and 64 allocated CPUs.
My questions:
- Does this output look correct to you?
- Should the options I commented out be added?
- Is my assumption of 1 MPI task per GPU device correct?
- To run on multiple nodes, should I now just increase the number of tasks? If yes, I cannot use the `--gres` option anymore... is such an option relevant for `picongpu`?
Another question: how does one define the macroparticles-per-cell number in picongpu, and what is the default?
Another problem: trying to run with the standard 4.cfg definition file, I got a crash in OpenMPI:
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:20216] *** An error occurred in MPI_Isend
[lxbk1122:20216] *** reported by process [4080535318,3]
[lxbk1122:20216] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
[lxbk1122:20216] *** MPI_ERR_OTHER: known error not in list
[lxbk1122:20216] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1122:20216] *** and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 36455560.0 ON lxbk1122 CANCELLED AT 2022-01-11T15:13:17 ***
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:20217] *** An error occurred in MPI_Isend
[lxbk1122:20217] *** reported by process [4080535318,1]
[lxbk1122:20217] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
Any idea what the problem with the transport is?
For OpenMPI I used the recommended options:
# setup openMPI
export PMIX_MCA_gds=^ds21
export OMPI_MCA_io=^ompio
export OMPI_MCA_mpi_leave_pinned=0
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000
Forget about my last MPI noise, the proper openib settings just needed to be added!
Well, when I increase the grid size, I still get the OpenMPI crash:
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:24195] *** An error occurred in MPI_Isend
[lxbk1122:24195] *** reported by process [3365471779,2]
[lxbk1122:24195] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
[lxbk1122:24195] *** MPI_ERR_OTHER: known error not in list
[lxbk1122:24195] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1122:24195] *** and potentially your MPI job)
Is there a way to overcome this MPI limit?
Thanks for the detailed description. Let me reply to your points separately.
I changed reservedGpuMemorySize to 1G and rebuilt; the picongpu process still ran out of memory.
So I think reservedGpuMemorySize is now cleared of suspicion and can be reverted to our default value.
I increased the host memory to 128G instead of 64G, and now it seems to work for 2 MPI processes with a relatively small grid size
Okay, so the host memory size may be the issue. As I mentioned above, the grid size should not matter for our memory allocation, since we try to take all but reservedGpuMemorySize for any simulation. I wanted to try a small size first just to not worry about this.
picongpu/etc/spock-ornl, since the hardware setup is more or less the same and the script uses sbatch as a scheduler. I modified the .tpl file, though, in order to launch picongpu within a Singularity container. And it works fine for me now.
Makes sense.
(BTW, you could add this to your list of setup examples in picongpu/etc. It could help users who want to use Docker or Singularity to run picongpu.)
We have some docs about Docker here. But sure, the Singularity use case should be documented as well, and perhaps also some general info about using PIConGPU with containers can be added.
You see that I added the allocation via --gres=gpu:n_gpus and commented out some other options that I do not use for the GPU mapping definition.
In my experience, SLURM can be configured differently on different systems, e.g. in which subset of its redundant set of variables is used. This is one of the reasons we try to isolate it in .tpl files. Of course, it comes at a price: the first user on a system has to figure it out and set it up.
Is my assumption of 1 MPI task per GPU device correct
Correct, this is the only mode we support (or, to be more precise, we run 1 MPI process per what is exposed as a GPU to the job). Your general output seems okay to me. I am no expert and, again, the right subset of SLURM variables needs to be figured out for each system.
To run on multiple nodes, should I just increase the number of tasks? If yes, I cannot use the --gres option anymore... is such an option relevant for picongpu?
Yes, increase the number of tasks; the number of nodes will be calculated from it by the contents of the .tpl file. I am not sure what the problem with --gres is, as it specifies requirements per node. Again, what you have in the .tpl should already manage it properly, and if there is a problem we can adjust for it.
Could you also attach your current .tpl version so that we are on the same page?
Another question: how does one define the macroparticles-per-cell number in picongpu, and what is the default?
It is set by the user. The naming and location are admittedly not obvious and can be improved. To use our LWFA example: when initializing a species, constructs like this are usually used. The second template parameter, in that case startPosition::Random2ppc, defines both the number of macroparticles per cell and how they are initially distributed inside a cell. We try to name this type accordingly, but it is merely an alias. It is defined in particle.param, for the LWFA example here. By changing numParticlesPerCell there, you can control the ppc. There is also a doc page on macroparticle sampling here.
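As a sketch of what that looks like in particle.param for the LWFA example (the alias definition is illustrative only; the exact code differs between PIConGPU versions):

```cpp
#include <cstdint>

// particle.param (sketch)
namespace startPosition
{
    // number of macroparticles placed into each cell at initialization
    constexpr uint32_t numParticlesPerCell = 2u;

    // the alias mentioned above: "Random" in-cell position sampling with
    // numParticlesPerCell == 2, hence the name Random2ppc
    // using Random2ppc = RandomImpl<RandomParameter>; // illustrative only
} // namespace startPosition
```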
I just modified again the famous "system dependent" SLURM variables, and now I am able to run picongpu with the full 8 GPUs on one node. It seems to run well if the grid size is adjusted not to trigger the OpenMPI crash I mentioned above.
I also attach my current template .tpl:
virgo.tpl.txt
BTW, feel free to correct/change things in the template.
Could you also attach your .cfg file that triggers that OpenMPI error? So that we can figure out whether PIConGPU should be sending this amount of data at all.
Just as a stupid, but quick, thing to try: we normally only use export OMPI_MCA_io=^ompio and no other OpenMPI settings in profiles. Does the issue persist in that case?
The cfg that triggers the crash in OpenMPI is the following:
# Copyright 2013-2021 Axel Huebl, Rene Widera, Felix Schmitt, Franz Poeschel
#
# This file is part of PIConGPU.
#
# PIConGPU is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PIConGPU is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PIConGPU.
# If not, see <http://www.gnu.org/licenses/>.
#
##
## This configuration file is used by PIConGPU's TBG tool to create a
## batch script for PIConGPU runs. For a detailed description of PIConGPU
## configuration files including all available variables, see
##
## docs/TBG_macros.cfg
##
#################################
## Section: Required Variables ##
#################################
TBG_wallTime="2:00:00"
TBG_devices_x=2
TBG_devices_y=4
TBG_devices_z=1
#TBG_gridSize="192 2048 160"
TBG_gridSize="192 1024 160"
TBG_steps="4000"
# leave TBG_movingWindow empty to disable moving window
TBG_movingWindow="-m --windowMovePoint 0.9"
#################################
## Section: Optional Variables ##
#################################
# png image output (rough electron density and laser preview)
TBG_pngYX="--e_png.period 100 \
--e_png.axis yx --e_png.slicePoint 0.5 \
--e_png.folder pngElectronsYX"
# energy histogram (electrons, [keV])
TBG_e_histogram="--e_energyHistogram.period 100 \
--e_energyHistogram.binCount 1024 \
--e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000 \
--e_energyHistogram.filter all"
# longitudinal phase space (electrons, [m_e c])
TBG_e_PSypy="--e_phaseSpace.period 100 \
--e_phaseSpace.space y --e_phaseSpace.momentum py \
--e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0 \
--e_phaseSpace.filter all"
TBG_openPMD="--openPMD.period 100 \
--openPMD.file simData \
--openPMD.ext bp \
--checkpoint.backend openPMD \
--checkpoint.period 100 \
--checkpoint.restart.backend openPMD"
# macro particle counter (electrons, debug information for memory)
TBG_e_macroCount="--e_macroParticlesCount.period 100"
TBG_plugins="!TBG_pngYX \
!TBG_e_histogram \
!TBG_e_PSypy \
!TBG_e_macroCount \
!TBG_openPMD"
#################################
## Section: Program Parameters ##
#################################
TBG_deviceDist="!TBG_devices_x !TBG_devices_y !TBG_devices_z"
TBG_programParams="-d !TBG_deviceDist \
-g !TBG_gridSize \
-s !TBG_steps \
!TBG_movingWindow \
!TBG_plugins \
--versionOnce"
# TOTAL number of devices
TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))"
"$TBG_cfgPath"/submitAction.sh
I just commented out the grid size generating the crash and reduced the Y dimension by a factor of 2 to overcome the OpenMPI limitation.
The output of the simulation using the modified 8.cfg seems to be OK:
Running program...
PIConGPU: 0.7.0-dev
Build-Type: Release
Third party:
OS: Linux-3.10.0-1160.31.1.el7.x86_64
arch: x86_64
CXX: Clang (13.0.0)
CMake: 3.20.5
Boost: 1.75.0
MPI:
standard: 3.1
flavor: OpenMPI (4.0.3)
PNGwriter: 0.7.0
openPMD: 0.14.3
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00229 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
Estimates are based on DensityRatio to BASE_DENSITY of each species
(see: density.param, speciesDefinition.param).
and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.0247974)
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 7864320
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 117120 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 117120 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0 keV and corresponding Debye length 0 m.
The grid has 0 cells per average Debye length
initialization time: 4sec 332msec = 4.332 sec
0 % = 0 | time elapsed: 17sec 449msec | avg time per step: 0msec
5 % = 200 | time elapsed: 31sec 758msec | avg time per step: 14msec
10 % = 400 | time elapsed: 58sec 329msec | avg time per step: 13msec
15 % = 600 | time elapsed: 1min 26sec 395msec | avg time per step: 16msec
20 % = 800 | time elapsed: 1min 56sec 448msec | avg time per step: 19msec
25 % = 1000 | time elapsed: 2min 26sec 663msec | avg time per step: 28msec
30 % = 1200 | time elapsed: 2min 59sec 763msec | avg time per step: 23msec
35 % = 1400 | time elapsed: 3min 29sec 455msec | avg time per step: 25msec
40 % = 1600 | time elapsed: 4min 1sec 669msec | avg time per step: 25msec
45 % = 1800 | time elapsed: 4min 33sec 885msec | avg time per step: 27msec
50 % = 2000 | time elapsed: 5min 6sec 221msec | avg time per step: 33msec
55 % = 2200 | time elapsed: 5min 38sec 681msec | avg time per step: 29msec
60 % = 2400 | time elapsed: 6min 12sec 680msec | avg time per step: 28msec
65 % = 2600 | time elapsed: 6min 44sec 568msec | avg time per step: 26msec
70 % = 2800 | time elapsed: 7min 14sec 657msec | avg time per step: 22msec
75 % = 3000 | time elapsed: 7min 44sec 405msec | avg time per step: 28msec
80 % = 3200 | time elapsed: 8min 12sec 640msec | avg time per step: 22msec
85 % = 3400 | time elapsed: 8min 41sec 251msec | avg time per step: 21msec
90 % = 3600 | time elapsed: 9min 17sec 513msec | avg time per step: 21msec
95 % = 3800 | time elapsed: 9min 46sec 544msec | avg time per step: 21msec
100 % = 4000 | time elapsed: 10min 14sec 462msec | avg time per step: 20msec
calculation simulation time: 10min 26sec 649msec = 626.649 sec
full simulation time: 10min 31sec 765msec = 631.765 sec
Since I do not have any performance reference: is the wall time of 10 min 31 sec OK for such a simulation?
Thanks. I will now do some back-of-the-envelope estimates of how the communication should happen then.
For performance, I do not have a reference in my head for LWFA. You can try running our benchmark setup, for which we have an idea of how it should perform.
OK thanks !
Do you also know if there is some documentation related to this LWFA example?
If you mean specifically for LWFA, there is only a small doc page here. In case there are some physics questions, my colleagues could help (I am a computer scientist).
OK thanks a lot !
BTW, this is exactly the hardware setup we have, just with double the GPUs/RAM/cores. I was trying to use it to find out the optimal options for SLURM, but if you can help, I would appreciate it!
So for that .cfg file, there is no way PIConGPU should be attempting a send with message size 1223060560. It could be the result of some error in PIConGPU, an issue in OpenMPI, or a misreported issue. With all 3 it's weird that we never saw it before. To investigate further, you could rebuild PIConGPU in debug mode as described here with 127 for PIC_VERBOSE and PMACC_VERBOSE, run, and attach stdout and stderr. Then we may be able to get the message sizes PIConGPU requested to send.
BTW, this is exactly the hardware setup we have, just with double the GPUs/RAM/cores. I was trying to use it to find out the optimal options for SLURM, but if you can help, I would appreciate it!
Ideally, your system documentation or admin should have the recommended ways of submitting jobs. We normally start from there when setting up PIConGPU on a new system, and then adjust / make support tickets when something does not work (depending, of course, on IT infrastructure and available workforce). In case there is none, the linked docs of a similar system are a good start. I think generally, if one has some working configuration that allows running jobs, it is most reasonable to first make sure MPI and all needed dependencies (openPMD API etc.) work fine. The .tpl file can be refined later as well.
Sure, and I think I will discover more things along the way.
BTW, what is the procedure to run the benchmark tests? Like any other example, i.e. using pic-create and pic-build?
Yes, it is just another example, but made specifically for performance measurements.
Ok thanks!
@sbastrakov running the benchmark with config 1.cfg, I got the following results:
initialization time: 1sec 312msec = 1.312 sec
0 % = 0 | time elapsed: 74msec | avg time per step: 0msec
5 % = 50 | time elapsed: 1sec 778msec | avg time per step: 34msec
10 % = 100 | time elapsed: 3sec 543msec | avg time per step: 35msec
15 % = 150 | time elapsed: 5sec 275msec | avg time per step: 34msec
20 % = 200 | time elapsed: 6sec 984msec | avg time per step: 34msec
25 % = 250 | time elapsed: 8sec 666msec | avg time per step: 33msec
30 % = 300 | time elapsed: 10sec 363msec | avg time per step: 33msec
35 % = 350 | time elapsed: 12sec 9msec | avg time per step: 32msec
40 % = 400 | time elapsed: 13sec 615msec | avg time per step: 32msec
45 % = 450 | time elapsed: 15sec 260msec | avg time per step: 32msec
50 % = 500 | time elapsed: 17sec 109msec | avg time per step: 36msec
55 % = 550 | time elapsed: 18sec 870msec | avg time per step: 34msec
60 % = 600 | time elapsed: 20sec 604msec | avg time per step: 34msec
65 % = 650 | time elapsed: 22sec 333msec | avg time per step: 34msec
70 % = 700 | time elapsed: 24sec 45msec | avg time per step: 34msec
75 % = 750 | time elapsed: 25sec 784msec | avg time per step: 34msec
80 % = 800 | time elapsed: 27sec 458msec | avg time per step: 33msec
85 % = 850 | time elapsed: 29sec 152msec | avg time per step: 33msec
90 % = 900 | time elapsed: 30sec 829msec | avg time per step: 33msec
95 % = 950 | time elapsed: 32sec 523msec | avg time per step: 33msec
100 % = 1000 | time elapsed: 34sec 203msec | avg time per step: 33msec
calculation simulation time: 34sec 215msec = 34.215 sec
full simulation time: 35sec 615msec = 35.615 sec
The other config, 1_radiation.cfg, used an unrecognized option:
unrecognised option '--e_radiation.period'
34 seconds is reasonable, that should be about 1 ns per particle update. So performance-wise on 1 GPU all seems good on your side.
1_radiation.cfg requires the radiation plugin to run. The plugin is conditionally enabled if you have a supported openPMD API version with the HDF5 backend. Currently it is the only remaining plugin requiring a specific backend. So could it be that you have the openPMD API (I assume so, since you ran LWFA before; edit: I now see you have it in the previously attached logs) but with the ADIOS backend? Merely for testing purposes, I think the first run is sufficient, however.
No, up to now I did not use the ADIOS plugin for the IO. If the first benchmark is enough to fully test, then it is enough for me!
@denisbertini Sorry for the late reply. Do I understand correctly that you still cannot run on more than two GPUs?
Could you post the tbg/submit.start
from the simulation directory of one of your crashing simulations? Furthermore, where does your system differ from Spock (https://docs.olcf.ornl.gov/systems/spock_quick_start_guide.html#spock-compute-nodes)? (number of cores, number of GPUs, host memory?)
Also, set the reserved memory in memory.param to 2 GiB. We found on Spock that simulations with lower values often crash, while this value works most of the time.
Also, when you experience crashes, try to run without openPMD output; it requires lots of host memory. Yet I doubt that this is your problem with the LWFA example.
I can now run on more than 2 GPUs. My last try was actually running picongpu on 16 nodes, each having 8 GPUs, and it works.
I already set the reserved memory to 1G in memory.param.
A very interesting fact though: indeed, I noticed that when I switch on the openPMD output, the program crashes or even gets stuck, and I need to kill it manually.
I also have to say that I increased the grid size of the standard 16.cfg; I now have the following setup:
TBG_wallTime="5:00:00"
TBG_devices_x=4
TBG_devices_y=8
TBG_devices_z=4
TBG_gridSize="512 1536 512"
Which of course will dump much more data to output than the standard simulation.
I am not sure the problem is linked to memory. If the memory usage were too high, the system would have killed the corresponding processes and SLURM would have returned an OUT_OF_MEMORY error status. This is not the case here.
For example, sometimes the program crashes with errors like this coming from MPI:
picongpu: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
pointing to a possible race condition.
Reducing the number of OpenMP threads reduces the occurrence of this error.
When I run without openPMD output, everything works stably.
The question is: how can one work without dumping the full field and particle data? Is there a way to avoid the full dumping in picongpu, e.g. via other plugins?
In order to assess the problem with openPMD, please do answer the following questions: Could you post the tbg/submit.start from the simulation directory of one of your crashing simulations? Furthermore, where does your system differ from Spock (https://docs.olcf.ornl.gov/systems/spock_quick_start_guide.html#spock-compute-nodes)? (number of cores, number of GPUs, host memory?)
Simulations without full output are often possible. Using e.g. the phase space plugin or tracer and probe particles to study particle and field evolution. See manual (https://picongpu.readthedocs.io/en/0.6.0/usage/plugins/phaseSpace.html#usage-plugins-phasespace), (https://picongpu.readthedocs.io/en/0.6.0/usage/workflows/tracerParticles.html) and (https://picongpu.readthedocs.io/en/0.6.0/usage/workflows/probeParticles.html), respectively. However, if you want to create checkpoints as saves for long running simulations or to restart simulations from an intermediate step with possibly altered parameters from this step on, you will need the capability to write full simulation output.
OK, sorry! Let me try to answer your questions. The differences from Spock are minor: a Lustre shared file system, and we use InfiniBand for the internode connection. The submit.start script I am using is the following:
#!/bin/bash
# Copyright 2013-2021 Axel Huebl, Richard Pausch, Rene Widera, Sergei Bastrakov, Klaus Steinger
#
# This file is part of PIConGPU.
#
# PIConGPU is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PIConGPU is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PIConGPU.
# If not, see <http://www.gnu.org/licenses/>.
#
# PIConGPU batch script for spock's SLURM batch system
#SBATCH --partition=gpu
#SBATCH --time=5:00:00
#SBATCH --job-name=lwfa_002
#SBATCH --nodes=16 # Nb of nodes
#SBATCH --ntasks=128 # Nb of MPI tasks
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=12 # CPU Cores per MPI process
#SBATCH --mem=0 # Requested Total Job Memory / Node
#SBATCH --mem-per-gpu=64000000
#SBATCH --gpu-bind=closest
#SBATCH --mail-type=NONE
#SBATCH --mail-user=d.bertini@gsi.de
#SBATCH --chdir=/lustre/rz/dbertini/gpu/data/lwfa_002
#SBATCH -o pog_%j.out
#SBATCH -e pog_%j.err
## calculations will be performed by tbg ##
# settings that can be controlled by environment variables before submit
# number of available/hosted devices per node in the system
# host memory per device
# number of CPU cores to block per GPU
# we have 12 CPU cores per GPU (96cores/8gpus ~ 12cores)
#.TBG_coresPerGPU=16
# Assign one OpenMP thread per available core per GPU (=task)
#export OMP_NUM_THREADS=12
export OMP_NUM_THREADS=1
# required GPUs per node for the current job
# We only start 1 MPI task per device
# use ceil to caculate nodes
## end calculations ##
echo 'Running program...'
cd /lustre/rz/dbertini/gpu/data/lwfa_002
export MODULES_NO_OUTPUT=1
source /lustre/rz/dbertini/gpu/picongpu.profile
if [ $? -ne 0 ] ; then
echo "Error: PIConGPU environment profile under \"/lustre/rz/dbertini/gpu/picongpu.profile\" not found!"
exit 1
fi
unset MODULES_NO_OUTPUT
# set user rights to u=rwx;g=r-x;o=---
umask 0027
echo "creating simOutput directory ... with delay"
mkdir simOutput 2> /dev/null
sleep 2
cd simOutput
# Compilers
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/bin/hipcc
# Main environment variables
export PICHOME=/lustre/rz/dbertini/gpu/picongpu
export PICSRC=$PICHOME
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PATH=$PICSRC:$PATH
export PATH=$PICSRC/bin:$PATH
export PATH=$PICSRC/src/tools/bin:$PATH
export PYTHONPATH=$PICSRC/lib/python:$PYTHONPATH
export ADIOS2_DIR=/opt/adios/2.7.1
export openPMD_DIR=/opt/openPMD-api/0.14.3/
export PNGwriter_DIR=/opt/pngwriter/0.7.0/
export ISAAC_DIR=/opt/isaac/1.5.2/
# output data
export SCRATCH=/lustre/rz/dbertini/gpu/data
export WORKDIR=/lustre/rz/dbertini/gpu
### environment
export PATH=/usr/local/bin:$PATH
export PATH=/opt/rocm/bin:$PATH
# ## picongpu dependencies
export LD_LIBRARY_PATH=/opt/adios/1.13.1/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/boost/1.75.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/hdf5/1.10.7/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/icet/2.9.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/isaac/1.5.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/janson/2.9.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/libsplash/1.7.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/libpngwriter/0.7.0/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/openPMD-api/0.14.3/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
# add necessary rocm-libs part
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
# add adios 2
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/adios/2.7.1/bin:$PATH
# add necessary rocm-libs part
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
# add adios 2
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/adios/2.7.1/bin:$PATH
export CPLUS_INCLUDE_PATH=/opt/adios/1.13.1/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/adios/2.7.1/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/boost/1.75.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/hdf5/1.10.7/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/icet/2.9.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/isaac/1.5.2/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/janson/2.9.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/libsplash/1.7.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/pngwriter/0.7.0/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/openPMD-api/0.14.3/include:$CPLUS_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=/opt/rocm/include:$CPLUS_INCLUDE_PATH
export PATH=/opt/adios/1.13.1/bin:$PATH
export PATH=/opt/hdf5/1.10.7/bin:$PATH
export PATH=/opt/libsplash/1.7.0/bin:$PATH
export PATH=/opt/openPMD-api/0.14.3/bin:$PATH
## add hip+adios
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
export PATH=/opt/adios/2.7.1/bin:$PATH
# setup openMPI
export PMIX_MCA_gds=^ds21
#export OMPI_MCA_io=^ompio
export OMPI_MCA_io=romio321
export ROMIO_HINTS=./my_romio_hints
cat << EOF > ./my_romio_hints
romio_cb_write enable
romio_ds_write enable
cb_buffer_size 16777216
cb_nodes 16
EOF
export OMPI_MCA_mpi_leave_pinned=0
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000
export OPENPMD_BP_BACKEND=ADIOS2
cat > /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh <<EOF
#!/bin/bash
# Compilers
export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/bin/hipcc
# Main environment variables
export PICHOME=/lustre/rz/dbertini/gpu/picongpu
export PICSRC=$PICHOME
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PATH=$PICSRC:$PATH
export PATH=$PICSRC/bin:$PATH
export PATH=$PICSRC/src/tools/bin:$PATH
export PYTHONPATH=$PICSRC/lib/python:$PYTHONPATH
export OPENPMD_BP_BACKEND=ADIOS2
export ADIOS2_DIR=/opt/adios/2.7.1
export openPMD_DIR=/opt/openPMD-api/0.14.3/
export PNGwriter_DIR=/opt/pngwriter/0.7.0/
export ISAAC_DIR=/opt/isaac/1.5.2/
# output data
export SCRATCH=/lustre/rz/dbertini/gpu/data
export WORKDIR=/lustre/rz/dbertini/gpu
## add hip+adios
export LD_LIBRARY_PATH=/opt/rocm/hiprand/lib:/opt/rocm/rocrand/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/adios/2.7.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/rocm/hiprand/bin:$PATH
export PATH=/opt/adios/2.7.1/bin:$PATH
## MPI
#export OMPI_MCA_io=^ompio
export OMPI_MCA_io=romio321
export OMPI_MCA_mpi_leave_pinned=0
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000
/lustre/rz/dbertini/gpu/data/lwfa_002/input/bin/picongpu -d 4 8 4 -g 512 1536 512 -s 1000 -m --windowMovePoint 0.9 --e_png.period 100 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX --e_png.period 100 --e_png.axis yz --e_png.slicePoint 0.5 --e_png.folder pngElectronsYZ --e_phaseSpace.period 100 --e_phaseSpace.space y --e_phaseSpace.momentum py --e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0 --e_phaseSpace.filter all --e_energyHistogram.period 100 --e_energyHistogram.binCount 1024 --e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000 --e_energyHistogram.filter all --openPMD.period 100 --openPMD.file simData --openPMD.ext bp --checkpoint.backend openPMD --checkpoint.period 100 --checkpoint.restart.backend openPMD --e_macroParticlesCount.period 100 --versionOnce
EOF
chmod +x /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh
#if [ -d /lustre/rz/dbertini/gpu/data/lwfa_002/simOutput ]; then
#  echo " SimOutput already exists! ... exiting"
#  exit 1
#fi
if [ $? -eq 0 ] ; then
# Run PIConGPU from within the singularity container ?
# srun -K1 -vvvv singularity exec --bind /cvmfs --rocm $WORKDIR/sifs/picongpu.sif /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh
srun -K1 singularity exec --bind /cvmfs --rocm $WORKDIR/sifs/picongpu.sif /lustre/rz/dbertini/gpu/data/lwfa_002/tbg/pic_sub.sh
fi
#this script was created with call cd /lustre/rz/dbertini/gpu/picInputs/myLWFA; /lustre/rz/dbertini/gpu/picongpu/bin/tbg -s sbatch -c etc/picongpu/16.cfg -t etc/picongpu/virgo-gsi/virgo.tpl /lustre/rz/dbertini/gpu/data/lwfa_002
The problem regarding crashes with openPMD output enabled is probably still related to memory, as writing data requires quite some extra host memory. In general, you need at least twice the amount of memory on the host that your simulation requires on the GPU. That is, your simulation setup should consume no more than 512 GiB / 2 / 8 = 32 GiB per GPU. To be on the safe side, when setting up simulations, make sure you do not require more than 28 GiB per GPU.
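The 512 GiB / 2 / 8 division can be checked with plain shell arithmetic (512 GiB of host RAM and 8 GPUs per node are the numbers from this thread):

```shell
# Host memory budget per GPU = total host RAM
#   / 2 (safety factor for openPMD output overhead)
#   / number of GPUs per node
total_host_gib=512
gpus_per_node=8
per_gpu_gib=$(( total_host_gib / 2 / gpus_per_node ))
echo "${per_gpu_gib} GiB per GPU"   # prints "32 GiB per GPU"
```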
Furthermore, you need to configure the ADIOS2 lib used during output in order to stay close to that 'twice the GPU memory' number and not require significantly more.
In order to do so, use the following in your simulation's *.cfg (https://gist.github.com/steindev/0ea04341c96ef068a1e78a353763c521) and set "InitialBufferSize": "28GB" in line 20.
In this snippet, infix is not relevant and can be changed, see docs. Adjust period to your liking, of course.
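For orientation, a hedged sketch of what such a .cfg fragment could look like; the variable name follows the linked gist, but the exact JSON layout here is an assumption based on openPMD-api's ADIOS2 backend configuration, so check it against the gist before use:

```shell
# Sketch of a .cfg fragment (assumed JSON schema, verify against the gist):
# pass an ADIOS2 engine parameter through openPMD-api's --openPMD.json option
TBG_ADIOS2_CONFIGURATION='{"adios2": {"engine": {"parameters": {"InitialBufferSize": "28GB"}}}}'
TBG_openPMD="--openPMD.period 100 --openPMD.file simData --openPMD.ext bp --openPMD.json '!TBG_ADIOS2_CONFIGURATION'"
echo "$TBG_openPMD"
```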
There is one more point. If you look closely at the TBG_ADIOS2_CONFIGURATION variable, you see that an operator of type blosc is applied to the dataset. That defines the compressor used during output.
Do you have c-blosc (https://github.com/Blosc/c-blosc/tree/v1.21.1) in version 1.21.1 installed?
If not, do so, as I believe the standard compressors in ADIOS2 do not like to compress datasets larger than 4GiB, which we certainly have in PIConGPU.
So c-blosc is required and not using it may be the source of the error you experience.
Apart from this, I recommend setting in memory.param
constexpr size_t reservedGpuMemorySize = uint64_t(2147483648); // 2 GiB
as I have still experienced numerous errors on AMD MI100 with a smaller value. (We do know in more detail what the source of this error is, and a bug report was filed with AMD at least three quarters of a year ago, but they don't investigate it. We don't know why... :unamused: )
Thanks a lot for all the detailed information! I will work through all the improvements you proposed step by step. Is there a reference or link to the AMD MI 100 bug report you are quoting? No, I do not have c-blosc installed.
Can you explain the factor 2 in your memory calculation: 512 GiB / 2 / 8 = 32 GiB?
Another question: how do I control the memory used by a simulation setup? Using the picongpu memory calculator?
The factor 2 is due to openPMD output. I forgot to take it into account in my earlier messages in this issue. The factor may be less than 2 actually, but 2 should definitely be safe.
Controlling memory usage - yes. You need to know, or have an upper estimate of, the number of macroparticles per cell. Knowing this number and the grid size allows estimating memory usage. It could be done on paper, or with our memory calculator.
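A back-of-envelope version of such an estimate; every concrete number below (local grid size per GPU, particles per cell, species count, bytes per macroparticle) is an assumption for illustration and must be replaced with the values of your actual setup:

```shell
# All numbers are illustrative assumptions, not values from this issue
cells=$(( 128 * 512 * 128 ))   # local grid cells per GPU
ppc=2                          # macroparticles per cell per species
species=2                      # e.g. electrons + ions
bytes_per_mp=100               # rough size of one macroparticle (position, momentum, weighting, ...)
total_gib=$(( cells * ppc * species * bytes_per_mp / 1024 / 1024 / 1024 ))
echo "~${total_gib} GiB for particles"   # prints "~3 GiB for particles"
```

Fields add further memory on top of this, and the particle budget must leave headroom for density peaks that develop during the run.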
Is there a reference, link, to the AMD MI 100 bug report you are quoting? No i do not have c-blosc installed.
No, nothing public. It is within a closed workspace that we (AMD/HPE/HZDR/OLCF) share during the CAAR project.
Another question : how to control the memory used by a simulation setup ? Using the picongpu memory calculator?
Adding to @sbastrakov's answer: keep in mind that your initial particle distribution should not already fill the 28 GiB, otherwise there is no space left for clusters of particles, such as the bunch that forms in LWFA.
For the memory calculation I will need to know how many macro-particles will be used by the simulation. Where do I get this info? In the .param files?
Hi, I am able to run PIConGPU (dev branch) on our AMD MI 100 GPU cluster, but only in single-GPU mode. As soon as I try to run the code in multi-GPU mode with more MPI tasks, the picongpu process is killed by the OS and the Slurm scheduler reports an Out Of Memory error. The out-of-memory failure always happens just after program initialisation.
This is the GPU mapping I used to submit picongpu, and these are the picongpu options corresponding to this mapping. Something seems to be wrong in the definition of this mapping. Any ideas what could be wrong here?