Closed denisbertini closed 3 years ago
Hello @denisbertini . Yes, the "GPU" in the name is historical and currently PIConGPU runs on a variety of modern architectures, and you can totally use multicore CPUs on a cluster or workstation.
The basics of installing PIConGPU dependencies are described in our documentation here. Building and launching the code itself are described here. Please note that for CPUs, depending on your environment, you might need to explicitly use the `-b omp2b` option with `pic-build` (also described in the Basics section). This will compile the code for CPUs and use OpenMP 2 for parallelization on shared memory (MPI is always used for distributed memory).
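For a concrete picture, building for CPUs might look roughly like this; the example name and paths are placeholders (assuming a `.profile` that defines `$PICSRC` has been sourced), not exact instructions for your system:

```shell
# Create a working copy of an example input set (LaserWakefield as a placeholder).
pic-create $PICSRC/share/picongpu/examples/LaserWakefield $HOME/picInputs/myLWFA
cd $HOME/picInputs/myLWFA

# Compile for CPUs: OpenMP 2 backend; optionally append the target architecture.
pic-build -b "omp2b:native"
```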
In case you have further questions or encounter any issues related to PIConGPU, you are welcome to post them here or create other GitHub issues.
Here is also a related PIConGPU publication that does some benchmarks on CPU: https://arxiv.org/abs/1606.02862
We run regularly on CPU as well as GPU clusters with PIConGPU. We just keep the name because it's already well-known and GPUs will likely win the Exascale race, so they will be around as a fast target platform for a while ;-)
Hi, dear developers, I want to use PIConGPU because it has a module which can simulate Thomson scattering, while other PIC codes cannot. I have already installed PIConGPU on my workstation, and I want to run it on CPUs. But I notice that no instructions about how to run PIConGPU on CPUs are included in the manual; it seems the tbg tool only works for GPUs. Could you please provide some information about how to run it on CPUs? Thanks in advance.
Dear @prlWanted ,
The target hardware (and parallel programming model) is controlled by how you compile a simulation with PIConGPU. We call this hardware + software combination a "backend". So when you are compiling with `pic-build`, there is an option `-b` to specify the backend. This is (admittedly, very briefly) documented here. Your active software environment at the time of building should of course match your target backend, e.g. OpenMP / CUDA / etc. should be available. To enable that, we normally create `.profile` files to prepare the environment. We have examples here; I think one can rather easily derive their own based on those. Normally a `.profile` already defines a target backend, and that serves as a default value for `-b`. Please note that e.g. for machines with both CPUs and GPUs we normally create two sets of `.profile` + `.tpl` files so that both configurations can be used, but in each terminal session only `source` and use one.
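As an illustration, a minimal workstation `.profile` for a CPU build could look roughly like the following sketch; the paths and the commented-out module line are placeholders for whatever your system provides:

```shell
# picongpu.profile (sketch) -- source this once per terminal session.

# Location of the PIConGPU sources and its helper tools.
export PICSRC=$HOME/src/picongpu     # placeholder path
export PATH=$PICSRC/bin:$PATH

# Load compiler, MPI, CMake, Boost etc. as appropriate for your system, e.g.:
# module load gcc openmpi cmake boost

# Default backend picked up by pic-build when -b is not given:
export PIC_BACKEND="omp2b"
```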
To add a little bit: `tbg` and basically all the tools that come with PIConGPU are backend-agnostic. So launching PIConGPU does not depend on the backend at all (it is not even a command-line option of PIConGPU itself, since the backend is compiled into it). In case we refer to `tbg` as GPU-only somewhere in the documentation, this is probably our mistake, please report it!
@prlWanted Great that you tried PIConGPU! I hope the references @sbastrakov provided will help you set up PIConGPU for your CPU system. In case anything is unclear, feel free to ask further questions. Just a warning: computing Thomson scattering via the radiation plugin is a computationally extremely expensive task. Thus, please do not be scared by the much longer compute times. We can help to optimize your setup if needed and possible.
Dear @sbastrakov,
Thanks for your comments. So now I compile with `pic-build -b "omp2b"` and I find an executable `picongpu` in the hidden folder `.build`. But how do I run the simulation? By `./picongpu`? The terminal tells me that I need to specify some parameters:
Usage picongpu [-d dx=1 dy=1 dz=1] -g width height depth [options]:
-h [ --help ] print help message and exit
--validate validate command line parameters and
exit
-v [ --version ] print version information and exit
-c [ --config ] arg Config file(s)
PIConGPU:
-s [ --steps ] arg Simulation steps
--checkpoint.restart.loop arg (=0) Number of times to restart the
simulation after simulation has
finished (for presentations). Note:
does not yet work with all plugins, see
issue #1305
and if I do `./picongpu -g 1000 1000 100`, I get:
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3AlpakaRand seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
Estimates are based on DensityRatio to BASE_DENSITY of each species
(see: density.param, speciesDefinition.param).
It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 200000000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time: 15sec 811msec = 15 sec
0 % = 0 | time elapsed: 0msec | avg time per step: 0msec
calculation simulation time: 16sec 642msec = 16 sec
full simulation time: 33sec 5msec = 33 sec
But I find no output files.
Do I need to use tbg? But it is for supercomputers, right? I am using a workstation.
@prlWanted Your setup is nearly correct, please add
- ./picongpu -g 1000 1000 100
+ ./picongpu -g 1000 1000 100 -s 1000
to run 1000 PIC cycles.
[Update] Regarding output
To activate output, you need to add further command-line arguments for each data analysis plugin. For that, please see our documentation.
E.g. to compute an energy histogram every 100th iteration with a maximum energy of 100000 keV, please execute:
./picongpu -g 1000 1000 100 -s 1000 --e_energyHistogram.period 100 --e_energyHistogram.filter all --e_energyHistogram.maxEnergy 100000
Dear @PrometheusPi , Thanks a lot for your kindness. PIConGPU is appealing to me because of its new features including being able to run on GPU, simulate Thomson scattering of laser on electron bunch and input a Gaussian beam with tilted pulse front.
@prlWanted Sorry, I was still editing my last comment to add the output, please see the updated last comment.
@PrometheusPi But does that mean I am not using the parameters of the LWFA examples I cloned?
@prlWanted Are you part of the DESY team to study Thomson sources together with @MaxThevenet and @TheresaBruemmer?
@prlWanted Sorry, I missed the fact that you wanted to run an LWFA example.
The default LWFA example contains various `*.cfg` files in `etc/picongpu/`. These configuration files define the output that PIConGPU should create at run time.
In your CPU-only case you could for example execute from within your LWFA example (where you built with `pic-build`):
$PICSRC/bin/tbg -t etc/picongpu/bash/mpiexec.tpl -c etc/picongpu/8.cfg ...some_path_you_want_to_output_to.../run001
with `$PICSRC` being the path to your PIConGPU source code and `8.cfg` being one of the configuration files.
This creates a simulation directory in `...some_path_you_want_to_output_to.../run001`. Go to this directory and run
bash ./tbg/submit.start
The last command runs the simulation from within your terminal, with all parallelization via MPI (on 8 ranks) and with all output defined in the `8.cfg` (png output (requires libpng), phase space output (requires openPMD or libSplash, depending on the version), hdf5, etc.).
If you could provide the output of `./picongpu --help`, I could further tell you which output options were compiled in and can thus be used.
edit: this somewhat repeats and extends on the explanation of @PrometheusPi above.
To clarify a little bit about `tbg`.
PIConGPU is of course eventually just a normal terminal application, and one can run it by providing command-line parameters, like you were trying; then parameters for output and other things can be added, e.g. what @PrometheusPi suggested. You can use `.build/picongpu -h` to get all command-line options active for that particular build of PIConGPU (so it already takes into account your physical setup and the software dependencies found). There is nothing inherently wrong with that, and for a first run on a new machine this may indeed be best, as it is quick to try and you have direct control of what goes into PIConGPU and can check it against the help output.
We normally do not do that (other than for testing) for two separate reasons.
First, as you rightfully pointed out, on clusters one normally can't directly launch applications, but instead needs to first put the PIConGPU binary call into mpiexec / mpirun and then wrap it all up into a job for SLURM or another job system, while a typical workstation would not have any job management system. `tbg` provides a unified interface to abstract from these details, so that launching PIConGPU is done uniformly on any machine. Of course, in order to enable that, one first needs to "describe" the machine to `tbg`. This is done, again, by a combination of (a little bit of) a `.profile` file and (mostly) a `.tpl` file. These files are prepared once per system and can then be reused. As I already mentioned, we have a list of those for some systems already. For a single workstation, perhaps you can start from this one and adapt it to your system.
Second, the number of command-line parameters grows very quickly for any realistic setup, especially regarding the output. It is much more convenient to group command-line options logically (e.g. grid resolution, boundary conditions, output, etc.) so that they are more human-readable, and then generate the command line based on that. In addition to physical and output options, the command line can also include run parameters such as the MPI process configuration, maximum wall time, etc. We put all of these parameters into `.cfg` files. For the laser wakefield example, the available configurations are here. Or you would get them in your directory after using `pic-create` for that example, as the documentation suggests. They are merely glorified shell scripts, except that for technical reasons we write `!TBG_variable` instead of `$TBG_variable` to dereference a variable, and `tbg` then replaces it.
@PrometheusPi I am not part of the DESY team; I am from Shenzhen Technology University, China. We are now developing a Thomson source.
@PrometheusPi @sbastrakov ,
Thanks a lot. Now I know how to run the simulation. But I need to learn how to adapt the `.tpl`, `.cfg` and `.profile` files.
@prlWanted If you need help with that, feel free to ask.
Would you be so kind as to add yourself (your institute) to our community map: https://github.com/ComputationalRadiationPhysics/picongpu-communitymap ?
@PrometheusPi Thanks. Yes, of course.
Dear Developers,
Hi. I read through the user guide and finally managed to run the example simulation of Thomson scattering on the CPUs of the workstation. But I still have some problems. Could you please help me with them?
The first one is about the particle number used in simulations. For example, I noticed that in the `.param` files, the author provides an option by:

```
#ifdef PARAM_SINGLE_PARTICLE
    CreateDensity<
        densityProfiles::FreeFormula,
        startPosition::OnePosition,
        PIC_Electrons
    >,
#else
    CreateDensity<
        densityProfiles::GaussianCloud,
        startPosition::Random,
        PIC_Electrons
    >,
#endif
```

So according to the `FreeFormula` defined in `density.param`, in the case I `#define PARAM_SINGLE_PARTICLE 1` in `density.param`, I am supposed to be able to simulate Thomson scattering of a single electron with the laser. But when I run the simulation, I get output saying `PIConGPUVerbose PHYSICS(1) | macro particles per device: 1572864`. Considering I used 32 CPUs and the grid size is 128*3072*128, I have particles in every cell! And the simulation is quite slow, while the Thomson scattering of one single electron is supposed to be very quick. I also notice that for the density profile `GaussianCloud`, in most of the cells, where the density is 0, there are still 6 particles (defined by `TYPICAL_PARTICLES_PER_CELL`). So my first question is: shouldn't the particle number in cells where the density is 0 also be 0?
The second one is about the GPU. We installed an Nvidia 2080 Ti GPU on the workstation and installed PIConGPU with CUDA via Spack. I changed `npernode` to 1, but when I run the simulation, I received an error:
Invalid MIT-MAGIC-COOKIE-1 key
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:
this topology can support:
App: /home/sztu/runs/cuda_thomson/input/bin/cuda_memtest.sh
Number of procs: 32
PPR: 1:node
Please revise the conflict and try again.
Could you help with it? Thanks a lot!
Hello @prlWanted ,
Regarding the single-particle definition. From the logical point of view, that setting is supposed to mean one physical particle and one macroparticle per cell, not per simulation. FYI, the number in the output `PIConGPUVerbose PHYSICS(1) | macro particles per device: 1572864` is only an estimate assuming homogeneous density (we mostly use it to see if anything is clearly wrong), so generally one cannot trust it to be the exact number. However, in this case it should be exact and indeed means one macroparticle per cell.
Also, from a technical side you seem to have chosen an unsafe way of defining `PARAM_SINGLE_PARTICLE`, as it is not guaranteed that this define is seen by all the files that use it. So you may have ended up with an inconsistent configuration where some files "saw" it set to 1 and others to 0. A safe way to set such definitions is externally relative to the source code itself, by modifying the file `cmakeFlags`. By default PIConGPU uses the settings from `flags[0]`; you could put the necessary defines there following the style of the other options. I guess you could copy the contents of `flags[3]` to `flags[0]` for this kind of test.
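Schematically, such an entry in `cmakeFlags` could look like the line below; this is only a sketch following the style of the existing entries, so please copy the exact define list from the `flags[3]` entry of your example rather than this line:

```shell
# cmakeFlags (fragment, schematic)
flags[0]="-DPARAM_OVERWRITES:LIST='-DPARAM_SINGLE_PARTICLE=1'"
```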
Regarding the GPU question, how many GPUs do you have on the system? It seems you requested to use 1 GPU per node so that requires 32 nodes then, and normally workstations are viewed as a single node with multiple GPUs.
@sbastrakov Thanks for your quick reply. I changed the flags as you commented, but the simulation is still quite slow. I have 1 GPU installed on the workstation, but we are going to get a GPU supercomputer, as they said. I hope I can use PIConGPU there.
So with one GPU you can still run the example, but you have to use a smaller grid so that it fits into GPU memory. To do so, you can just modify your `.cfg` file; no recompilation of PIConGPU is needed. Namely, these lines: the variables `TBG_devices_x`, `_y`, `_z` mean the number of MPI processes (and each MPI process uses 1 GPU) along x, y, z in a Cartesian geometry; for 1 GPU the only valid option is to set all these values to 1. Then `TBG_gridSize` is the number of cells along x, y, z. I guess you should decrease it somewhat proportionally, so that the product is about 32 times smaller than the current value.
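To illustrate the scaling arithmetic (a small sketch; the helper function and tuple layout below are just for this example, not PIConGPU code):

```python
# Scale the 32-device grid of the example down to a single GPU while
# keeping the number of cells per device roughly the same.

def cells_per_device(grid, devices):
    """Total number of cells divided by the number of devices."""
    nx, ny, nz = grid
    return nx * ny * nz // devices

print(cells_per_device((128, 3072, 128), 32))  # 32 devices: 1572864 cells each

# One option for 1 GPU: keep x and z, shrink the long y axis by 32.
print(cells_per_device((128, 3072 // 32, 128), 1))  # still 1572864 cells
```

(In practice each dimension also has to respect PIConGPU's supercell-size divisibility constraints, so not every split with the right product is valid.)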
@sbastrakov Hi. It still doesn't work. I did what you said, but I received an error:
Invalid MIT-MAGIC-COOKIE-1 key[12/07/2020 19:23:29][sztu-HP-Z8-G4-Workstation][0]:ERROR: CUDA error: CUDA driver version is insufficient for CUDA runtime version, line 279, file /home/sztu/src/spack/opt/spack/linux-ubuntu20.04-broadwell/gcc-5.5.0/picongpu-0.5.0-l75qt2jgv3nxufzxt4gteekhgaccdk5b/thirdParty/cuda_memtest/cuda_memtest.cu
cuda_memtest crash: see file /home/sztu/runs/cuda_thomson_001/simOutput/cuda_memtest_sztu-HP-Z8-G4-Workstation_0.err
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[33631,1],0]
Exit code: 1
It seems to be a problem with the memory; my CUDA version is `cuda@10.2.89`.
So now the first issue is resolved, and it is the `cuda_memtest` program (which we run just before PIConGPU) that fails. It fails because of the driver, not memory (that program is supposed to check memory, but it cannot even start). Very likely PIConGPU itself would fail for the same reason. You need to check which driver version is required for the CUDA version you use and update your system to that version. If you cannot update the driver, you can also downgrade the CUDA version used. Either way, you need to make the CUDA version compatible with the driver version.
Thanks, I will try.
@prlWanted regarding your question about `#define PARAM_SINGLE_PARTICLE 1`:
If set correctly, you will branch into a very specific density definition that, as @sbastrakov already mentioned, means one macro-particle per cell which also represents only one real electron. However, since the density definition is non-zero only in a single cell, the entire simulation should contain only one macro-particle (= one real electron). Thus in this case, single particle really means one electron per simulation. The number you have seen is just an estimate by PIConGPU assuming one particle per cell. Since your `32.cfg` uses 128 x 3072 x 128 cells, this means 50,331,648 cells and thus, for each of your 32 devices, 1,572,864 particles. The output is just an upper estimate and is not aware of the zero density in the density definition. You can check the real number of macro-particles at run time via the output of the macro-particle counter that should be activated in the `32.cfg`. For that, look in `simOutput` for a file named `e_macroParticlesCount.dat`. It should count only a single macro-particle.
Ah, then I misunderstood the settings. Thanks for correcting @PrometheusPi .
@sbastrakov @PrometheusPi Thanks a lot. I finally got it to run on the GPU, but the particle number cannot be too large or I get an error: "mallocMC: out of memory". I need to carefully adjust the particle number. Or can I first run a 2D simulation of Thomson scattering?
Nice to hear @prlWanted .
So adjusting the grid size and particle number to fit device memory is a constant need when using PIConGPU. Especially when running on GPUs one really wants device memory to be nearly fully used. This comes partially from experience, partially from trial and error, and we also have a helper tool `lib/python/picongpu/utils/memory_calculator.py` to estimate the memory requirements for a given grid size and number of macroparticles per cell, assuming that is constant.
For 2D, I am not sure if it makes sense physically for this setup; I think @PrometheusPi could comment on that.
Technically (for any setup) the dimensionality is set in the file `dimension.param`. In case your setup does not have this file, you can use `pic-edit` to create it.
I agree with @sbastrakov, fitting your simulation volume to the given resources is a complex task. The tool @sbastrakov recommended is definitely ideally suited for this task. For details, please see the documentation of the memory calculator. If you are running a single-particle setup, you can also neglect the memory required by that single particle. (Be aware, however, that a single particle already allocates an entire particle frame, equivalent to 256 particles; this is still negligible for the single-particle setup.)
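To give a flavor of the arithmetic the memory calculator automates, a back-of-envelope sketch (the bytes-per-macroparticle value below is a made-up assumption for illustration, not PIConGPU's actual memory layout):

```python
def particle_memory_bytes(cells, particles_per_cell, bytes_per_particle=64):
    """Rough bytes needed for macroparticle storage, assuming a constant
    number of macroparticles per cell (illustrative only)."""
    return cells * particles_per_cell * bytes_per_particle

cells = 128 * 96 * 128  # a possible single-GPU grid
print(particle_memory_bytes(cells, 6) / 2**30)  # about 0.56 GiB under these assumptions
```

Field storage, temporary buffers and output plugins add to this, which is why the real tool should be preferred over such hand estimates.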
For 2D radiation calculations: as long as your observation angles stay within the plane you simulate, a 2D setup should be fine. If your observation direction points outside the simulation plane, however, the scalar product between position and observation direction will cause numerical artifacts when evaluating the spectrally resolved Liénard-Wiechert potentials. Furthermore, please be aware that the Liénard-Wiechert potentials used assume a 3D space, even in a 2D simulation setup, thus electromagnetic fields will vanish as in 3D.
@sbastrakov @PrometheusPi Thanks, I will try.
Dear Developers,
Hi. We are trying to use PIConGPU on a CPU-based supercomputer, but we have some problems with running in parallel. We notice that there are several optional backends for `pic-build -b`; what are the differences between `omp2b`, `serial` and `threads`, please? When `mpirun` of OpenMPI is used, how do we specify the `mca` arguments, like `^tcp`, `openib` or `self`? Could you provide a template job submission script for a CPU-based supercomputer? Thanks a lot.
So as @PrometheusPi answered above, `-b` is used to specify the alpaka backend to be used internally, and therefore what kind of hardware it can use. For PIConGPU on CPUs, please use `omp2b` or the more specific `omp2b:[architecture]`.
Regarding the options specific to OpenMPI: we also faced the need to pass those on some systems, and we normally do so via the `.tpl` file, like here. Since we (or our users) normally make a `.tpl` file for each system, or each partition of a cluster, this way it only needs to be figured out and specified once.
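For illustration, the launch line in such a bash `.tpl` might be extended roughly as below; the MCA option shown is an example only, and the right set depends on your cluster's interconnect (the `!TBG_...` placeholders are substituted by `tbg` as described earlier):

```shell
# mpiexec.tpl (fragment, sketch -- adapt MCA options to your system)
# Disable the plain-TCP transport in favor of the fast interconnect:
mpiexec --mca btl ^tcp -np !TBG_tasks \
    !TBG_dstPath/input/bin/picongpu !TBG_programParams
```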
@prlWanted Did you succeed in runnning PIConGPU? Can this issue be closed?
Yes, thanks to the developers' help, it has been successfully run. Please close it.
Hi, I would like to know if PIConGPU can perform as well on a standard CPU-based Linux cluster. Thanks in advance, Denis