ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

PIConGPU without GPU? #3177

Closed denisbertini closed 3 years ago

denisbertini commented 4 years ago

Hi, I would like to know whether PIConGPU can also perform well on a standard CPU-based Linux cluster. Thanks in advance, Denis

sbastrakov commented 4 years ago

Hello @denisbertini . Yes, the "GPU" in the name is historical; PIConGPU currently runs on a variety of modern architectures, and you can absolutely use multicore CPUs on a cluster or workstation.

The basics of installing PIConGPU dependencies are described in our documentation here. Building and launching the code itself is described here. Please note that for CPUs, depending on your environment, you might need to explicitly pass the -b omp2b option to pic-build (also described in the Basics section). This compiles the code for CPUs and uses OpenMP 2 for parallelization on shared memory (MPI is always used for distributed memory).
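
For orientation, a minimal CPU build could look like the sketch below. The input-set path is a placeholder; it assumes you already sourced a CPU profile and created an input set with pic-create as described in the Basics section:

cd $HOME/picInputs/myLWFA   # an input set previously created with pic-create
pic-build -b omp2b          # compile for CPUs: OpenMP 2 on shared memory, MPI across processes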

In case you have further questions or encounter any issues related to PIConGPU, you are welcome to post them here or create other GitHub issues.

ax3l commented 4 years ago

Here is also a related PIConGPU publication that does some benchmarks on CPU: https://arxiv.org/abs/1606.02862

We regularly run PIConGPU on CPU as well as GPU clusters. We keep the name because it is already well known, and GPUs will likely win the Exascale race, so they will be around as a fast target platform for a while ;-)

prlWanted commented 3 years ago

Hi, dear developers, I want to use PIConGPU because it has a module that can simulate Thomson scattering, which other PIC codes cannot. I have already installed PIConGPU on my workstation and want to run it on CPUs. However, I notice that the manual contains no instructions on how to run PIConGPU on CPUs; it seems the tbg tool only works for GPUs. Could you please provide some information on how to run it on CPUs? Thanks in advance.

sbastrakov commented 3 years ago

Dear @prlWanted ,

The target hardware (and parallel programming model) is controlled by how you compile a simulation with PIConGPU. We call this hardware + software combination a "backend". So when you compile with pic-build, there is an option -b to specify the backend. This is (admittedly, very briefly) documented here. Your active software environment at build time should of course match your target backend, i.e. OpenMP / CUDA / etc. should be available. To enable that, we normally create .profile files to prepare the environment. We have examples here, and I think one can rather easily derive their own based on those. Normally a .profile already defines a target backend, and that serves as the default value for -b. Please note that for machines with both CPUs and GPUs we normally create two sets of .profile + .tpl so that both configurations can be used, but in each terminal session you only source and use one.
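
As a rough illustration, a CPU-only workstation profile might contain something like the sketch below; the paths and the exact set of variables are assumptions on my side, the linked example profiles are the reference:

# picongpu.profile (sketch) - prepare the environment for CPU builds
export PICSRC=$HOME/src/picongpu     # path to the PIConGPU sources
export PATH=$PICSRC/bin:$PATH        # make pic-create, pic-build, tbg, ... available

# default backend picked up by pic-build when no -b is given
export PIC_BACKEND="omp2b"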

sbastrakov commented 3 years ago

To add a little bit: tbg and basically all the tools that come with PIConGPU are backend-agnostic. So launching PIConGPU does not depend on the backend at all (it is not even a command-line option of PIConGPU itself, since the backend is compiled into the binary). In case we somehow refer to tbg as GPU-only in the documentation, that is probably our mistake, please report it!

PrometheusPi commented 3 years ago

@prlWanted Great that you tried PIConGPU! I hope the references @sbastrakov provided will help you set up PIConGPU for your CPU system. In case anything is unclear, feel free to ask further questions. Just a warning: computing Thomson scattering via the radiation plugin is a computationally extremely expensive task, so please do not be put off by the much longer compute times. We can help optimize your setup if needed and possible.

prlWanted commented 3 years ago

Dear @sbastrakov, Thanks for your comments. I now compile with pic-build -b "omp2b" and find an executable picongpu in the hidden folder .build. But how do I run the simulation? By ./picongpu? The terminal tells me that I need to specify some parameters:

Usage picongpu [-d dx=1 dy=1 dz=1] -g width height depth [options]:
  -h [ --help ]                         print help message and exit
  --validate                            validate command line parameters and 
                                        exit
  -v [ --version ]                      print version information and exit
  -c [ --config ] arg                   Config file(s)

PIConGPU:
  -s [ --steps ] arg                    Simulation steps
  --checkpoint.restart.loop arg (=0)    Number of times to restart the 
                                        simulation after simulation has 
                                        finished (for presentations). Note: 
                                        does not yet work with all plugins, see
                                        issue #1305

and if I do ./picongpu -g 1000 1000 100, I get:

PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3AlpakaRand seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 200000000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time: 15sec 811msec = 15 sec
  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
calculation  simulation time: 16sec 642msec = 16 sec
full simulation time: 33sec   5msec = 33 sec

But I find no output files.

Do I need to use tbg? But it is for supercomputers, right? I am using a workstation.

PrometheusPi commented 3 years ago

@prlWanted Your setup is nearly correct, please add the missing -s (steps) option:

- ./picongpu -g 1000 1000 100
+ ./picongpu -g 1000 1000 100 -s 1000

to run 1000 PIC cycles.

[Update] Regarding output

To activate output, you need to add further command-line arguments for each data analysis plugin. For that, please see our documentation.

E.g. to compute an electron energy histogram every 100th iteration with a maximum energy of 100000 keV, please execute:

./picongpu -g 1000 1000 100 -s 1000 --e_energyHistogram.period 100 --e_energyHistogram.filter all --e_energyHistogram.maxEnergy 100000

prlWanted commented 3 years ago

Dear @PrometheusPi , Thanks a lot for your kindness. PIConGPU is appealing to me because of its features, including being able to run on GPUs, simulate Thomson scattering of a laser on an electron bunch, and use a Gaussian beam with a tilted pulse front as input.

PrometheusPi commented 3 years ago

@prlWanted Sorry, I was still editing my last comment to add the output, please see the updated last comment.

prlWanted commented 3 years ago

@PrometheusPi But does that mean I am not using the parameters of the LWFA examples I cloned?

PrometheusPi commented 3 years ago

@prlWanted Are you part of the DESY team to study Thomson sources together with @MaxThevenet and @TheresaBruemmer?

PrometheusPi commented 3 years ago

@prlWanted Sorry, I missed the fact that you wanted to run an LWFA example.

The default LWFA example contains various *.cfg files in etc/picongpu/. These configuration files define the output that PIConGPU should create at run time.

In your CPU-only case you could, for example, execute the following from within your LWFA example directory (where you built with pic-build):

$PICSRC/bin/tbg -t etc/picongpu/bash/mpiexec.tpl -c etc/picongpu/8.cfg ...some_path_you_want_to_output_to.../run001

where $PICSRC is the path to your PIConGPU source code and 8.cfg is one of the configuration files.

This creates a simulation directory at ...some_path_you_want_to_output_to.../run001. Go to this directory and run

bash ./tbg/submit.start

The last command runs the simulation from within your terminal, with all parallelization via MPI (on 8 ranks) and with all output defined in 8.cfg: PNG output (requires libpng), phase space output (requires openPMD or libSplash, depending on the version), HDF5, etc.

If you could provide the output of ./picongpu --help, I could further tell you which output options were compiled in and can thus be used.

sbastrakov commented 3 years ago

edit: this somewhat repeats and extends the explanation of @PrometheusPi above.

To clarify a little bit about tbg.

PIConGPU is of course ultimately just a normal terminal application, and one can run it by providing command-line parameters directly, as you were trying; parameters for output and other things can then be added, e.g. what @PrometheusPi suggested. You can use .build/picongpu -h to get all command-line options active for that particular build of PIConGPU (so it already takes into account your physical setup and the software dependencies found). There is nothing inherently wrong with this approach, and for a first run on a new machine it may indeed be the best, as it is quick to try and you have direct control of what goes into PIConGPU and can check it against the help output.

We normally do not do that (other than for testing) for two separate reasons.

First, as you rightfully pointed out, on clusters one normally cannot launch applications directly, but instead needs to wrap the PIConGPU binary call in mpiexec / mpirun and then wrap it all up into a job for SLURM or another job system, while a typical workstation has no job management system at all. tbg provides a unified interface that abstracts these details away, so that launching PIConGPU is done uniformly on any machine. Of course, to enable this one first needs to "describe" the machine to tbg. This is done, again, by a combination of (a little bit of) the .profile file and (mostly) the .tpl file. These files are prepared once per system and can then be reused. As I already mentioned, we have a list of those for some systems already. For a single workstation, perhaps you can start from this one and adapt it to your system.

Second, the number of command-line parameters grows very quickly for any realistic setup, especially regarding the output. It is then much more convenient to group command-line options logically (e.g. grid resolution, boundary conditions, output, etc.) so that they are more human-readable, and to generate the command line from that. In addition to physical and output options, the command line can also include run parameters such as the MPI process configuration, maximum wall time, etc. We put all of these parameters into .cfg files. For the laser wakefield example, the available configurations are here. Or you get them in your directory after using pic-create for that example, as the documentation suggests. They are merely glorified shell scripts, except that for technical reasons we write !TBG_variable instead of $TBG_variable to dereference a variable, and tbg then replaces it.
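
For illustration, an output-related excerpt of such a .cfg could look roughly like this (abridged; the histogram options are the ones from the example above, and the variable names follow the style of the LWFA example files):

# .cfg excerpt (sketch): options grouped into TBG_ variables,
# dereferenced with !TBG_ (not $TBG_), which tbg substitutes at launch time
TBG_e_histogram="--e_energyHistogram.period 100 --e_energyHistogram.filter all --e_energyHistogram.maxEnergy 100000"
TBG_plugins="!TBG_e_histogram"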

prlWanted commented 3 years ago

@PrometheusPi I am not part of the DESY team; I am from Shenzhen Technology University, China. We are developing a Thomson source.

prlWanted commented 3 years ago

@PrometheusPi @sbastrakov , Thanks a lot. Now I know how to run the simulation, but I still need to learn how to adapt the .tpl, .cfg and .profile files.

PrometheusPi commented 3 years ago

@prlWanted If you need help with that, feel free to ask.

Would you be so kind as to add yourself (your institute) to our community map: https://github.com/ComputationalRadiationPhysics/picongpu-communitymap ?

prlWanted commented 3 years ago

@PrometheusPi Thanks. Yes, of course.

prlWanted commented 3 years ago

Dear Developers,

Hi. I read through the user guide and finally managed to run the example Thomson scattering simulation on the CPUs of the workstation. But I still have some problems. Could you please help me with them?

The first one is about the particle number used in simulations. For example, I noticed that in the .param files the author provides an option via:

#ifdef PARAM_SINGLE_PARTICLE
    CreateDensity<
        densityProfiles::FreeFormula,
        startPosition::OnePosition,
        PIC_Electrons
    >,
#else
    CreateDensity<
        densityProfiles::GaussianCloud,
        startPosition::Random,
        PIC_Electrons
    >,
#endif

So according to the FreeFormula defined in density.param, if I #define PARAM_SINGLE_PARTICLE 1 in density.param, I am supposed to be able to simulate Thomson scattering of a single electron with the laser. But when I run the simulation, I get output saying PIConGPUVerbose PHYSICS(1) | macro particles per device: 1572864. Considering that I used 32 CPUs and the grid size is 128*3072*128, I have particles in every cell! And the simulation is quite slow, while Thomson scattering of a single electron is supposed to be very quick. I also notice that for the GaussianCloud density profile, most of the cells where the density is 0 still contain 6 particles (defined by TYPICAL_PARTICLES_PER_CELL). So my first question is: shouldn't the particle number in cells where the density is 0 also be 0?

The second one is about the GPU. We installed an Nvidia 2080 Ti GPU in the workstation and installed PIConGPU with CUDA via Spack. I changed npernode to 1, but when I run the simulation, I receive an error:

Invalid MIT-MAGIC-COOKIE-1 key--------------------------------------------------------------------------

Your job has requested more processes than the ppr for
this topology can support:

  App: /home/sztu/runs/cuda_thomson/input/bin/cuda_memtest.sh
  Number of procs:  32
  PPR: 1:node

Please revise the conflict and try again.

Could you help with it? Thanks a lot!

sbastrakov commented 3 years ago

Hello @prlWanted ,

Regarding the single particle definition. From the logical point of view, that setting is supposed to mean one physical particle and one macroparticle per cell, not per simulation. FYI, the number in the output PIConGPUVerbose PHYSICS(1) | macro particles per device: 1572864 is only an estimate assuming homogeneous density (we mostly use it to see whether anything is clearly wrong), so generally one cannot trust it to be the exact number. However, in this case it should be the exact number and indeed means one macroparticle per cell.

Also, from a technical point of view, you seem to have chosen an unsafe way of defining PARAM_SINGLE_PARTICLE, as it is not guaranteed that this define is seen by all the files that use it. So you may have ended up with an inconsistent configuration where some files "saw" it set to 1 and others set to 0. A safe way to set such definitions is externally to the source code itself, by modifying the file cmakeFlags. By default PIConGPU uses the settings from flags[0]; you could put the necessary defines there following the style of the other options. I guess you could copy the contents of flags[3] to flags[0] for this kind of test.
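
For illustration, such an entry could look roughly like the sketch below (following the usual -DPARAM_OVERWRITES pattern of the shipped cmakeFlags files; please check the file in your input set for the exact syntax):

# cmakeFlags excerpt (sketch); flags[0] is the default set used by pic-build
flags[0]="-DPARAM_OVERWRITES:LIST='-DPARAM_SINGLE_PARTICLE=1'"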

sbastrakov commented 3 years ago

Regarding the GPU question: how many GPUs do you have in the system? It seems you requested 1 GPU per node, so 32 processes then require 32 nodes, while a workstation is normally viewed as a single node with multiple GPUs.

prlWanted commented 3 years ago

@sbastrakov Thanks for your quick reply. I changed the flags as you suggested, but the simulation is still quite slow. I have 1 GPU installed in the workstation, but we are going to get a GPU supercomputer, as I am told. I hope I can use PIConGPU there.

sbastrakov commented 3 years ago

So with one GPU you can still run the example, but you have to use a smaller grid so that it fits into GPU memory. To do so, you can just modify your .cfg file; no recompilation of PIConGPU is needed. Namely, these lines: the variables TBG_devices_x, _y, _z give the number of MPI processes (each MPI process uses 1 GPU) along x, y, z in a Cartesian decomposition, and for 1 GPU the only valid option is to set all of them to 1. Then TBG_gridSize is the number of cells along x, y, z; I guess you should decrease the values roughly proportionally, so that their product is about 32 times smaller than the current one.
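
As a rough sketch of the relevant .cfg lines (the grid values below are placeholders chosen to be exactly 32 times smaller in total than 128 x 3072 x 128; depending on the PIConGPU version, the -g / -s flags may already be part of these variables or added by the .tpl):

TBG_devices_x=1
TBG_devices_y=1
TBG_devices_z=1

# 64 * 768 * 32 cells = 1/32 of the original 128 * 3072 * 128
TBG_gridSize="64 768 32"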

prlWanted commented 3 years ago

@sbastrakov Hi. It still doesn't work. I did what you said. But I received an error:

Invalid MIT-MAGIC-COOKIE-1 key[12/07/2020 19:23:29][sztu-HP-Z8-G4-Workstation][0]:ERROR: CUDA error: CUDA driver version is insufficient for CUDA runtime version, line 279, file /home/sztu/src/spack/opt/spack/linux-ubuntu20.04-broadwell/gcc-5.5.0/picongpu-0.5.0-l75qt2jgv3nxufzxt4gteekhgaccdk5b/thirdParty/cuda_memtest/cuda_memtest.cu
cuda_memtest crash: see file /home/sztu/runs/cuda_thomson_001/simOutput/cuda_memtest_sztu-HP-Z8-G4-Workstation_0.err
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33631,1],0]
  Exit code:    1

It seems to be a problem with the memory; my CUDA version is cuda@10.2.89.

sbastrakov commented 3 years ago

So now the first issue is resolved, and the cuda_memtest program that we run just before PIConGPU is now starting. It fails because of the driver, not memory (the program is supposed to check memory, but it cannot even start), and very likely PIConGPU would fail for the same reason. You now need to check which driver version is required for the CUDA version you use and update your system to that driver. If you cannot update the driver, you can also downgrade the CUDA version. Either way, you need to make the CUDA and driver versions compatible.
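
Two quick checks that may help compare the versions (standard NVIDIA tools, nothing PIConGPU-specific):

nvidia-smi      # reports the installed driver and the highest CUDA version it supports
nvcc --version  # reports the CUDA toolkit version used for the build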

prlWanted commented 3 years ago

Thanks, I will try.

PrometheusPi commented 3 years ago

@prlWanted Regarding your question about #define PARAM_SINGLE_PARTICLE 1: if set correctly, you branch into a very specific density definition that, as @sbastrakov already mentioned, means one macroparticle per cell which also represents only one real electron. However, since that density definition is non-zero in a single cell only, the entire simulation should contain just one macroparticle (= one real electron). Thus, in this case, single particle really means one electron per simulation. The number you have seen is just an estimate by PIConGPU assuming one particle per cell. Since your 32.cfg uses 128 x 3072 x 128 cells, this means 50,331,648 cells in total and thus 1,572,864 particles for each of your 32 devices. The output is just an upper estimate and is not aware of the zero density in the density definition. You can check the real number of macroparticles at run time via the output of the macroparticle counter, which should be activated in the 32.cfg. For that, look in simOutput for a file named e_macroParticlesCount.dat. It should count only a single macroparticle.

sbastrakov commented 3 years ago

Ah, then I misunderstood the settings. Thanks for correcting @PrometheusPi .

prlWanted commented 3 years ago

@sbastrakov @PrometheusPi Thanks a lot. I finally got it running on the GPU, but the particle number cannot be too large or I get an error: "mallocMC: out of memory". I need to adjust the particle number carefully. Or could I first run a 2D simulation of the Thomson scattering?

sbastrakov commented 3 years ago

Nice to hear @prlWanted .

So adjusting the grid size and particle number to fit device memory is a constant need when using PIConGPU; especially when running on GPUs, one really wants the device memory to be nearly fully used. This comes partially from experience, partially from trial and error, and we also have a helper tool lib/python/picongpu/utils/memory_calculator.py to estimate memory requirements for a given grid size and number of macroparticles per cell, assuming that number is constant.
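
Just to illustrate the kind of estimate involved, a back-of-envelope calculation could look like the sketch below; this is not the memory_calculator API, and the per-cell and per-particle byte sizes are rough assumptions:

# rough per-device memory estimate (illustrative numbers only)
cells=$((64 * 768 * 32))           # local grid on one device
ppc=6                              # macroparticles per cell, e.g. TYPICAL_PARTICLES_PER_CELL
bytes_per_particle=48              # assumed size of one macroparticle's attributes
bytes_per_cell_fields=$((9 * 4))   # assumed: E, B, J with 3 single-precision components each
echo "fields:    $(( cells * bytes_per_cell_fields / 1024 / 1024 )) MiB"
echo "particles: $(( cells * ppc * bytes_per_particle / 1024 / 1024 )) MiB"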

For 2D, I am not sure whether it makes sense physically for this setup; I think @PrometheusPi can comment on that. Technically (for any setup), the dimensionality is set in the file dimension.param. In case your setup does not have this file, you can use pic-edit to create it.
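
A possible workflow sketch (please check pic-edit's help for the exact argument form, and verify the define against the dimension.param you get; the SIMDIM / DIM2 names are an assumption on my side):

pic-edit dimension      # pull dimension.param into the input set and open it for editing
# then switch the dimensionality inside dimension.param, e.g.:
#   #define SIMDIM DIM2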

PrometheusPi commented 3 years ago

I agree with @sbastrakov, fitting your simulation volume to the given resources is a complex task. The tool @sbastrakov recommended is definitely well suited for this; for details, please see the documentation of the memory calculator. If you are running a single-particle setup, you can neglect the memory required by that single particle. (Be aware, however, that a single particle already allocates an entire particle frame, equivalent to 256 particles; this is still negligible for the single-particle setup.)

For 2D radiation calculations: as long as your observation angles stay within the plane you simulate, a 2D setup should be fine. If your observation direction points outside the simulation plane, however, the scalar product between position and observation direction will cause numerical artifacts when evaluating the spectrally resolved Liénard-Wiechert potentials. Furthermore, please be aware that the Liénard-Wiechert potentials used assume a 3D space even in a 2D simulation setup, thus the electromagnetic fields will vanish as they do in 3D.

prlWanted commented 3 years ago

@sbastrakov @PrometheusPi Thanks, I will try.

prlWanted commented 3 years ago

Dear Developers, Hi. We are trying to use PIConGPU on a CPU-based supercomputer, but we have some problems with parallel running. We notice that there are several optional backends for pic-build -b; what are the differences between omp2b, serial and threads, please? When mpirun from OpenMPI is used, how do we specify the MCA arguments, like ^tcp, openib or self? Could you provide a template job submission script for a CPU-based supercomputer? Thanks a lot.

PrometheusPi commented 3 years ago

Hi @prlWanted, the various backends come from alpaka, our parallelization library. A list and description of all backends can be found in the documentation, including serial, threads, and omp2b.

An example for an omp2b backend (to be precise, omp2b:skylake-avx512) is given here: *.tpl and setup.

sbastrakov commented 3 years ago

So as @PrometheusPi answered above, -b is used to specify the alpaka backend used internally, and therefore what kind of hardware PIConGPU can use. For PIConGPU on CPUs, please use omp2b or the more specific omp2b:[architecture].

Regarding the options specific to OpenMPI: we also faced the need to pass those on some systems, and we normally do so via the .tpl file, like here. Since we (or our users) normally make a .tpl file for each system, or for each partition of a cluster, this way it only needs to be figured out and specified once.
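
For illustration, the mpiexec call inside such a .tpl could carry the MCA options like the sketch below; the !TBG_* placeholder names here are illustrative, use the ones your template actually defines:

# MCA settings are ordinary OpenMPI arguments on the mpiexec line of the .tpl;
# "--mca btl ^tcp" would exclude a transport instead of listing transports explicitly
mpiexec --mca btl self,openib \
    -n !TBG_tasks !TBG_dstPath/input/bin/picongpu !TBG_programParams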

steindev commented 3 years ago

@prlWanted Did you succeed in running PIConGPU? Can this issue be closed?

prlWanted commented 3 years ago

@prlWanted Did you succeed in running PIConGPU? Can this issue be closed?

Yes, thanks to the developers' help, it has been successfully run. Please close it.