ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Looking for technical documentation #4199

Closed etiennemlb closed 2 years ago

etiennemlb commented 2 years ago

Hi, I took a look at the GitHub wiki and the Read the Docs pages without finding much about the implementation details of PIConGPU.

I'm looking for the particle-to-grid deposition algorithm. What I found is not complete enough, much like the paper Radiative Signatures of the Relativistic Kelvin-Helmholtz Instability.

I tried looking at the code, but the implementation seemed very scattered to me and I couldn't glue the pieces together in the short amount of time I had. Where can I find such information in the docs or the code?

Thanks

sbastrakov commented 2 years ago

Hello @etiennemlb ,

We have two types of particle-grid deposition-type operations. One is current deposition, a core part of the PIC computational loop. There we implement several algorithms, and the implementations are heavily parametrized, since we use different strategies depending on the hardware (and potentially the problem in question) for performance reasons. It is indeed far from the easiest-to-read part of PIConGPU. If you could formulate a more concrete question (e.g. what is our algorithm or reasoning, or where is the implementation of X), we would gladly explain or point you to the relevant part of the code or docs.

The other type of particle-grid operation is calculating aggregated quantities, normally for output purposes, e.g. the gridded average energy. This one is not in the hot loops and is conceptually much simpler. Code-wise it is almost completely independent of current deposition; only the implementation of macroparticle shapes / assignment functions is shared.

etiennemlb commented 2 years ago

@sbastrakov :

what is our algorithm or reasoning, or where is implementation of

I'm looking for high-level pseudocode of the implementation of the particle-to-grid current deposition as part of the classical PIC loop. In particular, what can one do to minimise the effects of "particle drifting" leading to random access on the grid? On CPU, domain decomposition and a coarse spatial sorting could do the trick (right?). On GPU, sorting is not trivially done and global memory atomic ops are quite costly. What are you doing to keep the GPU busy? I'm not talking about "simple" communication/computation overlap, but about the differences between a "classical" CPU implementation and, for instance, PIConGPU's implementation.

If the algorithms implemented in PIConGPU differ a lot from one another, I'd especially like to get more information on the Esirkepov implementation.

sbastrakov commented 2 years ago

Thanks for the clarifications @etiennemlb . I will reply in detail tomorrow, during German office hours.

For a very quick reply (to be extended tomorrow): PIConGPU uses supercells, normally of 256 nearby cells on GPUs. Each supercell has independent storage for the macroparticles located there. Supercells provide spatial locality; there is no additional sorting inside a supercell.

Our typical parallel pattern for particle-grid kernels (current deposition, field interpolation & particle pusher) is one block per supercell, with a parallel loop over the particles in global memory (effectively a streaming access) and the local fields kept in shared memory. This is the typical, though not the only, pattern.
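To make that pattern a bit more concrete, here is a minimal, hedged CUDA sketch of the idea; it is not PIConGPU code, and names such as SUPERCELL_CELLS, Particle, supercellOffset and the flat supercellFieldBase field layout are simplifications made up for this illustration:

```cpp
// Minimal sketch of the "one block per supercell" pattern, assuming a
// simplified flat memory layout; real PIConGPU kernels are far more general.
constexpr int SUPERCELL_CELLS = 256;                 // e.g. 8x8x4 cells

struct Particle { float x, y, z, ux, uy, uz, w; };   // position, momentum, weighting

__global__ void pushParticlesPerSupercell(Particle* particles,
                                          int const* supercellOffset,    // first particle per supercell
                                          int const* supercellCount,     // particle count per supercell
                                          int const* supercellFieldBase, // first field cell per supercell
                                          float3 const* globalE, float3 const* globalB)
{
    int const sc = blockIdx.x;   // one block handles exactly one supercell

    // Stage the fields this supercell needs (guard cells omitted for brevity)
    // into shared memory once, instead of re-reading global memory per particle.
    __shared__ float3 localE[SUPERCELL_CELLS];
    __shared__ float3 localB[SUPERCELL_CELLS];
    for (int c = threadIdx.x; c < SUPERCELL_CELLS; c += blockDim.x)
    {
        localE[c] = globalE[supercellFieldBase[sc] + c];
        localB[c] = globalB[supercellFieldBase[sc] + c];
    }
    __syncthreads();

    // Parallel loop over the supercell's particles: consecutive threads touch
    // consecutive particles, i.e. effectively a streaming access to global memory.
    for (int p = supercellOffset[sc] + threadIdx.x;
         p < supercellOffset[sc] + supercellCount[sc]; p += blockDim.x)
    {
        Particle part = particles[p];
        // ... interpolate localE/localB to the particle position and push ...
        particles[p] = part;
    }
}
```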

The basic approach is described in the SC'13 paper; a more recent overview is in the 2019 PhD thesis of Axel Huebl.

sbastrakov commented 2 years ago

So, to expand on my reply from yesterday.

PIConGPU's core approach for organizing macroparticles and avoiding random global memory access in particle-grid operations is supercells. Supercells are just logical Cartesian blocks of nearby cells, all of the same size. Each supercell has separate storage for the particles located in its cells, in array-of-structure-of-arrays (AoSoA) fashion. What PIConGPU papers and code call frames is the storage for a fixed number of macroparticles (the "SoA" part of the AoSoA I mentioned), and each supercell is an arbitrarily sized collection of frames.
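A hedged sketch of that layout (illustrative hand-written types only, not the actual PIConGPU frame classes):

```cpp
// Illustration of the AoSoA idea: a frame is a fixed-size structure-of-arrays,
// and a supercell owns an arbitrarily long chain of such frames.
constexpr int FRAME_SIZE = 256;   // macroparticles per frame (illustrative value)

struct Frame
{
    // SoA part: one array per particle attribute, FRAME_SIZE entries each.
    float posX[FRAME_SIZE], posY[FRAME_SIZE], posZ[FRAME_SIZE];
    float momX[FRAME_SIZE], momY[FRAME_SIZE], momZ[FRAME_SIZE];
    float weighting[FRAME_SIZE];

    Frame* next = nullptr;   // frames of one supercell are chained together
    int numParticles = 0;    // only the last frame may be partially filled
};

struct Supercell
{
    Frame* firstFrame = nullptr;   // the "Ao" part: a list of SoA frames
    Frame* lastFrame = nullptr;
};
```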

Basically, at each step of the PIC loop we handle supercells independently, except for one special bookkeeping stage. After the particle pusher we have to keep the data structure in order by tracking "migrating" macroparticles and moving them to their target supercells. Because supercells span many cells (and depending on the simulation case, of course), only a relatively small percentage of macroparticles changes supercell during each time step. Still, this bookkeeping is in my opinion very difficult to implement well on GPUs; over the years PIConGPU has accumulated lots of optimizations there, and by now it is quite good and more than pays for itself elsewhere.
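As a purely serial, hedged illustration of that bookkeeping idea (the real implementation is a set of heavily optimized GPU kernels; the vector-of-vectors storage and the supercellOf() mapping are stand-ins):

```cpp
#include <cstddef>
#include <vector>

// Minimal particle stand-in; see the frame sketch above for a closer layout.
struct P { float x, y, z; /* momentum, weighting, ... */ };

// After the pusher: move every macroparticle that left its supercell into the
// storage of the target supercell, keeping each supercell's storage dense.
void shiftBetweenSupercells(std::vector<std::vector<P>>& particlesPerSupercell,
                            std::size_t (*supercellOf)(P const&))
{
    for (std::size_t s = 0; s < particlesPerSupercell.size(); ++s)
    {
        auto& own = particlesPerSupercell[s];
        for (std::size_t i = 0; i < own.size();)
        {
            std::size_t const target = supercellOf(own[i]);
            if (target != s)
            {
                // Particle migrated: append it to the target supercell and fill
                // the resulting hole with the last particle of this supercell.
                particlesPerSupercell[target].push_back(own[i]);
                own[i] = own.back();
                own.pop_back();
            }
            else
            {
                ++i;
            }
        }
    }
}
```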

Thus, we do not strictly sort macroparticles per se, but group them with sufficient locality. While different macroparticles in the same supercell may use different field and current density values, a supercell overall only touches a small and compact part of the global fields. This compact part fits into shared memory (in CUDA and HIP terms) on GPUs. In PIConGPU the default supercell size is 8x8x4, so 256 cells, but it is configurable; the limitation on the supercell size is precisely that those local fields must fit into shared memory.

As far as I am aware, similar grouping is also used on CPUs nowadays; to my (far from expert) understanding it is one of the (but not the only) functions of SMILEI's particle patches, and it generally makes a lot of sense for CPU threading + vectorization parallelism. PIConGPU also runs on CPUs with the same approach, depending on how you compile it. Straight global sorting is in my opinion a classical approach that probably still exists in some codes, but it is generally losing popularity, at least for the flavor of PIC that we do (edit: I initially thought you meant "sorting" in this narrower sense).

In an oversimplified view (see a more realistic explanation for Esirkepov below), the basic idea of PIConGPU particle-grid operations is:

- one block of threads processes one supercell;
- the block first copies the field values its supercell needs into shared memory;
- the threads loop over the supercell's macroparticles, reading them from global memory in a streaming fashion and accumulating their contributions in shared memory (with shared memory atomics where needed);
- once all macroparticles of the supercell are processed, the block adds its shared memory result back to the global grid.
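A hedged CUDA sketch of this simplified scheme, not actual PIConGPU code: it deposits a single placeholder quantity per cell just to show where the atomics happen, and reuses the illustrative placeholders (SUPERCELL_CELLS, Particle, a hypothetical cellIndexInSupercell()) from the sketch earlier in the thread, redeclared so the snippet stands on its own:

```cpp
constexpr int SUPERCELL_CELLS = 256;                 // e.g. 8x8x4 cells
struct Particle { float x, y, z, ux, uy, uz, w; };   // same placeholders as before

// Hypothetical mapping from a particle position to its cell inside the supercell.
__device__ int cellIndexInSupercell(Particle const& p)
{
    return (static_cast<int>(p.x) % 8) + 8 * (static_cast<int>(p.y) % 8)
         + 64 * (static_cast<int>(p.z) % 4);
}

__global__ void depositPerSupercell(Particle const* particles,
                                    int const* supercellOffset,    // first particle per supercell
                                    int const* supercellCount,     // particle count per supercell
                                    int const* supercellFieldBase, // first global cell per supercell
                                    float* globalJ)                // one current component
{
    int const sc = blockIdx.x;                       // one block per supercell

    // Per-supercell accumulation buffer in shared memory.
    __shared__ float localJ[SUPERCELL_CELLS];
    for (int c = threadIdx.x; c < SUPERCELL_CELLS; c += blockDim.x)
        localJ[c] = 0.0f;
    __syncthreads();

    // Streaming loop over the supercell's macroparticles; contributions of
    // different threads to the same cell meet only in shared memory atomics.
    for (int p = supercellOffset[sc] + threadIdx.x;
         p < supercellOffset[sc] + supercellCount[sc]; p += blockDim.x)
    {
        Particle const part = particles[p];
        atomicAdd(&localJ[cellIndexInSupercell(part)],
                  part.w * part.ux);                 // placeholder contribution
    }
    __syncthreads();

    // One global reduction per cell and supercell at the end.
    for (int c = threadIdx.x; c < SUPERCELL_CELLS; c += blockDim.x)
        atomicAdd(&globalJ[supercellFieldBase[sc] + c], localJ[c]);
}
```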

So now to how this is actually applied for Esirkepov current deposition. As I mentioned before, there is no single "standard PIConGPU" implementation of Esirkepov deposition: we know, and have measured, that the best way to implement it depends on the problem and the system. And as new hardware appears, and also new software (e.g. new HIP versions for AMD), those properties change. Thus, we implement it in a parametrized fashion, as a set of combinable building blocks for the different aspects of the implementation. This is why we cannot point you to a single function that does the whole Esirkepov scheme.

The most important customization point of the current deposition is how we add up the contributions of different macroparticles. It concerns the level of memory on which the reduction inside a supercell is performed: shared memory, as in the simplified scheme described above, or global memory. But it is not only the level of memory; it is also how we organize the parallel processing of the supercells in the first place: either one parallel loop over all supercells with a reduction of their (already reduced) results, or a series of checkerboard-style parallel loops with no need for a reduction. These two aspects are controlled by choosing one of the predefined strategies. There is also another customization point for how macroparticle shapes are evaluated; it is less "algorithmic" than the reduction and more of a performance-tuning knob, but also important.
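For contrast, here is a hedged sketch of the other kind of reduction mentioned above, where every contribution goes straight to the global field with a global atomic; it is a conceptual illustration with the same made-up placeholders, not one of the actual PIConGPU strategies:

```cpp
// Same illustrative placeholders as in the previous sketch, redeclared here.
struct Particle { float x, y, z, ux, uy, uz, w; };

__device__ int cellIndexInSupercell(Particle const& p)   // hypothetical mapping, as above
{
    return (static_cast<int>(p.x) % 8) + 8 * (static_cast<int>(p.y) % 8)
         + 64 * (static_cast<int>(p.z) % 4);
}

// Keeps the block-per-supercell organization, but drops the shared memory
// buffer, so every macroparticle contribution hits the global field directly.
__global__ void depositGlobalAtomics(Particle const* particles,
                                     int const* supercellOffset,
                                     int const* supercellCount,
                                     int const* supercellFieldBase,
                                     float* globalJ)
{
    int const sc = blockIdx.x;
    for (int p = supercellOffset[sc] + threadIdx.x;
         p < supercellOffset[sc] + supercellCount[sc]; p += blockDim.x)
    {
        Particle const part = particles[p];
        atomicAdd(&globalJ[supercellFieldBase[sc] + cellIndexInSupercell(part)],
                  part.w * part.ux);   // placeholder contribution
    }
}
```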

In principle, as usual for PIConGPU, a user is free to configure anything, e.g. to make a custom combination of those building blocks. We provide good default combinations depending on the platform here; they are chosen according to our measurements on different systems.

As a note here, I think your statement

global memory atomic ops are quite costly

used to be the case in the past, but is probably no longer true for modern GPUs. Since we (well, and any user) can easily switch to reducing in global memory, it is not bad. Of course, it depends on the particular case; I am just saying that GPU global memory atomics are no longer an automatic no.

@psychocoderHPC is a guru in this part of the code; perhaps he could add something to my explanations or answer more technical questions.

etiennemlb commented 2 years ago

Hi, thanks a lot for taking the time to answer.

By sorting I meant binning, as discussed in this old paper: Fast parallel particle-to-grid interpolation for plasma PIC simulations on the GPU. I interpret that as a coarse spatial particle sorting.

My mental model behind "global memory atomic ops are quite costly" is the need to load from global memory + the cost of the atomic ops themselves (serialized) + potentially invalidating a cache line (see the MESI protocol; I don't know whether it exists on GPUs). With shared memory atomics, you remove the cache and global memory load issues. I'd love to get more insight on the subject though :).

Again, thanks for the details.

sbastrakov commented 2 years ago

By sorting I meant binning, as discussed in this old paper

Ah, then yes, our approach to particle organization is generally similar to that paper's. Details differ, e.g. we have a collection of frames for each supercell / bin, but those differences are more on the technical than on the conceptual side.

Regarding the atomics on GPUs, hopefully @psychocoderHPC can comment.

psychocoderHPC commented 2 years ago

My mental model behind "global memory atomic ops are quite costly" is the need to load from global memory + the cost of the atomic ops themselves (serialized) + potentially invalidating a cache line (see the MESI protocol; I don't know whether it exists on GPUs). With shared memory atomics, you remove the cache and global memory load issues.

Global memory atomics are more costly, as you said. Global memory atomics are performed in the L2/L3 memory of the GPU; this memory is shared between all SMs and is not really large. For sure, atomic operations will invalidate L1 caches. The advantage of performing shared memory atomics is that you avoid atomics serialization because you have already decomposed your domain: only threads within a block can access the same shared memory. Depending on the GPU generation, shared memory atomics are built-in hardware functions. If I remember correctly, NVIDIA also pre-fetches memory for shared memory atomics.

Keep in mind that the performance of atomics also depends very strongly on the access pattern of your algorithm.

In PIConGPU we always try to use shared memory atomics if possible.
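As a toy illustration of that point (nothing PIConGPU-specific): a 256-bin histogram kernel that privatizes the bins in shared memory, so atomic contention is limited to the threads of one block and only one global atomic per bin and block remains:

```cpp
// Each block accumulates into a private shared-memory copy of the bins and only
// merges into the global result once at the end.
__global__ void histogramShared(unsigned char const* data, int n, unsigned int* bins)
{
    __shared__ unsigned int localBins[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        localBins[b] = 0u;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localBins[data[i]], 1u);    // contention limited to one block

    __syncthreads();
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], localBins[b]);     // one global atomic per bin and block
}
```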

Maybe you are also interested in the short discussion I had with AMD about emulated atomics based on TTAS and TAS: https://github.com/ROCm-Developer-Tools/HIP/pull/1816. Unfortunately, I never got an answer about which access pattern they used to benchmark their atomic functions.

etiennemlb commented 2 years ago

In PIConGPU we always try to use shared memory atomics if possible.

Is it correct to say that the particle-to-grid deposition kernel's computation is bound by the speed at which you can do the shared memory atomic operations?

sbastrakov commented 2 years ago

If one employs such a high-level view, I would say so, maybe with some additional clarifications.

In general I would say that reality is more complicated than viewing atomics or shared memory as the single main factor. For example, the integer index operations we do for accessing 2D/3D arrays are definitely non-negligible: we do lots of them, and they may share the same resources with FP32 operations depending on the GPU type. So if one only views everything in terms of FLOP/s, those would appear as a large portion of unaccounted work in that model.
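Just to make the index-arithmetic point concrete with a trivial, made-up sketch: even a single access into a linearized 3D field costs a handful of integer multiplications and additions, and such accesses happen many times per macroparticle and time step.

```cpp
// Linearized 3D field access: the index computation alone is two integer
// multiplications and two additions, executed for every field read or write.
__device__ __forceinline__ float field3D(float const* f,
                                         int x, int y, int z, int nx, int ny)
{
    return f[x + nx * (y + ny * z)];
}
```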

steindev commented 2 years ago

@etiennemlb is your original question answered? If so, please close the issue.