ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

PIConGPU on AMD Instinct MI100 (ROCm) - Question #3838

Closed. denisbertini closed this issue 3 years ago.

denisbertini commented 3 years ago

Dear PIConGPU developers, this is just a question, not an issue. Our cluster at GSI Darmstadt is now being upgraded with new nodes featuring AMD Instinct MI100 GPUs (4 GPUs/node). I would like to know if it would be possible to use PIConGPU on such GPU hardware. According to the work done at Oak Ridge, https://www.amd.com/system/files/documents/oak-ridge-national-laboratory-picongpu.pdf, PIConGPU has been converted using HIP and already tested on such hardware. But I prefer to ask the opinion of the developers directly about possible problems and/or limitations of using PIConGPU on such hardware. If feasible, I would give it a try!

sbastrakov commented 3 years ago

Hello @denisbertini and thanks for your interest in PIConGPU.

Indeed, PIConGPU (and the libraries alpaka and cupla it uses) was ported to support the AMD software stack. It has been tested and running there for quite some time now. As far as I know, we have not yet had any external users on AMD systems, only team members and collaborators on that project. However, we have a set of environment settings for the Spock machine here, which may serve as a reference point for you. In case you have issues or questions, please contact us.

denisbertini commented 3 years ago

Hi Sergei,

Do you know how these libraries (PIConGPU, alpaka, and cupla) were ported? Did the people at Oak Ridge use the HIP converter from ROCm? Is there somebody I can contact for details?

sbastrakov commented 3 years ago

Hi @denisbertini ,

There is an intersection between the developers (people and organizations) of PIConGPU, alpaka, and cupla. Originally, all these projects started from the computational radiation physics group at HZDR. All three are developed in an open-source fashion, and the porting was also done like that, largely by the "normal" developers of that software (with some external help, but it was not handed off to an outside party). You are welcome to contact each project via its GitHub issues; we tend to use them generously as discussion points.

To describe the porting briefly: alpaka was an existing portability library. The idea is that a client code writes its computational kernels using alpaka's abstractions and C++ templates, and at compile time alpaka maps those abstractions to a particular backend implementation, such as CUDA, OpenMP, or HIP. All the implementation-specific details are inside alpaka and not spread through the client application, and the client can stay with a single (C++ templated) source code that can be compiled for different platforms. To support AMD, alpaka added a HIP backend some time ago (there is also work on an OpenMP 5 offloading backend, but for now we use only HIP on AMD). This means that clients could make use of HIP almost automatically as well, so no converters were needed there.
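To make the single-source idea concrete, here is a minimal sketch of what an alpaka kernel can look like. The kernel and names are illustrative, not taken from PIConGPU, and the exact namespaces/macros depend on the alpaka version:

```cpp
#include <alpaka/alpaka.hpp>

#include <cstddef>

// A kernel written once against alpaka's abstractions. The accelerator type
// TAcc is a template parameter, so the same source can be compiled for the
// CUDA, HIP, or CPU/OpenMP back-ends without changes.
struct AxpyKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const& acc, float a, float const* x, float* y, std::size_t n) const
    {
        auto const i = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        if(i < n)
            y[i] = a * x[i] + y[i];
    }
};

// The back-end is selected purely at compile time via the accelerator type, e.g.
//   using Acc = alpaka::AccGpuHipRt<alpaka::DimInt<1u>, std::size_t>;      // AMD via HIP
//   using Acc = alpaka::AccGpuCudaRt<alpaka::DimInt<1u>, std::size_t>;     // NVIDIA via CUDA
//   using Acc = alpaka::AccCpuOmp2Blocks<alpaka::DimInt<1u>, std::size_t>; // CPU via OpenMP
// and the kernel is then launched with alpaka::exec<Acc>(queue, workDiv, AxpyKernel{}, ...).
```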

Cupla is roughly a CUDA-like interface on top of alpaka, so it mostly changes the interface and does not add functionality. Cupla is therefore perhaps a simpler way to port native CUDA code so that it runs on any platform alpaka supports. In that way, it is somewhat similar in purpose to the hipify converter script, which I think is what you meant, but the mechanics of the conversion are very different.
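As a rough, from-memory illustration of what a CUDA-like interface on top of alpaka means in practice (the exact cupla names and launch macro may differ between versions; please consult the cupla documentation), the host-side correspondence looks approximately like this:

```cpp
// Illustrative correspondence between the CUDA runtime API and cupla
// (approximate, from memory; not an exhaustive or authoritative mapping):
//
//   cudaMalloc(&dX, bytes);               ->  cuplaMalloc(&dX, bytes);
//   cudaMemcpy(dX, hX, bytes,
//              cudaMemcpyHostToDevice);   ->  cuplaMemcpy(dX, hX, bytes,
//                                                         cuplaMemcpyHostToDevice);
//   kernel<<<grid, block>>>(args...);     ->  CUPLA_KERNEL(Kernel)(grid, block, 0, 0)(args...);
//
// The kernel itself is compiled through alpaka underneath, which is why the
// result runs on any back-end alpaka supports, not only on CUDA devices.
```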

PIConGPU itself was already written in a single-source portable manner, and all computational kernels were already alpakafied/cuplafied. This pre-existing portability allowed a relatively small port on the PIConGPU side: the kernels are unchanged, only small parts of the host-side logic had to be adjusted (e.g. mapped memory) and some optimizations added. Of course, some bugs were discovered and fixed during the process, but mostly it was done by the power of alpaka (which, again, some PIConGPU developers also contribute to).

psychocoderHPC commented 3 years ago

@denisbertini There is a paper from 2016 where we briefly described how we ported PIConGPU to alpaka (link). If the application is already using cupla/alpaka, moving to AMD via the alpaka HIP back-end comes nearly for free. We are one of six groups in the Oak Ridge CAAR project. This project gives us direct connections to the vendors, better ways to report bugs, and access to hardware. Alpaka was extended to support HIP by Helmholtz-Zentrum Dresden-Rossendorf, CASUS, and open-source developers.

Did the people at Oak Ridge use the HIP converter from ROCm?

The HIP converter is useful for projects that want to switch fully to HIP. For software that should be maintained over years, there is no way around an intermediate library like alpaka, RAJA, Kokkos, ...

Is there somebody I can contact for details?

If you like, you can contact me via mail and we can set up a video call with the PIConGPU and alpaka developers to answer your questions.

steindev commented 3 years ago

There were additional questions via mail. I will document my answers here as well and close the issue. @denisbertini, please re-open this issue or open a new one if you have more questions regarding this topic.

There were two questions:

Regarding the core-to-GPU ratio: one-to-one is sufficient, as the CPU cores do not perform any simulation calculations when the simulation is compiled for GPU runs.

Regarding the number of GPUs for a simulation: this of course depends very much on the use case. In principle you can use anywhere from one to any number of GPUs (if running the simulation on GPUs; PIConGPU can also run solely on CPUs when compiled for CPU runs). The largest number of GPUs we have ever used was ~27600. There is one exception to what I just said: if you do a moving-window simulation, where the volume within which calculations are performed moves with the speed of light to co-propagate with a laser or particle beam, a minimum of 2 GPUs is required.

Now I guess you still want me to tell you some ballpark numbers? First, I have to say that for publications we do highest-resolution runs to keep numerical noise as small as possible. One could perform the same simulations with less resolution and they could still be meaningful, but we cannot be sure, and therefore we usually perform high-resolution runs to be on the safe side. Second, in terms of the number of GPUs and the runtime, there is generally a difference between laser-plasma acceleration of electrons and laser-plasma acceleration of ions. In electron acceleration, the plasmas are orders of magnitude thinner and therefore generally have a lower requirement on resolution. Currently ongoing (high-resolution) simulation campaigns of electron acceleration use a few hundred GPUs to sample the simulation volume and run on the order of 24 h to simulate acceleration over about 200k time steps (several millimeters of gas in which electrons are accelerated). Ion-acceleration plasmas have solid density (a laser on an Al or Cu foil, for example), generally requiring higher resolution to resolve the plasma dynamics compared to electron-acceleration setups. On the other hand, with ion acceleration the simulation volume (in µm^3) is smaller than for electron acceleration. Typical numbers for recently performed (high-resolution) ion-acceleration setups are around one thousand GPUs running for about 12 h over 50k time steps.

But take this with a MOUNTAIN of salt. As I said, these are our cutting-edge, highest-resolution runs that we perform these days, where we set up digital twins of currently ongoing experimental campaigns. Accordingly, these model the full volume of the experiment. Simulations of toy models and/or isolated effects and/or smaller-scale phenomena usually require orders of magnitude less simulation volume and correspondingly fewer GPUs (while keeping the resolution equal). On the other hand, for exploratory parameter scans we typically use lower resolution too (with a correspondingly lower requirement on the number of GPUs) and then perform high-resolution runs only at a few interesting parameter sets, which are then full-volume and highest-resolution and therefore as big as the numbers given above. In the end, all of this depends strongly on the physical scenario modeled in the simulation; it is impossible to provide universal/standard numbers.

To be more concrete with respect to your available hardware: we currently have access to a similar system (https://docs.olcf.ornl.gov/systems/spock_quick_start_guide.html) with 4x MI100 per node and SLURM as the scheduler, where we routinely perform smaller-scale simulations.

From what you said, I do not see any hurdles running PIConGPU there.

Furthermore, in order to estimate the required hardware for a simulation, PIConGPU provides a memory calculator (https://picongpu.readthedocs.io/en/latest/usage/workflows/memoryPerDevice.html) with which you can get an idea of the resolution, domain decomposition, macro-particle number per cell, and simulation volume you can run on your system.
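For a rough feeling of what such an estimate involves (independent of the actual memory calculator, which you should use for real planning), a back-of-the-envelope sketch could look like the following; all sizes are illustrative assumptions, not PIConGPU defaults:

```cpp
// Back-of-the-envelope estimate of the memory footprint per device.
// All sizes below are illustrative assumptions, not PIConGPU defaults;
// use the memory calculator linked above for real planning.
#include <cstdio>

int main()
{
    long const nx = 256, ny = 256, nz = 256;   // cells handled by one device
    long const cells = nx * ny * nz;

    int const fieldComponents = 9;             // e.g. E, B, J with 3 components each
    int const bytesPerValue = 4;               // single precision
    int const particlesPerCell = 2;            // macro-particles per cell
    int const bytesPerParticle = 10 * 4;       // ~10 float attributes (position, momentum, ...)

    double const fieldBytes = double(cells) * fieldComponents * bytesPerValue;
    double const particleBytes = double(cells) * particlesPerCell * bytesPerParticle;
    double const totalGiB = (fieldBytes + particleBytes) / (1024.0 * 1024.0 * 1024.0);

    std::printf("rough estimate: %.2f GiB per device (excluding temporary buffers)\n", totalGiB);
    return 0;
}
```

Such a local domain would leave plenty of headroom on the 32 GB of an MI100; the memory calculator then takes the actual details of your setup into account.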