shimwell opened this issue 10 months ago
Welcome @shimwell to the GPU party!
Yes, unfortunately, the GPU branch here is not documented at all yet. To compile for an NVIDIA A100 with the LLVM clang compiler, you'd want your cmake line to look something like:
cmake --preset=llvm_a100 -Dcuda_thrust_sort=on -Dsycl_sort=off -Dhip_thrust_sort=off -Ddebug=off -Ddevice_printf=off -Doptimize=on -DCMAKE_INSTALL_PREFIX=./install ..
There are other presets available in the CMakePresets.json file that you can browse if you're interested in different architectures.
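For completeness, here is a minimal sketch of the remaining build steps, assuming the configure command above has already been run from a build directory inside the source tree and that a clang 16+ toolchain is on the PATH (the install path is just the one implied by the CMAKE_INSTALL_PREFIX above):

    cmake --build . -j                 # compile the offloading build with clang
    cmake --install .                  # installs into ./install per CMAKE_INSTALL_PREFIX
    ./install/bin/openmc --version     # quick sanity check that the binary runs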
Currently the NVIDIA nvhpc OpenMP compiler does not work with OpenMC (at least as of the last time I tried it, about 6 months ago), so I'd highly recommend using clang. I think clang v16 or newer should work.
We do have a repo set up with some scripts that make it easier for people to install and run OpenMC + dependencies at https://github.com/jtramm/openmc_offloading_builder
Another item to note is that there are some new command-line options used to control the runtime behavior of OpenMC. If you run plain openmc with no extra flags, the code will run on the host CPU in history-based mode pretty much as normal. There should be a report at the end of the output that confirms where/how the code ran.
To run on GPU in the more optimal event-based mode, you'll need to run something like:
openmc --event -i 2500000 --no-sort-surface-crossing
The --event flag is needed, as it instructs OpenMC to use event-based mode, which defaults to running on the GPU. The -i flag adjusts the limit on the number of particles allowed to be in flight at once on the GPU (since an event kernel over, e.g., 1 billion particles in a batch would use too much memory). Generally on the A100, your best bet is to increase this value as high as you can until you run out of memory, although the benefits taper off after about 10 million particles in flight. All particles in the batch will still be run regardless of this setting -- it just controls how many are run per GPU kernel call (with a lower in-flight count resulting in more kernel calls).

The --no-sort-surface-crossing flag tells OpenMC not to bother sorting particles before the surface crossing kernel executes. You may want to experiment with this flag to see whether it helps. I've found it tends to be really helpful for performance with fusion geometries, but tends to hurt performance a bit on fission geometries. Similarly, you can experiment with the --no-sort-fissionable-xs and --no-sort-non-fissionable-xs cross section flags to control sorting before those kernels as well. Generally, the sorting has a small up-front cost, but can make the kernels execute much more efficiently by improving the memory locality of threads within a warp/block.

The last area to be aware of when coming from CPU is that much larger problem sizes (in terms of the number of particles per batch) are required to saturate the GPU. Running 50,000 particles/batch will result in awful performance on the GPU -- typically performance doesn't saturate until 10 million particles/batch or more.
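As a concrete (purely illustrative) example of tuning these options, one might sweep the in-flight limit and toggle the surface-crossing sort on a test model, while keeping the particles/batch in the model's settings.xml large enough (10 million or more) to saturate the card; the particle counts below are placeholders:

    # Hypothetical tuning sweep; run from a directory containing the model's XML inputs.
    for n in 1000000 5000000 10000000; do
        openmc --event -i $n                               # default: sort before each kernel
        openmc --event -i $n --no-sort-surface-crossing    # skip the surface-crossing sort
    done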
I'll leave this issue open to remind us to add in some documentation so that it's at least clear how to install/use the GPU offloading branch. Let me know though if you have other questions in the interim.
Super! I've been updating my script at the top of this issue to bring it a bit closer to an install on a basic desktop with an NVIDIA GPU.
I was interested to see that there are presets for different GPUs, which I had not anticipated.
I see these are all workstation cards. Is there any possibility of adding desktop cards like the NVIDIA RTX A2000, or is a workstation card required?
Desktop cards work as well! I've run locally on my RTX 3080. Note that some NVIDIA consumer cards have significantly reduced speed for FP64 operations, so performance may be considerably lower than on HPC cards. The llvm_v100 vs. llvm_a100 vs. llvm_h100 presets differ only in which SM level is set, so the same presets may be re-used for consumer cards having the same SM version. We might also add a generic NVIDIA build as well, although I've usually noticed a non-trivial boost in performance when giving the proper SM version for the card.
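If you're not sure which SM level a given consumer card has (and therefore which preset to start from), one way to check on reasonably recent NVIDIA drivers is:

    # Prints the compute capability, e.g. 8.6 (i.e. SM 86) for an RTX 3080 or RTX A2000
    nvidia-smi --query-gpu=name,compute_cap --format=csv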
Hi everyone over here on the GPU fork,
I was just taking a look at installing this and running a few models.
Is there any chance of a few pointers for getting up and running with the install on an NVIDIA card?
I checked to see if install.rst or the CI has any hints, but I couldn't see any OMP_TARGET_OFFLOAD-specific instructions. So far I have this, but should I add some args to the cmake step?
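For reference, OMP_TARGET_OFFLOAD is a standard OpenMP environment variable rather than anything specific to this fork; a hedged sketch of how it is typically used to confirm at runtime that work is actually being offloaded:

    # Standard OpenMP 5.x environment variable (from the spec, not this repo):
    export OMP_TARGET_OFFLOAD=MANDATORY   # abort if no usable device, so a silent CPU fallback is obvious
    # export OMP_TARGET_OFFLOAD=DISABLED  # or force everything to run on the host
    openmc --event -i 2500000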