shimwell opened this issue 10 months ago
Welcome @shimwell to the GPU party!
Yes, unfortunately, the GPU branch here is not documented at all yet. To compile for an NVIDIA A100 with the LLVM clang compiler, you'd want your cmake line to look something like:
cmake --preset=llvm_a100 -Dcuda_thrust_sort=on -Dsycl_sort=off -Dhip_thrust_sort=off -Ddebug=off -Ddevice_printf=off -Doptimize=on -DCMAKE_INSTALL_PREFIX=./install ..
There are other presets available in the CMakePresets.json file that you can browse if you're interested in different architectures.
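For completeness, here is a minimal sketch of the remaining build steps, assuming the configure command above has already been run from a build directory inside the source tree and that a clang 16+ toolchain is on the PATH (the install path is just the one implied by the CMAKE_INSTALL_PREFIX above):

    cmake --build . -j                 # compile the offloading build with clang
    cmake --install .                  # installs into ./install per CMAKE_INSTALL_PREFIX
    ./install/bin/openmc --version     # quick sanity check that the binary runs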
Currently the NVIDIA nvhpc OpenMP compiler does not work with OpenMC (at least as of the last time I tried it, about 6 months ago), so I'd highly recommend using clang. I think clang v16 or newer should work.
We do have a repo set up with some scripts that make it easier for people to install and run OpenMC + dependencies at https://github.com/jtramm/openmc_offloading_builder
Another item to note is that there are some new command-line options used to control the runtime behavior of OpenMC. If you run plain openmc with no extra flags, the code will run on the host CPU in history-based mode pretty much as normal. There should be a report at the end of the output that confirms where/how the code ran.
To run on GPU in the more optimal event-based mode, you'll need to run something like:
openmc --event -i 2500000 --no-sort-surface-crossing
The --event flag is needed, as it instructs OpenMC to use event-based mode, which defaults to running on the GPU. The -i flag adjusts the limit on the number of particles allowed to be in flight at once on the GPU (since an event kernel over, e.g., 1 billion particles in a batch would use too much memory). Generally on the A100, your best bet is to increase this value as high as you can until you run out of memory, although the benefits taper off after about 10 million particles in flight. All particles in the batch will still be run regardless of this setting -- it just controls how many are run per GPU kernel call (with a lower in-flight count resulting in more kernel calls).

The --no-sort-surface-crossing flag tells OpenMC not to bother sorting particles before the surface crossing kernel executes. You may want to experiment with this flag to see whether it helps. I've found it tends to be really helpful for performance with fusion geometries, but tends to hurt performance a bit on fission geometries. Similarly, you can experiment with the --no-sort-fissionable-xs and --no-sort-non-fissionable-xs cross section flags to control sorting before those kernels as well. Generally, the sorting has a small up-front cost, but can make the kernels execute much more efficiently by improving the memory locality of threads within a warp/block.

The last area to be aware of when coming from CPU is that much larger problem sizes (in terms of the number of particles per batch) are required to saturate the GPU. Running 50,000 particles/batch will result in awful performance on the GPU -- typically performance doesn't saturate until 10 million particles/batch or more.
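As a concrete (purely illustrative) example of tuning these options, one might sweep the in-flight limit and toggle the surface-crossing sort on a test model, while keeping the particles/batch in the model's settings.xml large enough (10 million or more) to saturate the card; the particle counts below are placeholders:

    # Hypothetical tuning sweep; run from a directory containing the model's XML inputs.
    for n in 1000000 5000000 10000000; do
        openmc --event -i $n                               # default: sort before each kernel
        openmc --event -i $n --no-sort-surface-crossing    # skip the surface-crossing sort
    done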
I'll leave this issue open to remind us to add in some documentation so that it's at least clear how to install/use the GPU offloading branch. Let me know though if you have other questions in the interim.
Super! I've been updating my script at the top of this issue to bring it a bit closer to an install on a basic desktop with an NVIDIA GPU.
I was interested to see that there are presets for different GPUs, which I had not anticipated.
I see these are all workstation cards. Is there any possibility of adding desktop cards like the NVIDIA RTX A2000, or is a workstation card required?
Desktop cards work as well! I've run locally on my RTX 3080. Note that some NVIDIA consumer cards have significantly reduced speed for FP64 operations, so performance may be considerably lower than on HPC cards. The llvm_v100 vs. llvm_a100 vs. llvm_h100 presets differ only in which SM level is set, so the same presets may be re-used for consumer cards having the same SM version. We might also add a generic NVIDIA build as well, although I've usually noticed a non-trivial boost in performance when giving the proper SM version for the card.
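If you're not sure which SM level a given consumer card has (and therefore which preset to start from), one way to check on reasonably recent NVIDIA drivers is:

    # Prints the compute capability, e.g. 8.6 (i.e. SM 86) for an RTX 3080 or RTX A2000
    nvidia-smi --query-gpu=name,compute_cap --format=csv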
Hi everyone over here on the GPU fork,
I was just taking a look at installing this and running a few models.
Is there any chance of a few pointers for getting up and running with the install on an NVIDIA card?
I checked to see if install.rst or the CI has any hints, but I couldn't see any OMP_TARGET_OFFLOAD-specific instructions. So far I have this, but should I add some args to the cmake step?
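For reference, OMP_TARGET_OFFLOAD is a standard OpenMP environment variable rather than anything specific to this fork; a hedged sketch of how it is typically used to confirm at runtime that work is actually being offloaded:

    # Standard OpenMP 5.x environment variable (from the spec, not this repo):
    export OMP_TARGET_OFFLOAD=MANDATORY   # abort if no usable device, so a silent CPU fallback is obvious
    # export OMP_TARGET_OFFLOAD=DISABLED  # or force everything to run on the host
    openmc --event -i 2500000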