Mapping physics quantities onto hardware for efficient parallelization.
(CI code coverage is bad because the CI tests don't test the CUDA code; these tests need to be run manually.)
Run benchmarks:
julia --optimize=3 --project=@. --threads=$(nproc) kernels/bb.jl
Run with profiling:
/usr/local/cuda/bin/ncu -f -o profile --set full --target-processes all env LD_LIBRARY_PATH="/home/eschnett/julia-1.8/lib/julia:$LD_LIBRARY_PATH" ~/julia-1.8/bin/julia --optimize=3 --project=@. --threads=$(nproc) kernels/bb.jl
The juliaup-installed Julia does not work. One cannot extract PTX or SASS while profiling.
Intel Skylake vs. Nvidia Tesla V100 (numbers are estimates)
CPU term | count on CPU | count on GPU | GPU term |
---|---|---|---|
SIMD lane | 16 | 32 | thread |
SMT thread | 2 | 32 | warp |
superscalar | 6 | 2 | |
OOO | >100 | n/a | n/a |
thread | 40 | ? | block |
core | 40 | 80 | SM |
node | 1 | 1 | GPU |
scalar register | ? | ? | uniform register |
SIMD register | 2688 | 5? | register cache |
L1 cache | 8 kB | 256 kB | register |
L2 cache | 1 MB | 100 kB | L1 cache / shared memory |
L3 cache | ? | ? | L2 cache |
HBM | n/a | 32 GB | memory |
memory | 192 GB | n/a | host memory |