Closed athas closed 9 months ago
Perhaps the CI scripts should be updated as well?
The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)
We do have CI set up for the Rust, Julia, and Java variants, so it would be good to have it for Futhark too. Since it uses CMake, it would follow the same CI steps as all the other C++ implementations; see https://github.com/UoB-HPC/BabelStream/blob/main/src/ci-test-compile.sh and https://github.com/UoB-HPC/BabelStream/blob/main/.github/workflows/main.yaml.
Alright, I'll add a CI step for Futhark.
I have added a very simple Futhark action. It only tests a single version of cmake (the preinstalled one). If you want, I can also try to fit it into the C++ framework, but I don't think it's worth it (and might make it more difficult to test other Futhark backends).
Thanks for the PR, and sorry for the late reply. I'm taking a look now (running benchmarks, etc.). One thing I've noticed is the lack of a device enumeration API in the Futhark runtime (although apparently you can set a device, or even be presented with a dialog in the OpenCL case). This was a bit problematic, as I'm testing on machines with more than one OpenCL platform. As a workaround, we may have to implement device enumeration by replicating the logic in the OpenCL and CUDA models.
It wouldn't be difficult to add. I can take a swing at it.
@athas I do have to say, the generated C code for multicore CPU is more readable than a good portion of the C libraries out there!
Then the state of C libraries is more dire than I thought.
The Futhark-generated OpenCL/CUDA APIs allow one to select a device by index, but not to enumerate all devices (except through the menu). Would it be OK to only implement the selection, but not the enumeration?
I can just copy the device enumeration code from the cuda and ocl implementations if you would prefer to have full functionality.
I have added device selection now.
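For anyone following along, here is a minimal sketch of what "selection without enumeration" amounts to. It assumes the rule I recall from the Futhark C API documentation for the GPU backends: a device string of the form #k picks the k-th device (counting from zero), and any other string picks the first device whose name contains it. The function name `select_device` and the device names are made up for illustration; this is not the code in the PR.

```python
# Hypothetical sketch of index-or-substring device selection.
# Assumption: "#k" selects the k-th device (zero-based); any other
# string selects the first device whose name contains that string.

def select_device(device_names, spec):
    """Return the index of the device chosen by `spec`."""
    if spec.startswith("#"):
        k = int(spec[1:])  # raises ValueError if not an integer
        if not 0 <= k < len(device_names):
            raise IndexError(f"no device with index {k}")
        return k
    for i, name in enumerate(device_names):
        if spec in name:
            return i
    raise LookupError(f"no device name contains {spec!r}")


if __name__ == "__main__":
    # Made-up device list; a real list would come from the OpenCL or
    # CUDA enumeration code in the existing implementations.
    devices = ["Example GPU 0", "Example CPU", "Example GPU 1"]
    print(select_device(devices, "#2"))   # selection by index
    print(select_device(devices, "CPU"))  # selection by name fragment
```

The caller still has to know which index or name to ask for, which is why a separate enumeration step is useful on machines with several OpenCL platforms.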
I've got benchmark results for a few platforms. On an Nvidia A100, it's on par with the native CUDA/OpenCL implementations:
Which is excellent. For the multicore backend, I think the runtime is lacking NUMA awareness. On a local Ryzen 5900X machine (1 NUMA domain) with dual-channel DDR4 at 3400 MT/s, I'm seeing performance comparable to OpenMP:
>./build/omp-stream --arraysize 536870912
BabelStream
Version: 4.0
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function MBytes/sec Min (sec) Max Average
Copy 25610.437 0.33541 0.36010 0.34216
Mul 25402.003 0.33816 0.35810 0.34445
Add 29010.697 0.44414 0.47372 0.45097
Triad 28878.115 0.44618 0.47433 0.45268
Dot 44377.658 0.19356 0.21538 0.20067
>./build/futhark-stream --arraysize 536870912
BabelStream
Version: 4.0
Implementation: Futhark (parallel CPU)
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function MBytes/sec Min (sec) Max Average
Copy 26107.059 0.32903 0.35351 0.33425
Mul 25928.510 0.33129 0.35788 0.33712
Add 29328.709 0.43933 0.46747 0.44469
Triad 28930.892 0.44537 0.47550 0.45077
Dot 44656.475 0.19236 0.20958 0.19728
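As a quick sanity check on the two tables above, the MBytes/sec column can be roughly reproduced from the array size and the Min (sec) column. This is only a sketch: it assumes that Copy, Mul, and Dot each touch two arrays per run, that Add and Triad touch three, and that 1 MB is 10^6 bytes. The helper name `mbytes_per_sec` is made up for illustration.

```python
# Sketch: recompute BabelStream-style bandwidth from array size and min time.
# Assumption: Copy, Mul and Dot touch 2 arrays per run; Add and Triad touch 3.
ARRAYS_TOUCHED = {"Copy": 2, "Mul": 2, "Add": 3, "Triad": 3, "Dot": 2}


def mbytes_per_sec(kernel, array_elems, min_seconds, elem_bytes=8):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for the fastest run."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * array_elems * elem_bytes
    return 1.0e-6 * bytes_moved / min_seconds


if __name__ == "__main__":
    n = 536870912  # --arraysize used in both runs above, double precision
    # Min (sec) values copied from the OpenMP and Futhark tables above.
    min_times = {
        "OpenMP": {"Copy": 0.33541, "Mul": 0.33816, "Add": 0.44414,
                   "Triad": 0.44618, "Dot": 0.19356},
        "Futhark": {"Copy": 0.32903, "Mul": 0.33129, "Add": 0.43933,
                    "Triad": 0.44537, "Dot": 0.19236},
    }
    for impl, times in min_times.items():
        for kernel, t in times.items():
            print(f"{impl:8s} {kernel:6s} {mbytes_per_sec(kernel, n, t):10.3f}")
```

The recomputed values should land within a couple of MB/s of the reported ones; a small gap is expected because the Min (sec) column is printed rounded to five decimal places.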
Performance is still good on a dual socket Intel Xeon Gold 6338 (2 NUMA domains) system compared to OpenMP:
But for a dual socket AMD EPYC 7713 (8 NUMA domains), the performance is quite poor:
I didn't do the scaling here due to time constraints.
Yes, the multicore backend is completely NUMA-ignorant. The GPU backends are much more mature.
LGTM. I'm trying to validate the CI by merging this with develop. If you're OK with it, please check the "Allow edits by maintainers" box in the PR and I'll push the merge.
Apparently it is impossible for me to do so for weird GitHub reasons. I can see about resolving the conflicts myself.
Thanks for the merge, I've approved it now, let's wait for CI.
CI fails due to an unrelated job running out of disk space. I wonder why that is not an issue on the develop branch.
No worries, it's probably because we ran out of cache due to how big the compilers are (we download and untar NVHPC in the setup). Thanks again and sorry about the slow turnaround.
I'm at SC22 and noticed that BabelStream is pretty popular, and I really like the idea of polyglot benchmark suites. So, here's an implementation in Futhark. I don't know if this is too obscure a language, but it's easy to invoke from C and C++ and so can use the same tooling as all the C++ implementations.