UoB-HPC / BabelStream

STREAM, for lots of devices written in many programming models

Add Futhark implementation #146

Status: Closed (athas closed this 9 months ago)

athas commented 1 year ago

I'm at SC22 and noticed that BabelStream is pretty popular, and I really like the idea of polyglot benchmark suites. So, here's an implementation in Futhark. I don't know if this is too obscure a language, but it's easy to invoke from C and C++ and so can use the same tooling as all the C++ implementations.

$ cmake -Bbuild -H. -DMODEL=futhark -DFUTHARK_BACKEND=opencl
$ cmake --build build
$ build/futhark-stream
Version: 4.0
Implementation: Futhark (OpencL)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1319634.622 0.00041     0.00141     0.00081
Mul         1338486.404 0.00040     0.00176     0.00081
Add         1355367.262 0.00059     0.00193     0.00098
Triad       1364525.941 0.00059     0.00196     0.00102
Dot         1315743.983 0.00041     0.00178     0.00079

Munksgaard commented 1 year ago

Perhaps the CI scripts should be updated as well?

athas commented 1 year ago

The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)

tom91136 commented 1 year ago

> The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)

We do have CI set up for the Rust, Julia, and Java variants, so it would be good to have it for Futhark too. Since it uses CMake, it would follow the same CI steps as all the other C++ implementations; see https://github.com/UoB-HPC/BabelStream/blob/main/src/ci-test-compile.sh and https://github.com/UoB-HPC/BabelStream/blob/main/.github/workflows/main.yaml.

athas commented 1 year ago

Alright, I'll add a CI step for Futhark.

athas commented 1 year ago

I have added a very simple Futhark action. It only tests a single version of CMake (the preinstalled one). If you want, I can also try to fit it into the C++ framework, but I don't think it's worth it (and it might make it more difficult to test other Futhark backends).

tom91136 commented 1 year ago

Thanks for the PR and sorry for the late reply. I'm taking a look now (running benchmarks, etc.). One thing I've noticed is the lack of a device enumeration API in the Futhark runtime (although apparently you can set a device, or even be presented with a dialog in the OpenCL case). This was a bit problematic, as I'm testing on machines with more than one OpenCL platform. As a workaround, we may have to implement device enumeration by replicating the logic in the OpenCL and CUDA models.
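
To illustrate the kind of enumeration meant here, the following is a minimal sketch, assuming the approach is to flatten the devices of every platform into one list so that a single index identifies a device even with several platforms installed. It is not the Futhark API and not BabelStream's actual code; the platform and device names and the function names `flatten_devices` and `select_device` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: platform name -> device names on that platform.
# A real implementation would query these from the OpenCL or CUDA runtime.
EXAMPLE_PLATFORMS: Dict[str, List[str]] = {
    "Hypothetical Platform A": ["GPU 0", "GPU 1"],
    "Hypothetical Platform B": ["CPU 0"],
}


def flatten_devices(platforms: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return every (platform, device) pair in platform order."""
    return [(name, dev) for name, devs in platforms.items() for dev in devs]


def select_device(platforms: Dict[str, List[str]], index: int) -> Tuple[str, str]:
    """Pick a device by its position in the flattened list."""
    devices = flatten_devices(platforms)
    if not 0 <= index < len(devices):
        raise IndexError(f"device index {index} out of range (0..{len(devices) - 1})")
    return devices[index]


if __name__ == "__main__":
    # Print the flattened list, one indexed entry per device.
    for i, (platform, device) in enumerate(flatten_devices(EXAMPLE_PLATFORMS)):
        print(f"{i}: {device} ({platform})")
```

With a flattened list like this, the index stays unambiguous across platforms, which is the property that matters on machines with more than one OpenCL platform.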

athas commented 1 year ago

It wouldn't be difficult to add. I can take a swing at it.

tom91136 commented 1 year ago

@athas I do have to say, the generated C code for multicore CPU is more readable than a good portion of the C libraries out there!

athas commented 1 year ago

Then the state of C libraries is more dire than I thought.

The Futhark-generated OpenCL/CUDA APIs allow one to select a device by index, but not to enumerate all devices (except through the menu). Would it be OK to only implement the selection, but not the enumeration?

athas commented 1 year ago

I can just copy the device enumeration code from the cuda and ocl implementations if you would prefer to have full functionality.

athas commented 1 year ago

I have added device selection now.

tom91136 commented 1 year ago

I've got benchmark results for a few platforms. For the Nvidia A100, it's on par with the native CUDA/OpenCL implementations:

[Figure: BabelStream 4.0 on A100 40G (array size = 4.3 GB)]

That is excellent. For the multicore backend, I think the runtime lacks NUMA awareness. On a local Ryzen 5900X machine (1 NUMA domain) with dual-channel DDR4 at 3400 MT/s, I'm seeing performance comparable to OpenMP:

>./build/omp-stream --arraysize 536870912
Version: 4.0
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        25610.437   0.33541     0.36010     0.34216
Mul         25402.003   0.33816     0.35810     0.34445
Add         29010.697   0.44414     0.47372     0.45097
Triad       28878.115   0.44618     0.47433     0.45268
Dot         44377.658   0.19356     0.21538     0.20067
>./build/futhark-stream --arraysize 536870912
Version: 4.0
Implementation: Futhark (parallel CPU)
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        26107.059   0.32903     0.35351     0.33425
Mul         25928.510   0.33129     0.35788     0.33712
Add         29328.709   0.43933     0.46747     0.44469
Triad       28930.892   0.44537     0.47550     0.45077
Dot         44656.475   0.19236     0.20958     0.19728
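
As a sanity check on the two tables above, the MBytes/sec column can be reproduced from the array size and the minimum time. The sketch below assumes STREAM-style traffic counting (Copy, Mul, and Dot touch two arrays; Add and Triad touch three) and 1 MB = 10^6 bytes; those counts and the function name `mbytes_per_sec` are assumptions, chosen because they match the reported figures to within the rounding of the printed times.

```python
# Bytes per double-precision element ("Precision: double" in the output).
BYTES_PER_ELEMENT = 8

# Number of arrays each kernel reads or writes (assumed STREAM-style counting).
ARRAYS_TOUCHED = {"Copy": 2, "Mul": 2, "Add": 3, "Triad": 3, "Dot": 2}


def mbytes_per_sec(kernel: str, array_elements: int, min_seconds: float) -> float:
    """Return bandwidth in MB/s (1 MB = 10**6 bytes) for the fastest run."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * array_elements * BYTES_PER_ELEMENT
    return bytes_moved / min_seconds / 1e6


if __name__ == "__main__":
    n = 536870912  # value passed to --arraysize in the runs above
    # Minimum times copied from the OpenMP table above.
    omp_min = {"Copy": 0.33541, "Mul": 0.33816, "Add": 0.44414,
               "Triad": 0.44618, "Dot": 0.19356}
    for kernel, seconds in omp_min.items():
        print(f"{kernel:<6} {mbytes_per_sec(kernel, n, seconds):12.1f} MB/s")
```

Small differences from the printed column are expected, because the benchmark computes bandwidth from unrounded timings while the table shows times to five decimal places.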

Performance is still good compared to OpenMP on a dual-socket Intel Xeon Gold 6338 system (2 NUMA domains):

[Figure: BabelStream 4.0 on 2x Xeon Gold 6338 (Triad kernel, array size = 4.3 GB)]

But on a dual-socket AMD EPYC 7713 system (8 NUMA domains), the performance is quite poor:

[Figure: BabelStream 4.0 on EPYC 7713 (array size = 4.3 GB)]

I didn't do the scaling here due to time constraints.

athas commented 1 year ago

Yes, the multicore backend is completely NUMA-ignorant. The GPU backends are much more mature.

tom91136 commented 9 months ago

LGTM. I'm trying to validate the CI by merging this with develop. If you're OK with it, please check the "Allow edits by maintainers" box in the PR and I'll push the merge.

athas commented 9 months ago

Apparently it is impossible for me to do so for weird GitHub reasons. I can see about resolving the conflicts myself.

tom91136 commented 9 months ago

Thanks for the merge. I've approved it now; let's wait for CI.

athas commented 9 months ago

CI fails due to an unrelated job running out of disk space. I wonder why that is not an issue on the develop branch.

tom91136 commented 9 months ago

No worries, it's probably because we ran out of cache due to how big the compilers are (we download and untar NVHPC in the setup). Thanks again and sorry about the slow turnaround.