Closed tom91136 closed 2 years ago
Ready for review again.
Thanks @tom91136. I think I prefer the parameter passing rather than making a structure just to hold the arrays. I think that in a larger code with more arrays, you're just going to have to pass things around rather than keep wrapping things up in bundles to pass to different functions. Ideally we're aiming to write BabelStream in a way that is representative of something much bigger.
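For readers following the thread, the two styles under discussion can be sketched roughly as below. This is a minimal illustration, not code from the PR: `StreamData`, `copy_bundled!`, and `copy_params!` are hypothetical names, and the copy kernel (`c[i] = a[i]`) stands in for any of the kernels.

```julia
# Hypothetical names for illustration only; not taken from the PR.

# Style 1: bundle the arrays in a struct and pass the bundle around.
struct StreamData{T}
    a::Vector{T}
    b::Vector{T}
    c::Vector{T}
end

function copy_bundled!(data::StreamData)
    data.c .= data.a   # copy kernel: c[i] = a[i]
    return nothing
end

# Style 2: pass only the arrays each kernel needs as plain parameters.
function copy_params!(a, c)
    c .= a             # same kernel, explicit arguments
    return nothing
end

data = StreamData(fill(0.1, 8), fill(0.2, 8), zeros(8))
copy_bundled!(data)

a = fill(0.1, 8)
c = zeros(8)
copy_params!(a, c)
```

Both produce the same result; the difference is only in how the arrays travel between functions.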
Performance is surprisingly good across all supported hardware platforms.
:smiley:
Is there something we can do to move this forward? I had a very quick look, but could do a more thorough review, if that helps.
@giordano Thanks for the review! This PR will be used for an upcoming submission to PMBS so I got a few more local changes (I've added a functional oneAPI.jl and KA implementation) that I'm in the process of finalising. I'll incorporate your review and put up a final version for further review by the end of the week. If you're interested, the PMBS submission will also include a compute bound benchmark written in Julia.
@tomdeakin and I had a discussion on the parameter passing and I think we've settled on the current approach being acceptable.
This PR adds the Julia implementation of BabelStream with the following implementations:
- PlainStream.jl - Single-threaded for loop
- ThreadedStream.jl - Threaded implementation with the Threads.@threads macro
- DistributedStream.jl - Process-based parallelism with the @distributed macro
- CUDAStream.jl - Direct port of BabelStream's native CUDA implementation using CUDA.jl
- AMDGPUStream.jl - Direct port of BabelStream's native HIP implementation using AMDGPU.jl

See README.md for details on build and run instructions.
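As a rough sketch of how the three CPU variants differ, the snippet below shows a triad kernel (`a[i] = b[i] + scalar * c[i]`) written as a plain loop and with `Threads.@threads`, and a dot product written with `@distributed`. The function names are hypothetical and this is not the code from the PR. The distributed example uses a reduction rather than triad because, with ordinary `Array`s, each worker process operates on its own serialised copy, so in-place writes made on a worker are not visible to the caller.

```julia
using Distributed  # standard library; provides @distributed

# Hypothetical helper names; not the PR's actual functions.

# Plain: a single-threaded for loop.
function triad_plain!(a, b, c, scalar)
    @inbounds for i in 1:length(a)
        a[i] = b[i] + scalar * c[i]
    end
    return a
end

# Threaded: iterations are split across the threads Julia was started with.
function triad_threaded!(a, b, c, scalar)
    Threads.@threads for i in 1:length(a)
        @inbounds a[i] = b[i] + scalar * c[i]
    end
    return a
end

# Distributed: iterations are split across worker processes and the
# per-iteration values are combined with (+). Captured arrays are
# serialised to each worker. With no workers added, this runs locally.
function dot_distributed(a, b)
    return @distributed (+) for i in 1:length(a)
        a[i] * b[i]
    end
end

n = 1024
b = fill(0.2, n)
c = fill(0.1, n)
a1 = zeros(n)
a2 = zeros(n)

triad_plain!(a1, b, c, 0.4)
triad_threaded!(a2, b, c, 0.4)
d = dot_distributed(a1, b)
```

Starting Julia with `-t N` gives the threaded version more threads, and `addprocs(N)` from `Distributed` adds worker processes for the distributed version.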
Performance is surprisingly good across all supported hardware platforms. All benchmarks use Julia 1.6.1; the specific versions of each package are available in Manifest.toml.
AMDGPU.jl is still in heavy development, although the project reports that most core features are working. At the time of this PR, there are still a few issues that make it unsuitable for production use:
Finally, there isn't a process-based (e.g. MPI) implementation of BabelStream, so the comparison for DistributedStream.jl has been omitted. That said, performance seems to be significantly worse than ThreadedStream.jl due to the added serialisation overhead.

Future work
We should be able to include oneAPI.jl once it is ready for general use.
There's also OpenCL.jl but it simply wraps the OpenCL host API; kernels must still be written in OpenCL C, so this wouldn't be any different from BabelStream's OCLStream.