Julia implementation

tom91136 commented 3 years ago

This PR adds a Julia port of BabelStream, consisting of the following implementations:

PlainStream.jl - Single-threaded for loop
ThreadedStream.jl - Threaded implementation using the Threads.@threads macro
DistributedStream.jl - Process-based parallelism using the @distributed macro
CUDAStream.jl - Direct port of BabelStream's native CUDA implementation using CUDA.jl
AMDGPUStream.jl - Direct port of BabelStream's native HIP implementation using AMDGPU.jl
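As a rough illustration of the threaded variant, a triad kernel (a = b + scalar * c) written with Threads.@threads might look like the sketch below. The function name, signature, and array sizes are assumptions made for this example and are not taken from the PR's source.

```julia
# Hypothetical sketch of a threaded triad kernel; names are illustrative only
# and do not come from ThreadedStream.jl.
function triad!(a::Vector{T}, b::Vector{T}, c::Vector{T}, scalar::T) where {T}
    # eachindex(a, b, c) throws if the arrays do not share the same indices
    Threads.@threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + scalar * c[i]
    end
    return a
end

n = 1024
a = zeros(Float64, n)
b = fill(0.2, n)
c = fill(0.1, n)
triad!(a, b, c, 0.4)
```

The number of threads is fixed when Julia starts, for example with `julia --threads 4` or the JULIA_NUM_THREADS environment variable; with a single thread the loop simply runs serially.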

See README.md for details on build and run instructions.

Performance is surprisingly good across all supported hardware platforms. All benchmarks use Julia 1.6.1; the specific versions of each package are recorded in Manifest.toml.

[Figures: omp, cuda]

AMDGPU.jl is still in heavy development, although the project reports that most core features are working. At the time of this PR, there are still a few issues that make it unsuitable for production use:

Kernel performance is inconsistent (tested on a Radeon VII with ROCm 3.10)
The API for device selection is missing, so the implementation had to resort to introspection
Rapid kernel submission seems to cause ref-count overflow issues if we attempt to wait for each kernel to complete, which is a bit counter-intuitive. The workaround is to use hardware-based events for synchronisation.

[Figure: hip]

Finally, there isn't a process-based (e.g. MPI) implementation of BabelStream, so a comparison for DistributedStream.jl has been omitted. That said, its performance seems to be significantly worse than ThreadedStream.jl due to the added serialisation overhead.

Future work

We should be able to include oneAPI.jl once it is ready for general use.

There's also OpenCL.jl but it simply wraps the OpenCL host API; kernels must still be written in OpenCL C, so this wouldn't be any different from BabelStream's OCLStream.

tom91136 commented 3 years ago

Ready for review again.

tomdeakin commented 3 years ago

Thanks @tom91136. I think I prefer the parameter passing rather than making a structure just to hold the arrays. I think that in a larger code with more arrays, you're just going to have to pass things around rather than keep wrapping things up in bundles to pass to different functions. Ideally we're aiming to write BabelStream in a way that is representative of something much bigger.
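For readers following the discussion, the two styles being compared can be sketched roughly as follows, using the add kernel (c = a + b) as the example. The struct and function names here are hypothetical and are not taken from the PR.

```julia
# Style 1 (hypothetical): bundle the arrays in a struct and pass the bundle.
struct StreamData{T}
    a::Vector{T}
    b::Vector{T}
    c::Vector{T}
end

function add_bundled!(data::StreamData)
    data.c .= data.a .+ data.b   # in-place broadcast, no new allocation
    return nothing
end

# Style 2 (hypothetical): pass each array explicitly as a parameter.
function add_params!(a::Vector{T}, b::Vector{T}, c::Vector{T}) where {T}
    c .= a .+ b
    return nothing
end

data = StreamData(fill(1.0, 8), fill(2.0, 8), zeros(8))
add_bundled!(data)

a2, b2, c2 = fill(1.0, 8), fill(2.0, 8), zeros(8)
add_params!(a2, b2, c2)
```

Both versions compute the same result; the difference is only in how the arrays reach the kernel, which is the point being debated here.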

giordano commented 3 years ago

Performance is surprisingly good across all supported hardware platforms.

:smiley:

Is there something we can do to move this forward? I had a very quick look, but could do a more thorough review, if that helps

tom91136 commented 3 years ago

@giordano Thanks for the review! This PR will be used for an upcoming submission to PMBS, so I've got a few more local changes (I've added functional oneAPI.jl and KernelAbstractions.jl (KA) implementations) that I'm in the process of finalising. I'll incorporate your review and put up a final version for further review by the end of the week. If you're interested, the PMBS submission will also include a compute-bound benchmark written in Julia.

@tomdeakin and I had a discussion on the parameter passing and I think we've settled on the current approach being acceptable.

UoB-HPC / BabelStream

Julia implementation #106
