CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

CLBlast: The tuned OpenCL BLAS library

Build status is tracked for the following platforms: Windows, Linux/macOS.

Test status is tracked on the following test machines (thanks to ArrayFire):

clblast-linux-nvidia-a100
clblast-linux-nvidia-k80
clblast-linux-nvidia-p100
clblast-linux-nvidia-t4
clblast-linux-nvidia-v100
clblast-windows-amd-r9
clblast-windows-nvidia-m6000

CLBlast is a lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators. CLBlast implements BLAS routines: basic linear algebra subprograms operating on vectors and matrices. See the CLBlast website for performance reports on some devices.

The library is not tuned for all possible OpenCL devices: if out-of-the-box performance is poor, please run the tuners first. See the docs for a list of already tuned devices and instructions on how to tune yourself and contribute to future releases of the CLBlast library.

Why CLBlast and not clBLAS or cuBLAS?

Use CLBlast instead of clBLAS:

Use CLBlast instead of cuBLAS:

When not to use CLBlast:

Getting started

CLBlast can be compiled with minimal dependencies (apart from OpenCL) in the usual CMake way, e.g.:

mkdir build && cd build
cmake ..
make

Detailed instructions for various platforms can be found here.

Like clBLAS and cuBLAS, CLBlast requires device buffers (in its case, OpenCL buffers) as arguments to its routines. This means you have full control over the OpenCL buffers and the host-device memory transfers. CLBlast's API is designed to resemble clBLAS's C API as much as possible, so little integration effort is needed if clBLAS was used previously. Using CLBlast starts by including the C++ header:

#include <clblast.h>

Or alternatively the plain C version:

#include <clblast_c.h>

Afterwards, any of CLBlast's routines can be called directly: there is no need to initialize the library. The available routines and their required arguments are described in the above-mentioned include files and in the included API documentation. The API is kept as close as possible to the Netlib BLAS and the cuBLAS/clBLAS APIs. For an overview of the supported routines, see here.
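Because the API follows the Netlib BLAS conventions, a routine such as GEMM takes the familiar sizes (m, n, k), scalars (alpha, beta), and a leading dimension per matrix. As a rough illustration of what those arguments mean, the sketch below computes C = alpha * A * B + beta * C on the host for row-major matrices. It is plain C++ and does not call CLBlast: `ReferenceGemm` is a hypothetical helper introduced here for illustration only, and it uses host vectors where the library's routines take OpenCL device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, NOT part of CLBlast.
// Computes C = alpha * A * B + beta * C for row-major matrices:
//   A is m x k with leading dimension lda,
//   B is k x n with leading dimension ldb,
//   C is m x n with leading dimension ldc.
void ReferenceGemm(std::size_t m, std::size_t n, std::size_t k,
                   float alpha,
                   const std::vector<float>& a, std::size_t lda,
                   const std::vector<float>& b, std::size_t ldb,
                   float beta,
                   std::vector<float>& c, std::size_t ldc) {
  // In row-major storage the leading dimension is the row stride,
  // so it must be at least the number of columns of each matrix.
  assert(lda >= k && ldb >= n && ldc >= n);
  assert(a.size() >= m * lda && b.size() >= k * ldb && c.size() >= m * ldc);

  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      // Dot product of row i of A with column j of B.
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * lda + p] * b[p * ldb + j];
      }
      // Scale the product and blend it with the existing value of C.
      c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
    }
  }
}
```

A host reference like this can also be handy for checking the output of a device routine on small inputs. For the exact argument lists of the library's routines, the include files and API documentation remain the authority.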

To get started quickly, a couple of stand-alone example programs are included in the samples subfolder. They can optionally be compiled using the CMake infrastructure of CLBlast by providing the -DSAMPLES=ON flag, for example as follows:

cmake -DSAMPLES=ON ..

Afterwards, you can optionally read more about running proper benchmarks and tuning the library.

Full documentation

More detailed documentation is available in separate files:

Known issues

Known issues:

Contributing

Contributions are welcome, either as tuning results for previously untested OpenCL devices or as pull requests. See the contributing guidelines for more details.

The main contributing authors (code, pull requests, testing) can be found in the list of GitHub contributors.

Tuning and testing on a variety of OpenCL devices was made possible by:

Hardware/software for this project was contributed by:

More information

Further information on CLBlast is available through the following links:

How to cite this work:

Cedric Nugteren. CLBlast: A Tuned OpenCL BLAS Library. In IWOCL'18: International Workshop
on OpenCL. ACM, New York, NY, USA, 10 pages. 2018. https://doi.org/10.1145/3204919.3204924

Support us

Cedric Nugteren started this project in March 2015 as a free-time project on evenings and weekends, next to a full-time job. You can find contact information on the website of the main author.