Welcome to the OpenCilk project!
OpenCilk is a state-of-the-art open-source implementation of the Cilk task-parallel programming platform. OpenCilk supports writing fast parallel programs using the Cilk task-parallel language extensions to C/C++. In addition, OpenCilk provides a platform to develop compilers, runtime systems, and program-analysis tools for task-parallel code.
This repository contains the source code for the OpenCilk compiler, which is based on the LLVM compiler infrastructure and implements the latest official version of Tapir, a compiler intermediate representation (IR) for task parallelism. By using Tapir, the OpenCilk compiler is able to optimize task-parallel programs more effectively than mainstream compilers. OpenCilk also contains an efficient parallel runtime library, that automatically schedules and load-balances the Cilk computation, and a suite of tools, for Cilk programs, including a race detector and a scalability analyzer.
This README provides a brief, noncomprehensive overview of how to get and use OpenCilk. This overview aims to help you get started using OpenCilk to write fast parallel programs in Cilk. For more information about OpenCilk, including installation guides, user guides, tutorials, and references, please see the OpenCilk website.
Supported systems: OpenCilk has been tested on a variety of modern x86-64 and 64-bit ARM processors, on recent versions of macOS and FreeBSD, and on a variety of modern Linux distributions.
Releases of OpenCilk are available from the OpenCilk releases page on GitHub. Precompiled builds of OpenCilk for some releases and target systems can be found on the same page. Instructions to install OpenCilk from a precompiled binary can be found on the install page on the OpenCilk website.
The scripts in the
OpenCilk infrastructure
repository make it easy to build a particular release of OpenCilk from
source. For example, the following steps will download the
OpenCilk release tagged <release_tag>
into the opencilk
subdirectory in the current working directory and then build OpenCilk
into the build
subdirectory of the current working directory:
git clone https://github.com/OpenCilk/infrastructure
infrastructure/tools/get -t <release_tag> $(pwd)/opencilk
infrastructure/tools/build $(pwd)/opencilk $(pwd)/build
For more instructions on building OpenCilk from source, see the Build OpenCilk from source guide.
To use OpenCilk to build and run Cilk programs, include the header file
cilk/cilk.h
in your program's source code, and then compile and link
your program as you would an ordinary C/C++ program using OpenCilk's
clang
or clang++
binary and the additional -fopencilk
flag.
For example, on Linux, the following command will build an optimized Cilk
executable fib
from fib.c
using OpenCilk, assuming that OpenCilk is
installed at /opt/opencilk-2
:
/opt/opencilk-2/bin/clang fib.c -o fib -O3 -fopencilk
On macOS, you will need XCode or Command Line Tools installed, to provide
the necessary system headers and libraries, and to preface your compile
and link commands with xcrun
:
xcrun /opt/opencilk-2/bin/clang fib.c -o fib -O3 -fopencilk
To run your Cilk program, simply run the resulting executable. For example:
./fib 40
You can specify the number of Cilk workers to use by setting the
CILK_NWORKERS
environment variable. For example, the following
command will run fib
using 4 Cilk worker threads:
CILK_NWORKERS=4 ./fib 40
Cilk extends C and C++ with a few keywords to expose logical parallelism in a program. These keywords create parallel subcomputations, or tasks, that are allowed to be scheduled and run simultaneously. OpenCilk's runtime system automatically schedules and load-balances the parallel tasks onto parallel processor cores in a shared-memory multicore using randomized work stealing.
The two most primitive Cilk keywords are cilk_spawn
and cilk_scope
.
A cilk_spawn
can be inserted before a function call to allow that call
to execute in parallel with its continuation. A cilk_scope
defines a
lexical scope in which all spawned subcomputations must complete before
program execution leaves the scope. Cilk supports recursive
spawning of tasks, in which a task may itself spawn and synchronize
subtasks.
For example, the following Cilk program shows how one can
parallelize the simple exponential-time algorithm to compute the nth
Fibonacci number using cilk_spawn
and cilk_scope
.
#include <cilk/cilk.h>
int fib(int n) {
if (n < 2)
return n;
int x, y;
cilk_scope {
x = cilk_spawn fib(n-1);
y = fib(n-2);
}
return x+y;
}
The return value of a cilk_spawn
is simply the return value of the
spawned function. Accessing the return value of a spawned function
before synchronizing that spawn results in a race.
One can also spawn functions that do not return a value, as in the following example:
#include <algorithm>
#include <cilk/cilk.h>
constexpr std::ptrdiff_t BASE_CASE_LENGTH = 32;
template <typename T> void sample_qsort(T* begin, T* end) {
if (end - begin < BASE_CASE_LENGTH) {
std::sort(begin, end); // Base case: Serial sort
} else {
--end; // Exclude last element (pivot) from partition
T* middle = std::partition(begin, end, [pivot=*end](T a) { return a < pivot; });
std::swap(*end, *middle); // Move pivot to middle
cilk_scope {
cilk_spawn sample_qsort(begin, middle);
sample_qsort(++middle, ++end); // Exclude pivot and restore end
}
}
}
[!NOTE] OpenCilk also continues to support the
cilk_sync
statement from previous versions of Cilk for synchronizing spawned tasks in a function without encapsulating those tasks in a lexical scope.[!NOTE] The OpenCilk runtime system assumes that all spawned children of any function are synchronized before the function returns. The
-fopencilk
flag ensures an implicit synchronization at the end of every function of that function's spawned children.
The cilk_for
keyword can be used to define a parallel loop, in which all
iterations of the loop are allowed to execute simultaneously. In Cilk,
cilk_for
loops are safe and efficient to nest, as the following example
shows:
#include <cilk/cilk.h>
void square_matmul(double *C, const double *A, const double *B, size_t n) {
cilk_for (size_t i = 0; i < n; ++i) {
cilk_for (size_t j = 0; j < n; ++j) {
C[i * n + j] = 0.0;
for (size_t k = 0; k < n; ++k) {
C[i * n + j] += A[i * n + k] * B[k * n + j];
}
}
}
}
Internally, the OpenCilk runtime system implements cilk_for
using
cilk_spawn
and cilk_scope
to spawn the cilk_for
loop iterations
efficiently using a parallel divide-and-conquer algorithm. This efficient
implementation of cilk_for
requires -O1
-level compiler optimizations or
higher.
The semantics of a Cilk program can often be understood based on its
serial projection, which is the serial program derived by transforming
the Cilk code to replace all of Cilk's task-parallel language constructs with
serial equivalents. Roughly speaking, one can derive the serial projection
of a Cilk program by replacing all cilk_for
s with ordinary for
s and
removing all other Cilk language constructs. The serial projection
corresponds with the execution of a Cilk program on a single worker, that is,
with CILK_NWORKERS=1
. If a Cilk program is deterministic, then all
parallel executions of a Cilk program have the same behavior as its serial
projection.
OpenCilk provides two Cilk-specific tools to check and analyze Cilk programs. The Cilksan race detector checks Cilk programs dynamically for determinacy races. The Cilkscale scalability analyzer measures a Cilk program's parallel scalability.
In addition, OpenCilk integrates standard tools packaged with LLVM for
analyzing C/C++ programs, including
Google's Sanitizers. You can use
those tools with Cilk programs in the same way that you use them for regular
C/C++ programs. For example, to check your Cilk program for memory errors
using AddressSanitizer, compile and link your Cilk program with
the additional -fsanitize=address
and then run it normally.
For a given Cilk program and input, Cilksan is guaranteed to either detect a determinacy race, if one exists, or certify that the program is determinacy-race free. Cilksan is therefore useful for debugging and regression-testing race bugs in Cilk programs.
For each race that Cilksan detects, it will produce a race report that includes the memory address being raced on and the call stacks of the two instructions involved in the race. Cilksan will avoid reporting races where both racing instructions are atomic operations or protected by a common lock.
To use Cilksan, compile and link the Cilk program with the additional
flag -fsanitize=cilk
, and then run it normally. It is also recommended
that you compile the Cilk program with debug symbols, by adding the -g
flag,
to improve the readability of any race reports.
As an example, here is a Cilksan race report from building and running the
nqueens
program in the
OpenCilk tutorial with Cilksan:
Race detected on location 1112ffd41
* Read 100ffeb84 nqueens nqueens.c:64:3
| `-to variable a (declared at nqueens.c:50)
+ Call 100fffb80 nqueens nqueens.c:70:31
+ Spawn 100ffec8c nqueens nqueens.c:70:31
|* Write 100ffed14 nqueens nqueens.c:68:12
|| `-to variable a (declared at nqueens.c:33)
\| Common calling context
+ Call 100fffb80 nqueens nqueens.c:70:31
+ Spawn 100ffec8c nqueens nqueens.c:70:31
+ Call 100fff428 main nqueens.c:103:9
Allocation context
Stack object a (declared at nqueens.c:33)
Alloc 100ffeb60 in nqueens nqueens.c:63:16
Call 100fffb80 nqueens nqueens.c:70:31
Spawn 100ffec8c nqueens nqueens.c:70:31
Call 100fff428 main nqueens.c:103:9
[!NOTE] Cilksan is compatible with compiler optimizations. Be advised, however, that compiler optimizations can affect debug symbols, which can in turn affect Cilksan's race reports.
In addition, the OpenCilk compiler can choose to optimize some parallel computation by serializing it, which may eliminate races in the original program. The OpenCilk compiler is not allowed to introduce new determinacy races into a program through optimizations.
The Cilkscale scalability analyzer measures the parallel scalability of a Cilk program. Cilkscale measures the parallel performance of a Cilk program in terms of work --- total computation --- and span --- length of a longest path of dependencies. Cilkscale uses these measures to evaluate the program's parallelism, which bounds the maximum possible parallel speedup the program can achieve on any number of parallel processors. Cilkscale also produces "burdened" span and parallelism measurements, which estimate the performance impact of scheduling overhead.
To use Cilkscale, compile and link the Cilk program with the additional
flag -fcilktool=cilkscale
, and then run the program normally.
By default, Cilkscale reports these measurements in CSV format. Here is an example of Cilkscale's output.
tag,work (seconds),span (seconds),parallelism,burdened_span (seconds),burdened_parallelism
,2.07768,0.195024,10.6535,0.195386,10.6337
You can redirect Cilkscale's output to a file by setting the
CILKSCALE_OUT
environment variable to that filename.
By default, Cilkscale measures the whole program execution. Cilkscale also
provides a library API, similar to clock_gettime()
, to measure specific
regions of the program. To measure a particular region in a Cilk program:
cilk/cilkscale.h
, in the source
program.wsp_getworkspan()
probe function around the region
of interest. For instance:
wsp_t start = wsp_getworkspan();
// Region to measure
wsp_t end = wsp_getworkspan();
wsp_sub()
method (or using the -
operator on the wsp_t
type in C++)
and the wsp_dump()
method. For example:
wsp_t elapsed = wsp_sub(end, start);
wsp_dump(dump, "my region tag");
The wsp_dump()
function will add a line to the CSV output for the measured
region, tagged with the tag string passed to wsp_dump()
. For example:
tag,work (seconds),span (seconds),parallelism,burdened_span (seconds),burdened_parallelism
my region tag,1.94387,0.0868964,22.37,0.0871339,22.309
,2.05014,0.19316,10.6137,0.193398,10.6006
Cilkscale can also be used to automatically benchmark a Cilk program on a range of processor counts and plot those performance results. For more information on Cilkscale's automatic benchmarking facility, see the Cilkscale user guide.
OpenCilk supports several advanced parallel-programming features, including reducer hyperobjects and deterministic parallel random-number generation.
OpenCilk supports reducer hyperobjects (or reducers for short) to coordinate parallel modifications to shared variables.
Reducers provide a flexible parallel reduction mechanism. When a Cilk program runs, the OpenCilk runtime system automatically creates new views of a reducer, each initialized to an identity value, and applies parallel modifications to the reducer to these independent views. As parallel subcomputations complete, the runtime system automatically combines these views in parallel using a binary reduction operator.
A Cilk reducer produces a deterministic result, regardless of how the program is scheduled at runtime, as long as its identity and reduction operator define a monoid. In particular, an associative reduction is all that's needed to obtain a deterministic result; the reduction need not be commutative.
With OpenCilk, you can define a variable to be a reducer by adding the
keyword cilk_reducer(I,R)
to its type, where I
identifies a function
that sets the identity value, and R
defines the binary reduction. For
example, the following code defines the sum
variable to be a reducer
by adding cilk_reducer(zero_i, plus_i)
to its type:
#include <cilk/cilk.h>
void zero_i(void *v) { *(int *)v = 0; }
void plus_i(void *l, void *r) { *(int *)l += *(int *)r; }
int sum_array(int *array, size_t n) {
int cilk_reducer(zero_i, plus_i) sum = 0;
cilk_for (size_t i = 0; i < n; ++i)
sum += array[i];
return sum;
}
In this example, the function zero_i
sets the identity value to be the
integer 0
, and plus_i
defines a binary reduction of adding two
integers.
OpenCilk supports deterministic parallel (pseudo)random-number generation. A deterministic parallel random-number generator (DPRNG) produces repeatable results across multiple executions of a Cilk program on a given input, regardless of parallel scheduling.
OpenCilk provides optimized support for a fast DPRNG. This fast DPRNG implements the DotMix algorithm, which produces 2-independent pseudorandom numbers. This fast DPRNG provides two functions:
__cilkrts_dprand_set_seed()
function seeds the DPRNG using a given
64-bit integer seed.__cilkrts_get_dprand()
function, which returns a 64-bit
pseudorandom value on each call.To use this fast DPRNG, include the cilk/cilk_api.h
header file and
link the program -lopencilk-pedigrees
.
For example, the following Cilk program uses this fast DPRNG to implement a parallel Monte Carlo algorithm for estimating pi:
#include <cstdint>
#include <limits>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
template <typename T> void zero(void *v) {
*static_cast<T *>(v) = static_cast<T>(0);
}
template <typename T> void plus(void *l, void *r) {
*static_cast<T *>(l) += *static_cast<T *>(r);
}
double estimatePi(int64_t n) {
int64_t cilk_reducer(zero<int64_t>, plus<int64_t>) inside = 0;
cilk_for (int64_t i = 0; i < n; ++i) {
const double maxValue = static_cast<double>(std::numeric_limits<uint64_t>::max());
// Get two samples
uint64_t xSample = __cilkrts_get_dprand();
uint64_t ySample = __cilkrts_get_dprand();
double x = static_cast<double>(xSample) / maxValue;
double y = static_cast<double>(ySample) / maxValue;
double m = (x * x) + (y * y);
// Check if sample is inside of the circle
if (m <= 1)
++inside;
}
return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
OpenCilk also supports the
pedigree runtime mechanism
for user-defined DPRNGs, using the same cilk/cilk_api.h
header and
-lopencilk-pedigrees
library. At any point in a Cilk program, the
__cilkrts_get_pedigree()
function returns the current pedigree in the
form of a singly linked list of __cilkrts_pedigree
nodes.
The OpenCilk system has three core components: a compiler, a runtime-system library, and a suite of Cilk tools.
The OpenCilk compiler (this repository) is based on the LLVM compiler infrastructure. The OpenCilk compiler extends LLVM with Tapir, a compiler IR for task parallelism that enables effective compiler analysis and optimization of task-parallel programs. Tapir provides a generic representation of task-parallel control flow that is independent of the Cilk language and the runtime implementation.
The OpenCilk runtime library is based on the Cheetah runtime system. This runtime system schedules and load-balances the Cilk computation using an efficient randomized work-stealing scheduler. The scheduler offers a mathematical guarantee to schedule efficiently on the available parallel processors on a shared-memory multicore. Furthermore, the OpenCilk runtime system ensures that this theoretical efficiency is borne out in practice.
The OpenCilk tool suite includes two tools for analyzing Cilk programs. The Cilksan race detector implements an extension of the SP-bags algorithm to check a Cilk program's execution on a given input for determinacy races. The Cilkscale scalability analyzer implements a parallel version of the Cilkview algorithm to analyze the parallel scalability of a Cilk program.
Although all OpenCilk components are integrated with each other, OpenCilk's system architecture aims to make it easy to modify and extend individual components. OpenCilk's tools use compiler-inserted instrumentation hooks that instrument LLVM's IR and Tapir instructions. Furthermore, the OpenCilk compiler implements a general Tapir-lowering infrastructure that makes use of LLVM bitcode — a binary representation of LLVM IR — to make it easy to compile Cilk programs to use different parallel runtime systems. For more information, see the OpenCilk paper.
For the OpenCilk system as a whole, cite the OpenCilk conference paper at ACM PPoPP 2023:
Tao B. Schardl and I-Ting Angelina Lee. 2023. OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '23). 189–203. https://doi.org/10.1145/3572848.3577509
BibTeX:
@inproceedings{SchardlLe23,
author = {Schardl, Tao B. and Lee, I-Ting Angelina},
title = {OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code},
year = {2023},
isbn = {9798400700156},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3572848.3577509},
doi = {10.1145/3572848.3577509},
booktitle = {Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming},
pages = {189–-203},
numpages = {15},
keywords = {bitcode, parallel runtime system, cilk, productivity tools, compiler-inserted instrumentation, tapir, compiling, task parallelism, fork-join parallelism, OpenCilk, oneTBB, OpenMP, parallel computing},
location = {Montreal, QC, Canada},
series = {PPoPP '23}
}
For the Tapir compiler IR, cite either the Tapir conference paper at ACM PPoPP 2017 conference paper or the Tapir journal paper in ACM TOPC 2019.
Tapir conference paper, ACM PPoPP 2017:
Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2017. Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). 249–265. https://doi.org/10.1145/3018743.3018758
BibTeX:
@inproceedings{SchardlMoLe17,
author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.},
title = {Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation},
year = {2017},
isbn = {9781450344937},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3018743.3018758},
doi = {10.1145/3018743.3018758},
booktitle = {Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {249–-265},
numpages = {17},
keywords = {control-flow graph, multicore, tapir, openmp, fork-join parallelism, cilk, optimization, serial semantics, llvm, par- allel computing, compiling},
location = {Austin, Texas, USA},
series = {PPoPP '17}
}
Journal article about Tapir, ACM TOPC 2019:
Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2019. Tapir: Embedding Recursive Fork-join Parallelism into LLVM’s Intermediate Representation. ACM Transactions on Parallel Computing 6, 4, Article 19 (December 2019), 33 pages. https://doi.org/10.1145/3365655
BibTeX:
@article{SchardlMoLe19,
author = {Schardl, Tao B. and Moses, William S. and Leiserson, Charles E.},
title = {Tapir: Embedding Recursive Fork-Join Parallelism into LLVM’s Intermediate Representation},
year = {2019},
issue_date = {December 2019},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {6},
number = {4},
issn = {2329-4949},
url = {https://doi.org/10.1145/3365655},
doi = {10.1145/3365655},
journal = {ACM Transactions on Parallel Computing},
month = {dec},
articleno = {19},
numpages = {33},
keywords = {compiling, fork-join parallelism, Tapir, control-flow graph, optimization, parallel computing, OpenMP, multicore, Cilk, serial-projection property, LLVM}
}
Found a bug in OpenCilk? Please report it on the issue tracker.
Have a question or comment? Start a thread on the Discussions page or send us an email at contact@opencilk.org.
Want to contribute to the OpenCilk project? We welcome your contributions! Check out the contribute page on the OpenCilk website for more information.
OpenCilk is supported in part by the National Science Foundation, under grant number CCRI-1925609, and in part by the USAF-MIT AI Accelerator, which is sponsored by the United States Air Force Research Laboratory under Cooperative Agreement Number FA8750-19-2-1000.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and should not be interpreted as representing the official policies or views, either expressed or implied, of the United states Air Force, the U.S. Government, or the National Science Foundation. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.