dsharlet / slinky

Optimize pipelines for locality
MIT License

Slinky

This project aims to provide a lightweight runtime to semi-automatically optimize data flow pipelines for locality. Pipelines are specified as graphs of operators processing data between multi-dimensional buffers. Slinky then allows the user to describe transformations to the pipeline that improve memory locality and reduce memory usage by executing small crops of operator outputs.

Slinky is heavily inspired and motivated by Halide. It can be described by starting with Halide and changing a few key things, the most significant being that Slinky does not generate the inner loop code of the operations themselves.

Because Slinky is not responsible for generating the inner loop code the way Halide is, scheduling is a dramatically simpler problem. Without needing to worry about instruction selection, register pressure, and so on, the cost function for scheduling is a much more straightforward function of high-level memory access patterns.

The ultimate goal of Slinky is to make automatic scheduling of pipelines reliable and fast enough to implement a just-in-time optimization engine at runtime.

Graph description

Pipelines are described as graphs of operators called funcs, connected by buffer_exprs. A func has a list of input and output objects. A func can have multiple outputs, but all outputs must be indexed by one set of dimensions for the func. An input or output is associated with a buffer_expr. An output has a list of dimensions, which identify the variables (var) used to index the corresponding dimension of the buffer_expr. An input has a list of bounds expressions, expressed as an inclusive interval [min, max], where the bounds can depend on the variables from the output dimensions.

The actual implementation of a func is a callback taking a single argument eval_context. This object contains the state of the program at the time of the call. Values of any symbol currently in scope at the time of the call can be accessed in the eval_context.
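
For the typed func::make calls used in the examples below, the callback presumably receives its input and output buffers directly rather than looking them up through the eval_context. A rough sketch of what such a callback could look like follows; the buffer<T> wrapper, its dim(0).min()/max() accessors, and operator() element access are assumptions for illustration, not Slinky's verified API:

// Sketch only: buffer<T>, dim(0).min()/max(), and operator() element access
// are assumed here; the exact Slinky types and signatures may differ.
index_t multiply_2(const buffer<const int>& in, const buffer<int>& out) {
  for (index_t x = out.dim(0).min(); x <= out.dim(0).max(); ++x) {
    out(x) = in(x) * 2;  // produce only the range of the (possibly cropped) output
  }
  return 0;  // presumably 0 indicates success to the runtime
}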

Elementwise example

Here is an example of a simple pipeline of two 1D elementwise funcs:

node_context ctx;

auto in = buffer_expr::make(ctx, "in", sizeof(int), 1);
auto out = buffer_expr::make(ctx, "out", sizeof(int), 1);
auto intm = buffer_expr::make(ctx, "intm", sizeof(int), 1);

var x(ctx, "x");

func mul = func::make(multiply_2, {in, {point(x)}}, {intm, {x}});
func add = func::make(add_1, {intm, {point(x)}}, {out, {x}});

pipeline p = build_pipeline(ctx, {in}, {out});
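
Once built, the pipeline might be run roughly as follows. This is a sketch only: the buffer<int, 1> helper, its allocate() method, and the evaluate() overload taking lists of raw buffer pointers are assumptions about Slinky's API made for illustration.

// Sketch only: buffer<int, 1>, allocate(), and pipeline::evaluate() are
// assumed interfaces, used here for illustration.
buffer<int, 1> in_buf({100});
buffer<int, 1> out_buf({100});
in_buf.allocate();
out_buf.allocate();
for (int i = 0; i < 100; ++i) in_buf(i) = i;

// Inputs and outputs are passed in the same order as in build_pipeline above.
p.evaluate({&in_buf}, {&out_buf});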

The possible implementations of this pipeline vary between two extremes:

  1. Allocating intm to have the same size as out, and executing all of mul, followed by all of add.
  2. Allocating intm to have a single element, and executing mul followed by add in a single loop over the output elements.

Of course, (2) would have extremely high overhead and would not be a desirable strategy. If the buffers are large, (1) is inefficient due to poor memory locality. The ideal strategy is to split out into chunks and compute both operations on each chunk. This allows targeting a chunk size that fits in the cache, while still amortizing the overhead of setting up the buffers and calling the functions that implement the operations. This can be implemented with the following schedule:

const int chunk_size = 8;
add.loops({x, chunk_size});
mul.compute_at({&add, x});

In this case, the mul.compute_at specification is only for illustration purposes; it is equivalent to the default behavior, which is to compute each func at the innermost location that does not imply redundant compute.

This will result in a slinky program that looks like this:

intm = allocate(heap, 4, {
  {[buffer_min(out, 0), buffer_max(out, 0)], 4, 8}
}) {
 intm.uncropped = clone_buffer(intm) {
  serial loop(x in [buffer_min(out, 0), buffer_max(out, 0)], 8) {
   crop_dim(intm, 0, [x, min((x + 7), buffer_max(out, 0))]) {
    call(mul, {in}, {intm})
   }
   crop_dim(out, 0, [x, (x + 7)]) {
    call(add, {intm.uncropped}, {out})
   }
  }
 }
}

Observations:

  1. The loop over x advances in steps of 8 (the chunk size from the schedule), and the crop of intm clamps the last chunk to the output bounds via min((x + 7), buffer_max(out, 0)).
  2. Although intm's bounds span the whole output, its dimension has a fold factor of 8, so only 8 elements of storage are live at a time and the same allocation is reused by every chunk.
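
Ignoring Slinky's runtime entirely, the generated program above is equivalent to something like the following hand-written C++ (a sketch that assumes multiply_2 multiplies by 2 and add_1 adds 1, as their names suggest):

#include <algorithm>
#include <vector>

// One loop over chunks of 8 output elements, with an 8-element intermediate
// that is reused by every chunk, mirroring intm's fold factor of 8.
void mul_then_add(const std::vector<int>& in, std::vector<int>& out) {
  const int n = static_cast<int>(out.size());
  int intm[8];
  for (int x = 0; x < n; x += 8) {
    // Clamp the last chunk to the output bounds, like the min(x + 7, ...) crop.
    const int chunk = std::min(8, n - x);
    for (int i = 0; i < chunk; ++i) intm[i] = in[x + i] * 2;   // mul on this chunk
    for (int i = 0; i < chunk; ++i) out[x + i] = intm[i] + 1;  // add on this chunk
  }
}

The point of Slinky is that this chunked structure comes from the schedule, while the callbacks themselves stay simple operations over whatever crop they are handed.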

Stencil example

Here is an example of a pipeline that has a stage that is a stencil, such as a convolution:

node_context ctx;

auto in = buffer_expr::make(ctx, "in", sizeof(short), 2);
auto out = buffer_expr::make(ctx, "out", sizeof(short), 2);

auto intm = buffer_expr::make(ctx, "intm", sizeof(short), 2);

var x(ctx, "x");
var y(ctx, "y");

func add = func::make(add_1, {in, {point(x), point(y)}}, {intm, {x, y}});
func stencil =
    func::make(sum3x3, {intm, {{x - 1, x + 1}, {y - 1, y + 1}}}, {out, {x, y}});

pipeline p = build_pipeline(ctx, {in}, {out});

The typical way that many systems would execute such a pipeline is to run all of add_1, followed by all of sum3x3, storing the intermediate result in a buffer equal to the size of the input. Here is a visualization of this strategy.

An interesting way to implement this pipeline is to compute out one row at a time, keeping only the window of rows of intm required by the stencil in memory. This can be expressed with the following schedule:

stencil.loops({y});
add.compute_at({&stencil, y});

This means:

  1. Compute out in a loop over rows y.
  2. Compute add inside that loop, producing only the rows of intm needed by the stencil for the current row of out.

This generates a program like so:

intm = allocate<intm>({
  {[(buffer_min(out, 0) + -1), (buffer_max(out, 0) + 1)], 2},
  {[(buffer_min(out, 1) + -1), (buffer_max(out, 1) + 1)], ((buffer_extent(out, 0) * 2) + 4), 3}
} on heap) {
 loop(y in [(buffer_min(out, 1) + -2), buffer_max(out, 1)]) {
  crop_dim<1>(intm, [(y + 1), (y + 1)]) {
   call(add, {in}, {intm})
  }
  crop_dim<1>(out, [y, y]) {
   call(sum3x3, {intm}, {out})
  }
 }
}

This program does the following:

  1. Allocates intm with a fold factor of 3 in its second dimension, so only 3 rows of intermediate storage are live at any time.
  2. Runs a loop over y that starts two rows before the output, computing one new row of intm (row y + 1) per iteration.
  3. Computes row y of out from the three rows of intm currently held in the folded storage.

Here is a visualization of this strategy.
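
The fold factor of 3 on intm's second dimension amounts to a rolling line buffer. In plain C++, the same strategy looks roughly like the following sketch, which assumes the input is already padded by one pixel on every side, and that add_1 adds 1 while sum3x3 sums the 3x3 neighborhood:

#include <cassert>
#include <vector>

// Rolling 3-row line buffer version of the add_1 -> sum3x3 pipeline.
// `in` is (H + 2) x (W + 2), padded by one pixel on every side; `out` is H x W.
// Row y of the intermediate lives at intm[(y + 1) % 3], like a fold factor of 3.
void add_then_sum3x3(const std::vector<std::vector<short>>& in,
                     std::vector<std::vector<short>>& out) {
  const int H = static_cast<int>(out.size());
  const int W = static_cast<int>(out[0].size());
  assert(static_cast<int>(in.size()) == H + 2);

  // Only three rows of the intermediate are ever live.
  std::vector<std::vector<short>> intm(3, std::vector<short>(W + 2));

  // add_1 for one intermediate row y in [-1, H]; `in` indices are offset by the padding.
  auto produce_intm_row = [&](int y) {
    std::vector<short>& row = intm[(y + 1) % 3];
    for (int x = 0; x < W + 2; ++x) row[x] = static_cast<short>(in[y + 1][x] + 1);
  };

  // Warm up with the two rows above the first row of output.
  produce_intm_row(-1);
  produce_intm_row(0);

  for (int y = 0; y < H; ++y) {
    produce_intm_row(y + 1);  // one new row of intm per row of output
    for (int x = 0; x < W; ++x) {
      short sum = 0;
      for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
          sum = static_cast<short>(sum + intm[(y + dy + 1) % 3][x + dx + 1]);
        }
      }
      out[y][x] = sum;
    }
  }
}

Slinky gets the same effect from the folded allocation and the per-row crop, without the stencil callback needing to know about the folding at all.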

Matrix multiply example

Here is a more involved example, which computes the matrix product abc = (a x b) x c:

node_context ctx;

auto a = buffer_expr::make(ctx, "a", sizeof(float), 2);
auto b = buffer_expr::make(ctx, "b", sizeof(float), 2);
auto c = buffer_expr::make(ctx, "c", sizeof(float), 2);
auto abc = buffer_expr::make(ctx, "abc", sizeof(float), 2);

auto ab = buffer_expr::make(ctx, "ab", sizeof(float), 2);

var i(ctx, "i");
var j(ctx, "j");

// The bounds required of the dimensions consumed by the reduction depend on the size of the
// buffers passed in. Note that we haven't used any constants yet.
auto K_ab = a->dim(1).bounds;
auto K_abc = c->dim(0).bounds;

// We use float for this pipeline so we can test for correctness exactly.
func matmul_ab = func::make<const float, const float, float>(
    matmul<float>, {a, {point(i), K_ab}}, {b, {K_ab, point(j)}}, {ab, {i, j}});
func matmul_abc = func::make<const float, const float, float>(
    matmul<float>, {ab, {point(i), K_abc}}, {c, {K_abc, point(j)}}, {abc, {i, j}});

pipeline p = build_pipeline(ctx, {a, b, c}, {abc});

Much like the elementwise example, we can compute this in a variety of ways between two extremes:

  1. Allocating ab to have the full extent of the product a x b, and executing all of the first multiply followed by all of the second multiply.
  2. Each row of the second product depends only on the same row of the first product. Therefore, we can allocate ab to hold only one row of the product a x b, and compute both products in a loop over rows of the final result (sketched below).
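
For illustration, extreme (2) in plain C++ might look like the following sketch, using naive row-major matrix multiplies rather than Slinky's API:

#include <vector>

// abc = (a x b) x c computed one row at a time: only one row of ab is ever live.
// a is M x K, b is K x N, c is N x P, all row-major.
std::vector<std::vector<float>> matmul_chain(const std::vector<std::vector<float>>& a,
                                             const std::vector<std::vector<float>>& b,
                                             const std::vector<std::vector<float>>& c) {
  const size_t M = a.size(), K = b.size(), N = c.size(), P = c[0].size();
  std::vector<std::vector<float>> abc(M, std::vector<float>(P, 0.0f));
  std::vector<float> ab_row(N);  // one row of a x b

  for (size_t i = 0; i < M; ++i) {
    // Row i of ab = a x b.
    for (size_t j = 0; j < N; ++j) {
      float sum = 0.0f;
      for (size_t k = 0; k < K; ++k) sum += a[i][k] * b[k][j];
      ab_row[j] = sum;
    }
    // Row i of abc = ab x c, which only needs row i of ab.
    for (size_t j = 0; j < P; ++j) {
      float sum = 0.0f;
      for (size_t k = 0; k < N; ++k) sum += ab_row[k] * c[k][j];
      abc[i][j] = sum;
    }
  }
  return abc;
}

In Slinky's terms, this presumably corresponds to a schedule along the lines of matmul_abc.loops({i}) with matmul_ab.compute_at({&matmul_abc, i}), analogous to the elementwise example above.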

In practice, matrix multiplication kernels like to produce multiple rows at once for maximum efficiency.

Where this helps

We expect this approach to fill a gap between two extremes that seem prevalent today (TODO: is this really true? I think so...):

  1. Pipeline interpreters that execute entire operations one at a time.
  2. Pipeline compilers that generate code specific to a pipeline.

We expect Slinky to execute suitable pipelines using less memory than (1), but with lower performance than what is possible with (2). We emphasize possible because actually building a compiler that does this well on novel code is very difficult. We think Slinky's approach is a more tractable problem, and will degrade more gracefully in failure cases.

Data we have so far

This performance app attempts to measure the overhead of interpreting pipelines at runtime. The test copies "total size" bytes of 2D data twice: first to an intermediate buffer, and then to the output. The inner dimension of size "copy size" is copied with memcpy; the outer dimension is a loop implemented in one of two ways (both sketched in code below):

  1. An "explicit loop" version, which has a loop in the pipeline for the outer dimension (interpreted by Slinky).
  2. An "implicit loop" version, which loops over the outer dimension in the callback.

Two factors affect the performance of this pipeline:

  1. The overhead of Slinky interpreting the outer loop, paid on every iteration of the explicit loop version. This matters most when the copy size is small, so each iteration does very little work.
  2. Memory locality: the explicit loop version writes and reads back each "copy size" chunk of the intermediate while it is still hot in cache, while the implicit loop version writes the entire intermediate before reading any of it back.

This is an extreme example: memcpy is about the cheapest operation (per byte accessed) that a pipeline stage could perform. In other words, this is an upper bound on the overhead that could be expected for an operation touching the same amount of memory.

On my machine, here are some data points from this pipeline:

Total size: 32 KB

copy size (KB)  explicit loop (GB/s)  implicit loop (GB/s)  ratio
1               13.0713               19.7616               0.661
2               19.485                23.728                0.821
4               24.254                25.221                0.962
8               27.701                26.013                1.065
16              26.428                25.919                1.020
32              25.891                26.494                0.977

Total size: 128 KB

copy size (KB)  explicit loop (GB/s)  implicit loop (GB/s)  ratio
1               12.947                21.410                0.605
2               20.459                25.705                0.796
4               25.456                27.320                0.932
8               30.462                27.514                1.107
16              28.804                27.578                1.044
32              28.480                28.026                1.016

Total size: 512 KB

copy size (KB)  explicit loop (GB/s)  implicit loop (GB/s)  ratio
1               12.416                20.683                0.600
2               19.230                24.026                0.800
4               23.793                24.163                0.985
8               27.807                24.075                1.155
16              27.173                24.201                1.123
32              26.199                24.155                1.085

Total size: 2048 KB

copy size (KB)  explicit loop (GB/s)  implicit loop (GB/s)  ratio
1               12.229                20.616                0.593
2               19.833                24.447                0.811
4               24.303                24.761                0.982
8               28.563                24.262                1.177
16              27.951                24.104                1.160
32              26.826                24.217                1.108

Total size: 8192 KB

copy size (KB)  explicit loop (GB/s)  implicit loop (GB/s)  ratio
1               11.978                12.023                0.996
2               19.676                16.441                1.197
4               21.588                14.013                1.541
8               23.544                14.536                1.620
16              23.892                13.440                1.778
32              23.965                13.942                1.719

Observations

As we should expect, the observations vary depending on the total size of the copy:

  1. At small copy sizes (1-2 KB), the per-iteration overhead of interpreting the loop dominates, and the explicit loop version is up to roughly 40% slower at the smaller total sizes.
  2. At copy sizes of 8 KB and above, the explicit loop version roughly matches or beats the implicit loop version at every total size.
  3. The benefit of the explicit loop grows with total size: at 8192 KB, the implicit loop version's throughput drops sharply (the intermediate no longer stays in cache between being written and read back), and the explicit loop version is 1.5x-1.8x faster.