OpenCilk / opencilk-project

Monorepo for the OpenCilk compiler. Forked from llvm/llvm-project and based on Tapir/LLVM.

Runtime grain size support #54

Closed · qtt2 closed this issue 2 years ago

qtt2 commented 3 years ago

Hi,

I'm trying to tune the grain size in my application that uses cilk_for. Is there a reason why the grain size in #pragma cilk grainsize(grain_size) needs to be a constant expression? As far as I know, the previous Cilk Plus implementation by Intel allowed the expression to be evaluated at runtime.

Is there a way to set the grain size at runtime? Thank you very much!

Tuan

neboat commented 3 years ago

Hey Tuan,

Essentially, OpenCilk's grainsize pragma requires constant grainsizes because this restriction simplifies the OpenCilk design and implementation while simultaneously accommodating and encouraging reasonable uses of the grainsize pragma, specifically, to reduce scheduling overheads.

Here's some more context behind the grainsize pragma.

Cilk programs achieve good parallel speedup when they expose ample parallelism to the runtime system. (Typically, one wants the program's parallelism to exceed the number of Cilk workers by roughly a factor of 10.) Intuitively, that means that Cilk programs should aim to expose as much parallel work as possible.

Of course, there are practical reasons why one doesn't always want to expose all of the logical parallelism in a program. In the case of a cilk_for loop, there is a nonnegligible cost to spawning a loop iteration, which means cilk_for loops could incur high scheduling overheads if they always spawned off all loop iterations. (In addition, spawning every loop iteration can inhibit compiler optimizations, such as vectorization.) This is where the grainsize pragma comes in.

To amortize the constant overhead of spawning a parallel loop iteration, the grainsize pragma allows the programmer to easily specify a number of iterations that will execute serially per spawned loop iteration. As a result, the overhead of spawning a parallel-loop iteration can be amortized against the time to execute multiple iterations of the serial loop. Because the overhead of spawning a loop iteration is constant, for any given loop, a constant-value grainsize is sufficient to amortize that cost.

(Aside: The OpenCilk compiler and runtime will both try to determine an appropriate grainsize for any cilk_for loop that doesn't have a user-defined grainsize specified by the pragma. This mechanism works well for many loops, but because it can't always handle all loops perfectly, we allow users to specify a grainsize when needed.)

One not-so-great use of the grainsize pragma is to specify a grainsize of n/P, where n is the number of loop iterations and P is the number of Cilk workers. Although many programmers will naturally think to use such a grainsize, this grainsize tends not to work well in the Cilk model. Intuitively, by using this grainsize, the parallelism of the resulting loop becomes P, the number of workers. In contrast, the Cilk runtime works best when the parallelism significantly exceeds the number of workers, e.g., by a factor of 10.

All that being said, suppose you need to implement a cilk_for loop with a nonconstant grainsize for some reason.

  1. We'd like to hear about your use case!
  2. It's worth noting that implementing a particular grainsize for a cilk_for loop is equivalent to stripmining the loop. For example, consider the following cilk_for loop:

    cilk_for (int i = 0; i < n; ++i) {
      loop_body(i);
    }

    To stripmine the loop to use an arbitrary grainsize G, one can manually rewrite the loop as follows:

    #pragma cilk grainsize 1
    cilk_for (int ii = 0; ii < n; ii += G) {
      for (int i = ii; i < min(ii + G, n); ++i) {
        loop_body(i);
      }
    }

    (Note: There are a couple of different ways to write this stripmined loop, and those different implementations can have different performance. In this example, I'm using a compact implementation, which you will hopefully find easier to work with.)

Hope that helps. Let me know if you have more questions.

Cheers, TB

qtt2 commented 3 years ago

Thanks for your detailed answer!

I'm trying to explore OpenCilk to manage parallel tasks in a heterogeneous multi-core system (e.g., big.LITTLE). Since not all cores perform at the same pace, a grain size picked by the runtime may not work best for that kind of system. As a first step, I'm sweeping through multiple grain sizes to see which one yields the best balance between parallelism and runtime overhead.

Can you elaborate a bit more about how OpenCilk runtime picks the grain size? Is it always N/(KP) where K is a constant from empirical experiments? Does the runtime consider asymmetric systems?

I will try your suggestion of strip-mining the loop. Thanks!

cleiserson commented 3 years ago

TB,

That could be a blog post when we’re set up!

Cheers, Charles


neboat commented 3 years ago

Hey Tuan,

For a cilk_for loop of n iterations with no specified grainsize, the OpenCilk runtime system uses a grainsize of min{2048, n/8P}, where P is the number of Cilk workers. Therefore, this grainsize is equal to n/8P for small loops (that is, when n is small) and 2048 for large loops.

Before this runtime-grainsize computation kicks in, the OpenCilk compiler will statically analyze cilk_for loops and try to determine an appropriate constant grainsize for each loop. Essentially, for each cilk_for loop, the OpenCilk compiler will attempt to estimate the work in the body of the loop and compare that estimate to its estimate of the cost to spawn a parallel-loop iteration. Based on these estimates, the compiler will choose a power-of-2 grainsize for the loop that is at most 2048. If the compiler succeeds in this analysis, then it will stripmine the loop using the constant grainsize it has chosen. The compiler might fail in its analysis, however, if there is something in the loop body it can't analyze, such as an opaque function call. In such a case, the compiler allows the runtime system to choose a grainsize for the loop using the aforementioned formula.

Of course, if the user specifies a grainsize for the loop, then the OpenCilk compiler and runtime will use that user-specified grainsize instead.

Hope that clarifies things. Let me know if you have more questions.

Cheers, TB