KhronosGroup / SYCL-Docs

SYCL Open Source Specification

`ND_range` where `ND` > 3 #139

Open TApplencourt opened 3 years ago

TApplencourt commented 3 years ago

Hi,

Graphics is 3D, but Science is ND! (one told me you should always start your issue with a catchy sentence...)

Two examples that I am aware of:

To make SYCL appealing to scientists, it would be nice if nd_range could be a "real" N-d range, supporting anywhere from 0 to N dimensions. From the spec's point of view, it should be "trivial" to generalize the linearization formula of https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:multi-dim-linearization, with something like:

\sum_{i=0}^{N-1} \mathit{id}_i \prod_{j=i+1}^{N-1} r_j
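To make the generalization concrete, here is a minimal host-side sketch (plain C++, no SYCL; the function name `linearize` is illustrative, not part of any API) of the row-major N-dimensional linearization the formula describes:

```cpp
#include <array>
#include <cstddef>

// Hypothetical N-dimensional generalization of the SYCL 2020 linearization:
// linear = sum over i of id[i] * (product of r[j] for j > i), row-major,
// matching the 3-D formula id0*r1*r2 + id1*r2 + id2 from the spec.
template <std::size_t N>
std::size_t linearize(const std::array<std::size_t, N>& id,
                      const std::array<std::size_t, N>& r) {
  std::size_t linear = 0;
  for (std::size_t i = 0; i < N; ++i) {
    std::size_t stride = 1;                      // product of trailing extents
    for (std::size_t j = i + 1; j < N; ++j) stride *= r[j];
    linear += id[i] * stride;
  }
  return linear;
}
```

For N = 3 this reduces exactly to the spec's formula, and it works unchanged for any N.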

In short, I would like to write something like:

   cgh.parallel_for(
       range<4>{total_sites, 4, 3, 3}, [=](item<4> item) {
           // ...
       });
(As said during the panel discussion, this feature request should be a by-product of extending the dimensionality of buffers to support more than 3 dimensions. I just put it here to keep a record. Also possibly related to #107.)

Cheers, Thomas

tomdeakin commented 3 years ago

Thanks for opening the issue @TApplencourt.

fraggamuffin commented 3 years ago

internal issue on multi dim array

TApplencourt commented 3 years ago

Some questions arose after today's Advisory Panel Quarterly Meeting. I'll put my comments here. They are in no way expert remarks, just my own thoughts. So please don't hesitate to disagree with me :)

  1. How should one map a range to the Backend API call?

The concern is that always linearizing the range may have a large performance impact.

This is an argument in favor of using the maximum backend dimensionality when "lowering" the SYCL range. For example, decomposing a sycl::range<4>(i,j,k,l) into work_dim=2, global_work_size = {i*j, k*l} should be more efficient than doing work_dim=1, global_work_size = {i*j*k*l}. (Obviously, if you don't want to spawn more threads than needed, this optimization is only available when the number of dimensions is a multiple of 2 or 3.)
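A small host-side sketch (plain C++, not an implementation; the type and function names are illustrative) of this lowering: the backend launches a 2-D grid of {i*j, k*l} work-items, and each work-item recovers its 4-D coordinates from the two backend indices with divide/modulo:

```cpp
#include <cstddef>

// Hypothetical 4-D range fused pairwise onto a 2-D backend launch:
// dimensions {r0, r1} fold into backend axis 0, {r2, r3} into axis 1.
struct Range4 { std::size_t r0, r1, r2, r3; };
struct Id4    { std::size_t i0, i1, i2, i3; };

// Given the backend's 2-D global ids (g0, g1), recover the 4-D index.
Id4 delinearize(const Range4& r, std::size_t g0, std::size_t g1) {
  return { g0 / r.r1, g0 % r.r1,   // split axis 0 back into (i0, i1)
           g1 / r.r3, g1 % r.r3 }; // split axis 1 back into (i2, i3)
}
```

The 1-D lowering would do the same with a single chain of divisions, at the cost of more integer arithmetic per work-item.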

I will also note that this discussion about mapping to 1-3 dimensional ranges is really GPU-centric. Other hardware (CPUs, FPGAs, and the rest) doesn't have the same capabilities (or limitations, depending on the point of view :) ).

In conclusion, I think this sycl::range -> backend mapping can be implemented efficiently enough. As with every performance problem, I personally think it's better to let implementers develop a good algorithm for their platform than to leave users with the burden of re-inventing the wheel (re-inventing it badly. I know it's bad because I'm a user...).

  2. Semantic difference between range and nd_range?

Because one cannot have a barrier over only one dimension of the work-group, this makes the implementation easier. nd_range<2>(range<2>(x,y), range<2>(i,j)) followed by a sycl::group_barrier(item.get_group()) is/should be/(may be?) semantically equivalent to nd_range<1>(range<1>(x*y), range<1>(i*j)) followed by a sycl::group_barrier(item.get_group()).

Member functions of sub-groups only support range<1> or id<1> return types, so I don't think the dimensionality of the nd_range will impact any of the semantics. But this is always more subtle than it looks (hi, memory coherency and forward progress guarantees), and I'm not an expert. I know @Pennycook has worked a lot on this topic, so maybe he has more insight.

Hope this helps,

PS: I just realized that one can grep GitHub ( https://grep.app/search?q=for%20collapse%5C%28%5Cd%5C%29&regexp=true ) to find examples of OpenMP collapse kernels. For example, Kokkos uses collapse(7) (who doesn't like a prime number?):

#pragma omp target teams distribute parallel for collapse(7) map(to : functor)
    for (ptrdiff_t i0 = begin_0; i0 < end_0; i0++) {
      for (ptrdiff_t i1 = begin_1; i1 < end_1; i1++) {
        for (ptrdiff_t i2 = begin_2; i2 < end_2; i2++) {
          for (ptrdiff_t i3 = begin_3; i3 < end_3; i3++) {
            for (ptrdiff_t i4 = begin_4; i4 < end_4; i4++) {
              for (ptrdiff_t i5 = begin_5; i5 < end_5; i5++) {
                for (ptrdiff_t i6 = begin_6; i6 < end_6; i6++) {
                  if constexpr (std::is_same<typename Policy::work_tag,
                                             void>::value)
                    functor(i0, i1, i2, i3, i4, i5, i6);
                  else
                    functor(typename Policy::work_tag(), i0, i1, i2, i3, i4, i5,
                            i6);
                }
              }
            }
          }
        }
      }
    }
Pennycook commented 3 years ago

Member functions of sub-groups only support range<1> or id<1> return types, so I don't think the dimensionality of the nd_range will impact any of the semantics.

But this is always more subtle than it looks (hi, memory coherency and forward progress guarantees), and I'm not an expert. I know @Pennycook has worked a lot on this topic, so maybe he has more insight.

There's definitely subtlety here, but I don't think it's much worse than with the existing 2- and 3-dimensional ranges.

The relevant issue here is how a given work-group is divided into sub-groups, and the SYCL specification allows for many different implementations. A work-group of size {8, 4} executed on an NVIDIA GPU will (probably) map to a single warp, because CUDA wraps around when dividing CUDA blocks into warps. The same work-group executed on a CPU will (probably) map to a nested loop where only the inner-most loop is vectorized, i.e.:

for (int i = 0; i < 8; ++i) { // sequential loop within a hardware thread
  for (int j = 0; j < 4; ++j) { // vectorized loop
    // execute user's kernel functor
  }
}

Although the reality is much more complicated than this (due to things like how barriers are implemented), this isn't a terrible mental model.

Understanding how the work-group gets broken into sub-groups is important when calling things like sub-group collective functions (e.g. reductions) as well as for the sorts of complex forward progress guarantee issues you mentioned -- if you don't know which work-items are doing what work, you can't rely on something like a reduction to give you the result that you expect.

The best way to write a portable kernel using sub-groups today is to make sure that the inner-most dimension is a multiple of the sub-group size. Then most of the problems go away -- although nothing is guaranteed by the standard, all implementations I'm aware of will handle this case in the same way. I think that advice would extend to these higher-dimensionality kernels, but it's worth noting that it might place limits on the work-group sizes you could realistically use: for example, if your maximum work-group size were 256 and you needed your inner-most dimension to be divisible by 32, a 7-dimensional work-group is probably going to end up with a size like {1, 1, 1, 2, 2, 2, 32}.
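To make that constraint concrete, here is a tiny host-side sketch (plain C++; the helper name and the hard-coded limits are illustrative assumptions, not anything queried from a real device) that checks whether a candidate 7-D work-group shape is usable under this advice:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <numeric>

// Check the two constraints discussed above: the total work-group size
// (product of all 7 extents) must not exceed the device maximum, and the
// inner-most extent must be a multiple of the assumed sub-group size.
bool feasible(const std::array<std::size_t, 7>& wg,
              std::size_t max_wg_size, std::size_t sub_group_size) {
  std::size_t total = std::accumulate(wg.begin(), wg.end(), std::size_t{1},
                                      std::multiplies<>{});
  return total <= max_wg_size && wg.back() % sub_group_size == 0;
}
```

With max_wg_size = 256 and sub_group_size = 32, the shape {1, 1, 1, 2, 2, 2, 32} from the example passes, while making the inner-most dimension smaller (or the overall product larger) fails.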

fraggamuffin commented 2 years ago

move to PR by TD