KhronosGroup / SYCL-Docs

SYCL Open Source Specification

`ND_range` where `ND` > 3 #139

Open TApplencourt opened 3 years ago

TApplencourt commented 3 years ago

Hi,

Graphics is 3D, but Science is ND! (one told me you should always start your issue with a catchy sentence...)

Two examples that I am aware of:

To make SYCL appealing to scientists, it would be nice if nd_range could be a "real" N-d range, supporting anywhere from 0 to N dimensions. From the spec's point of view, it should be "trivial" to generalize the linearization formula of https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:multi-dim-linearization, with something like:

\sum_{i=0}^{N-1} \mathit{id}_i \prod_{j=i+1}^{N-1} r_j
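To make the generalization concrete, here is a minimal host-side sketch (plain C++, no SYCL; the function name `linearize` is illustrative, not part of any API) of the row-major N-dimensional linearization the formula describes:

```cpp
#include <array>
#include <cstddef>

// Hypothetical N-dimensional generalization of the SYCL 2020 linearization:
// linear = sum over i of id[i] * (product of r[j] for j > i), row-major,
// matching the 3-D formula id0*r1*r2 + id1*r2 + id2 from the spec.
template <std::size_t N>
std::size_t linearize(const std::array<std::size_t, N>& id,
                      const std::array<std::size_t, N>& r) {
  std::size_t linear = 0;
  for (std::size_t i = 0; i < N; ++i) {
    std::size_t stride = 1;                      // product of trailing extents
    for (std::size_t j = i + 1; j < N; ++j) stride *= r[j];
    linear += id[i] * stride;
  }
  return linear;
}
```

For N = 3 this reduces exactly to the spec's formula, and it works unchanged for any N.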

In short, I would like to write something like:

   cgh.parallel_for(
       range<4>{total_sites, 4, 3, 3}, [=](item<4> item) {
           // ...
       });
(As said during the panel discussion, this feature request should be a by-product of extending the dimensionality of buffers to support more than 3 dimensions. I just put it here to keep a record. Also possibly related to #107.)

Cheers, Thomas

tomdeakin commented 3 years ago

Thanks for opening the issue @TApplencourt.

fraggamuffin commented 3 years ago

internal issue on multi dim array

TApplencourt commented 3 years ago

Some questions arose after today's Advisory Panel Quarterly Meeting. I'll put my comments here. They are in no way expert remarks, just my own thoughts. So please don't hesitate to disagree with me :)

  1. How should one map a range to the Backend API call?

The concern is that always linearizing the range may have a large performance impact.

This is an argument in favor of using the maximum backend dimensionality when "lowering" the SYCL range. For example, decomposing a sycl::range<4>(i,j,k,l) into work_dim=2, global_work_size = {i*j, k*l} should be more efficient than doing work_dim=1, global_work_size = {i*j*k*l}. (Obviously, if you don't want to spawn more threads than needed, this optimization is only available when the number of dimensions is a multiple of 2 or 3.)
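A small host-side sketch (plain C++, not an implementation; the type and function names are illustrative) of this lowering: the backend launches a 2-D grid of {i*j, k*l} work-items, and each work-item recovers its 4-D coordinates from the two backend indices with divide/modulo:

```cpp
#include <cstddef>

// Hypothetical 4-D range fused pairwise onto a 2-D backend launch:
// dimensions {r0, r1} fold into backend axis 0, {r2, r3} into axis 1.
struct Range4 { std::size_t r0, r1, r2, r3; };
struct Id4    { std::size_t i0, i1, i2, i3; };

// Given the backend's 2-D global ids (g0, g1), recover the 4-D index.
Id4 delinearize(const Range4& r, std::size_t g0, std::size_t g1) {
  return { g0 / r.r1, g0 % r.r1,   // split axis 0 back into (i0, i1)
           g1 / r.r3, g1 % r.r3 }; // split axis 1 back into (i2, i3)
}
```

The 1-D lowering would do the same with a single chain of divisions, at the cost of more integer arithmetic per work-item.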

I will also note that this discussion about mapping to 1-3 dimensional ranges is really GPU-centric. Other hardware (CPUs, FPGAs, and the rest) doesn't have the same capabilities (or limitations, depending on the point of view :) ).

In conclusion, I think this sycl::range -> backend mapping can be implemented efficiently enough. As with every performance problem, I personally think it's better to let implementers develop a good algorithm for their platform than to leave users with the burden of re-inventing the wheel (re-inventing it badly. I know it's bad because I'm a user...).

  2. Semantic difference between range and nd_range?

Because one cannot have a barrier over only one dimension of the work-group, this makes the implementation easier. nd_range<2>(range<2>(x,y), range<2>(i,j)) followed by a sycl::group_barrier(item.get_group()) is/should be/(may be?) semantically equivalent to nd_range<1>(range<1>(x*y), range<1>(i*j)) followed by a sycl::group_barrier(item.get_group()).

Member functions of sub-groups only support range<1> or id<1> return types, so I don't think the dimensionality of the nd_range will impact any of the semantics. But this is always more subtle than it looks (hi, memory coherency and forward progress guarantees), and I'm not an expert. I know @Pennycook has worked a lot on this topic, so maybe he has more insight.

Hope this helps,

PS: I just realized that one can grep GitHub ( https://grep.app/search?q=for%20collapse%5C%28%5Cd%5C%29&regexp=true ) to find examples of OpenMP collapse kernels. For example, Kokkos uses collapse(7) (who doesn't like a prime number?):

#pragma omp target teams distribute parallel for collapse(7) map(to : functor)
    for (ptrdiff_t i0 = begin_0; i0 < end_0; i0++) {
      for (ptrdiff_t i1 = begin_1; i1 < end_1; i1++) {
        for (ptrdiff_t i2 = begin_2; i2 < end_2; i2++) {
          for (ptrdiff_t i3 = begin_3; i3 < end_3; i3++) {
            for (ptrdiff_t i4 = begin_4; i4 < end_4; i4++) {
              for (ptrdiff_t i5 = begin_5; i5 < end_5; i5++) {
                for (ptrdiff_t i6 = begin_6; i6 < end_6; i6++) {
                  if constexpr (std::is_same<typename Policy::work_tag,
                                             void>::value)
                    functor(i0, i1, i2, i3, i4, i5, i6);
                  else
                    functor(typename Policy::work_tag(), i0, i1, i2, i3, i4, i5,
                            i6);
                }
              }
            }
          }
        }
      }
    }
Pennycook commented 3 years ago

Member functions of sub-groups only support range<1> or id<1> return types, so I don't think the dimensionality of the nd_range will impact any of the semantics.

But this is always more subtle than it looks (hi, memory coherency and forward progress guarantees), and I'm not an expert. I know @Pennycook has worked a lot on this topic, so maybe he has more insight.

There's definitely subtlety here, but I don't think it's much worse than with the existing 2- and 3-dimensional ranges.

The relevant issue here is how a given work-group is divided into sub-groups, and the SYCL specification allows for many different implementations. A work-group of size {8, 4} executed on an NVIDIA GPU will (probably) map to a single warp, because CUDA wraps around when dividing CUDA blocks into warps. The same work-group executed on a CPU will (probably) map to a nested loop where only the inner-most loop is vectorized, i.e.:

for (int i = 0; i < 8; ++i) { // sequential loop within a hardware thread
  for (int j = 0; j < 4; ++j) { // vectorized loop
    // execute user's kernel functor
  }
}

Although the reality is much more complicated than this (due to things like how barriers are implemented), this isn't a terrible mental model.

Understanding how the work-group gets broken into sub-groups is important when calling things like sub-group collective functions (e.g. reductions) as well as for the sorts of complex forward progress guarantee issues you mentioned -- if you don't know which work-items are doing what work, you can't rely on something like a reduction to give you the result that you expect.

The best way to write a portable kernel using sub-groups today is to make sure that the inner-most dimension is a multiple of the sub-group size. Then most of the problems go away -- although nothing is guaranteed by the standard, all implementations I'm aware of will handle this case in the same way. I think that advice would extend to these higher-dimensionality kernels, but it's worth noting that it might place limits on the work-group sizes you could realistically use: for example, if your maximum work-group size were 256 and you needed your inner-most dimension to be divisible by 32, a 7-dimensional work-group is probably going to end up with a size like {1, 1, 1, 2, 2, 2, 32}.
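To make that constraint concrete, here is a tiny host-side sketch (plain C++; the helper name and the hard-coded limits are illustrative assumptions, not anything queried from a real device) that checks whether a candidate 7-D work-group shape is usable under this advice:

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <numeric>

// Check the two constraints discussed above: the total work-group size
// (product of all 7 extents) must not exceed the device maximum, and the
// inner-most extent must be a multiple of the assumed sub-group size.
bool feasible(const std::array<std::size_t, 7>& wg,
              std::size_t max_wg_size, std::size_t sub_group_size) {
  std::size_t total = std::accumulate(wg.begin(), wg.end(), std::size_t{1},
                                      std::multiplies<>{});
  return total <= max_wg_size && wg.back() % sub_group_size == 0;
}
```

With max_wg_size = 256 and sub_group_size = 32, the shape {1, 1, 1, 2, 2, 2, 32} from the example passes, while making the inner-most dimension smaller (or the overall product larger) fails.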

fraggamuffin commented 2 years ago

move to PR by TD