StanfordLegion / regent-lang.org


Using "by field" partitions #2

Open LonelyCat124 opened 4 years ago

LonelyCat124 commented 4 years ago

The reference describes how these "by field" partitions can be created:

var p = partition(r.color_field, color_space)

Right now, it doesn't discuss how one should use them, e.g. something like:

for s in color_space do
  task(p[s])
end

or

-- run the task only on the subregion where the field has value 0
task(p[0]) 

Relatedly, I'm also not sure whether the resultant partition is structured or unstructured. This code:

  for s in f[0].ispace do
    c.printf("%lld %lld\n", s.x, s.y)
  end

suggests that it's an int2d index space, but I'm unclear whether you could make a similar index space in some other way, for example to make a halo of the index space created by the by-field partition?

elliottslaughter commented 4 years ago

A partition by field is like any other partition, so both of the usage examples you give should work.

Partitioning (by field or otherwise) doesn't change the type of index space you get, so it'll be 2D if the region is 2D, etc.
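
For concreteness, here's a minimal (untested) sketch of that pattern; the fspace, field, and task names are placeholders, not anything from your code:

  import "regent"

  local c = regentlib.c

  fspace cell {
    flag : int1d,   -- the field used as the "color" for the by-field partition
    val  : double
  }

  task mark_cells(r : region(ispace(int2d), cell))
  where writes(r.flag, r.val)
  do
    for p in r do
      r[p].flag = int1d((p.x + p.y) % 2)  -- e.g. 0 = interior, 1 = boundary
      r[p].val = 0.0
    end
  end

  task report(r : region(ispace(int2d), cell))
  where reads(r.val)
  do
    -- subregions keep the parent's index space type, so points here are int2d
    for p in r do
      c.printf("%lld %lld %f\n", p.x, p.y, r[p].val)
    end
  end

  task main()
    var r = region(ispace(int2d, {4, 4}), cell)
    mark_cells(r)

    var colors = ispace(int1d, 2)
    var p = partition(r.flag, colors)  -- disjoint partition by field

    -- launch one task per color...
    for s in colors do
      report(p[s])
    end
    -- ...or pick out a single subregion by field value
    report(p[0])
  end

  regentlib.start(main)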

Structured vs. unstructured shouldn't matter any more; that's an old distinction that doesn't really exist at this point. (There is still a ptr type, but it's nearly identical to int1d for all practical purposes.) It only really matters for how you do the indexing (i.e., whether i + 1 is a meaningful operation or not).

The main issue with a halo partition is that it is (presumably) going to be aliased. You can't create an aliased partition via "partition by field" since by definition all such partitions are disjoint. There are other ways to do it, though it depends on (again) exactly how you're doing the indexing.

Some of the general concepts should be covered in the Legion bootcamps, in case that would be helpful: https://legion.stanford.edu/bootcamp2017/

LonelyCat124 commented 4 years ago

So the case in question is a stencil code with operations that skip certain cells depending on their neighbours' "type" (which is either boundary land, boundary open, or ocean in this case, and at least at the moment is initialised once and never changes).

The minicase has only boundary and non-boundary cells, and the way I can think of to do this is:

1) In the fspace I created an extra flag: precompute which cells are involved in each kernel and store that in some field, and also store which cells are adjacent to them (i.e. the halo points).
2) Partition the space with an equal NxN partition, compute a halo image partition on the equal partition, and then compute by-field partitions for the fields required too.
3) For each of the NxN equal subregions, compute the intersection with the appropriate field subregion, and the same for the halo, e.g.:

  for color in partitioned_solution.colors do
    var partition_region = partitioned_solution[color] & f
    var partition_halo = partitioned_halos[color] & halo
    task(partition_region, partition_halo)
  end

This should mostly suffice; however, the repeated creation of partition_region and partition_halo would prevent tracing.

I assume the strategy would then be to store these partitions in some data structure and reuse them to allow tracing? I just can't work out how one could make an array of &region (or any other region-related type) in Terra. Do I use legion_logical_region_t or similar?

elliottslaughter commented 4 years ago

So, assuming I understand correctly, you can do this with a cross-product partition... which I see is not documented. I should fix that. In the meantime, you can see an example here:

https://github.com/StanfordLegion/legion/blob/master/language/tests/regent/run_pass/region_partition_cross_product.rg#L46
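
Roughly, the pattern in that test boils down to something like this untested sketch (again, all of the fspace, field, and task names are made up for illustration):

  import "regent"

  fspace cell {
    flag : int1d,   -- e.g. 0 = skip, 1 = compute
    val  : double
  }

  task init(r : region(ispace(int2d), cell))
  where writes(r.flag, r.val)
  do
    for p in r do
      r[p].flag = int1d((p.x + p.y) % 2)
      r[p].val = 0.0
    end
  end

  task work(r : region(ispace(int2d), cell))
  where reads writes(r.val)
  do
    for p in r do
      r[p].val += 1.0
    end
  end

  task main()
    var r = region(ispace(int2d, {8, 8}), cell)
    init(r)

    var blocks = ispace(int2d, {2, 2})
    var p_eq = partition(equal, r, blocks)            -- disjoint NxM blocks
    var p_flag = partition(r.flag, ispace(int1d, 2))  -- disjoint by-field partition
    var cp = cross_product(p_eq, p_flag)

    -- cp[b][f] is the subset of block b whose cells have flag value f, so each
    -- launch below touches only the "compute" cells of its block
    for b in blocks do
      work(cp[b][1])
    end
  end

  regentlib.start(main)

The intersections are computed once when the cross product is created, so the cp[b][f] subregions can be reused across iterations, which should also avoid the repeated partition creation that was blocking tracing.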

Having said that, do you actually need this partition? You could build the halos without considering the node type and just skip over it if you don't need it. I think the only issue would be that you might communicate more than you need to, but assuming you have different stencils for different kinds of elements, you're not actually wasting anything anyway...

Let me know if I've misunderstood your problem.

LonelyCat124 commented 4 years ago

I'll have a look into the cross product now, thanks.

For the halos I don't need these partitions (since the large region is sufficient if communication is ignored; in the real cases the "ocean" subregion is quite a dense portion of the overall space, so the halo region is probably still the entire domain), but for the updated regions I would. At the moment the Regent code for one task looks like:

  for point in velocity do
    if ( (grid_region[point].tmask + grid_region[point + {1, 0}].tmask > int1d(0))
         and (grid_region[point].tmask == int1d(1) and grid_region[point + {1, 0}].tmask == int1d(1)) ) then
      -- large stencil computation without writes to regions
      velocity[point].u_after = -- update from the temporaries computed above
    end
  end

The reason I want to create subregions for the different stencils is so I can do a __demand(__vectorize) (or potentially a __demand(__cuda)) for this loop, which is currently forbidden due to the loop-dependent if statement (whereas the OpenMP and/or MPI code can mask out the final update on processors with masking support, such as AVX-512).
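
To make that concrete: if the region handed to the task has already been restricted (via the cross product) to just the cells where the tmask test passes, the guard can be dropped entirely. A rough, untested sketch of what I mean (the fspace, the trivial loop body, and the task names are placeholders, and whether the real stencil body satisfies the vectorizer's restrictions is a separate question):

  import "regent"

  fspace vel {
    u       : double,
    u_after : double
  }

  -- assumes the velocity region it receives only contains cells where the
  -- tmask condition holds, so no per-point if statement is needed
  task update_u(velocity : region(ispace(int2d), vel))
  where reads(velocity.u), writes(velocity.u_after)
  do
    __demand(__vectorize)
    for point in velocity do
      velocity[point].u_after = 0.5 * velocity[point].u  -- placeholder for the real stencil
    end
  end

  task main()
    var velocity = region(ispace(int2d, {8, 8}), vel)
    fill(velocity.u, 1.0)
    fill(velocity.u_after, 0.0)
    update_u(velocity)
  end

  regentlib.start(main)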

Edit: Cross product should do exactly what I want to do (to cover that if statement).

Also, regarding the documentation: the reference discusses the standard library, but I've not been able to use it successfully - is there some sort of import or require that is needed? As far as I know I have an up-to-date master (from within the last couple of days).

LonelyCat124 commented 4 years ago

Edit: Cross product of different regions works exactly how I thought (that is, you get the cross product of the index space, applied to the first region).

elliottslaughter commented 4 years ago

You shouldn't need any special import; it's built into the language (not a library).

elliottslaughter commented 4 years ago

Aside from the documentation of cross product, is there anything else unresolved related to this issue?

LonelyCat124 commented 4 years ago

No - I think the std.format documentation could also do with mentioning that local format = require("std/format") (or similar) is needed - without that line I couldn't use format.println (or similar).
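
For reference, the minimal pattern that works for me is just:

  import "regent"

  local format = require("std/format")

  task main()
    -- without the require above, format is not in scope
    format.println("2 + 2 = {}", 2 + 2)
  end

  regentlib.start(main)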

LonelyCat124 commented 4 years ago

One other question about cross product - does it prevent using __demand(__index_launch)?

I'm trying:

  var partitioned_2N2M1_velocity = partition(equal, _2N2M1_velocity[int2d({0,0})], partition_space)
  var loop_vel_ufield_partition = partition(loop_conditions_data.compute_vel_ufield, ispace(int1d, 1))
  var vel_ufield_partitions = cross_product(partitioned_2N2M1_velocity, loop_vel_ufield_partition)

  __demand(__trace, __index_launch)
  for part in partition_space do
    update_velocity_ufield(vel_ufield_partitions[part][0], ...)
  end

where partitioned_2N2M1_velocity and loop_vel_ufield_partition are defined over different field spaces. When I just used partitioned_2N2M1_velocity it was fine - but now I get:

algorithm.rg:244: loop optimization failed: argument 1 is not provably projectable or invariant
       update_velocity_ufield(vel_ufield_partitions[part][0],

Without index launching the code runs correctly. If I'm tracing, I'm not sure how valuable the index launch is anyway, but I wasn't expecting this to block it.

elliottslaughter commented 4 years ago

I think you'd need this feature for the loop you're trying to write: https://github.com/StanfordLegion/legion/issues/845

Depending on which of (1-3) in that issue you need, I may be able to put in the support to do this fairly quickly.

How far do you need to scale? I do expect index launches to be important for scaling, even with tracing.

LonelyCat124 commented 4 years ago

I think it's the second case - I will always be using the result of a cross product of an equal NxM partition with a 1-partition (by field), so both are disjoint and the result should therefore be an {NxM} x 1 disjoint partition, and the loop would be over the {NxM}. Maybe it's the first? I'm not sure.

At the moment I'm focusing on single-node performance (32 cores, Broadwell & Skylake; the goal is to match the Fortran OpenMP performance), and once I'm happy with that, the short-term goal is to look at multi-node performance.

elliottslaughter commented 4 years ago

So using a partition in this way will always involve a certain amount of branching; it's just happening inside of the compiler instead of in user code. It's not too difficult to believe that branching inside of the loop is better than branching outside of the loop (i.e. that's what the partition does), because the latter involves a ramp-up/ramp-down loop when the loop boundaries do not permit it to vectorize in whole units.

But let's move further discussion (if it's not about documentation) to the main Legion issue tracker.