chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 420 forks source link

Confusing behavior of StencilDist on writes to ghost cells #11741

Open e-kayrakli opened 5 years ago

e-kayrakli commented 5 years ago

(This is to capture a discussion with @benharsh over on gitter channel)

Current behavior of StencilDist is confusing. Assume the following is run by two locales:

var space = {0..7};
var dom = space dmapped Stencil(space, fluff=(1,), periodic=true);
var arr: [dom] int;

forall a in arr {
  a = here.id+10;
}

arr.updateFluff();

writeln(arr[-1]);   //read from the fluff; prints the value in arr[7]
arr[-1] = 7;         //write to the fluff; doesn't modify arr[7]
arr[4] = 7;          //write to the fluff; modifies arr[4] correctly

arr.updateFluff();

writeln(arr);

I expect arr[-1] to behave similarly when it is read/written, but it is not the case. I understand that such behavior can be useful in some applications, but I think it breaks the generality of StencilDist, a distribution that is packaged with the language.

e-kayrakli commented 5 years ago

A real-world motivating example is in https://github.com/chapel-lang/chapel/blob/master/test/users/npadmana/npb-mg/mg.chpl#L301

These two lines are in the following loop:

for k in vectorizeOnly(klo..khi) {
  const val1 = locSrc[i,j,k];
  const val2 = valA(i,j,k);
  const val3 = valB(i,j,k);

  const myVal2 = val2 + locSrc[i,j,k-1] + locSrc[i,j,k+1];

  dest.localAccess[i,j,k] += w0*val1 + w1*myVal2 + w2*val3;

  // Update previous and next destinations with this iteration's
  // val2 and val3
  const temp = w2 * val2 + w3 * val3;
  dest.localAccess[i,j,k-1] += temp;
  dest.localAccess[i,j,k+1] += temp;
}

where dest is a periodic Stencil mapped array in all dimensions with radius 1. The last two lines in the loop look like they are updating external ghost zones, which should be wrapped around in that dimension. But it is not the case, updates on external ghost zones are not done on the locations of the actual data. (Doing so would cause the validation to fail on this benchmark, therefore if the behavior changes, this benchmark need to be modified as well)

e-kayrakli commented 5 years ago

I believe the behavior should change. But if it doesn't or if it is not going to happen in the near future, change in the docs can also help. In my opinion:

bradcray commented 5 years ago

With the caveat that I'm not as expert in StencilDist as others (ahem... @benharsh), here are my impressions:

When you say:

arr[4] = 7;          //write to the fluff; modifies arr[4] correctly

you say "write to the fluff", but this is actually writing to the original arr[4] value directly, not the fluff copy of it, isn't it?

By the same token, I might think that:

arr[-1] = 7;         //write to the fluff; doesn't modify arr[7]

should be illegal since it's outside of the bounding box of arr's domain. Yet, I'd still permit reads to the element:

writeln(arr[-1]);   //read from the fluff; prints the value in arr[7]

(and since we can distinguish between reads and writes to array elements, we should be able to support this distinction in behavior).

More specifically, I don't view fluff / ghost cells / halos as being automatically kept coherent with the values that they're mirroring, but caches of those values. As such, I think the caches should be read-only, not writeable via default actions (without taking more heroic steps to interact with the fluff values directly).

What do you guys think about this proposal, @e-kayrakli and @benharsh?

Having the fluff values automatically kept coherent with their original values could be particularly problematic in a 2D distribution where a given corner value would belong to three locales' fluff regions (its orthogonal neighbors and the diagonal neighbor), s.t. if anyone modified one, all three replicants would have to be updated?

benharsh commented 5 years ago

I'm open to redefining arr[-1] to halt for writes (i.e. when the 'ref' return-intent option is invoked).

I think we still want some way to modify the ghost cell data for certain use-cases (e.g. updating position offsets as we do in miniMD). This could be done with some kind of explicit method and array-fowarding (e.g. arr.ghost[-1] = 7). Or we could allow the user to somehow express a way that ghost cells should be mutated during updateFluff.

bradcray commented 5 years ago

I'm definitely open to some sort of method for explicitly reading/writing the halo cells. I'm guessing we'd want to restrict it to only support accessing local halo cells given the "a given index may have multiple halo copies" issue I mentioned above (?). If so, we could potentially re-use .localAccess() for this purpose though it might be too weird to let .localAccess() refer to different elements than a normal access is able to.

e-kayrakli commented 5 years ago

I think I am biased in the way I think as what I am trying to do in my work is to have all this stuff handled by the runtime where consistency issues handled automatically. Things get blurry when you try to do it without runtime support.

More specifically, I don't view fluff / ghost cells / halos as being automatically kept coherent with the values that they're mirroring, but caches of those values. As such, I think the caches should be read-only, not writeable via default actions (without taking more heroic steps to interact with the fluff values directly).

I think making arr[-1] = 7 illegal (at least when boundsChecking enabled) is an option.

Further, I wonder if it is easier to have these caches write-through instead of write-back. That way at least things are locally consistent. i.e. I can write to the fluff which will update both the fluff and the origin data. I don't think this is as complicated as automatic coherence.

As for the corner values which can be in the fluff of multiple locales, write-through should also address that. My writes to the fluff is visible to me and the owner. If between updateFluff calls multiple locales writes to the same location, that race condition is a user error.

But still, write-through may also require some support from runtime and compiler.

bradcray commented 5 years ago

Engin: When you say "what I am trying to do in my work", are you trying to write some different stencil computation, or to lean on the Stencil distribution to do something other than stencils? Can you say more about what is motivating you to write to the fluff to begin with?

e-kayrakli commented 5 years ago

I am not particularly worried about being able to write to the fluff. For this specific issue, the snippet in my first comment was the trigger.

for k in vectorizeOnly(klo..khi) {
  const val1 = locSrc[i,j,k];
  const val2 = valA(i,j,k);
  const val3 = valB(i,j,k);

  const myVal2 = val2 + locSrc[i,j,k-1] + locSrc[i,j,k+1];

  dest.localAccess[i,j,k] += w0*val1 + w1*myVal2 + w2*val3;

  // Update previous and next destinations with this iteration's
  // val2 and val3
  const temp = w2 * val2 + w3 * val3;
  dest.localAccess[i,j,k-1] += temp;
  dest.localAccess[i,j,k+1] += temp;
}

This is from users/npadmana/npb-mg/mg.chpl The last two lines in the loop are very confusing to me. At the ends of the loop, they look like they are writing to the fluff (here the distribution is periodic), but they are not.

That's why I am OK with an error message, as well.

e-kayrakli commented 5 years ago

Another slightly milder change that I can think of is to have a warning only when boundsChecking is enabled. Arguably this is about OOB accesses and whether they should be allowed in periodic stencils. So, it feels like a good way to inform the user. Going back to my case where I encountered this, such a warning would have helped me go "a-ha!" much earlier.

P.S. I still believe the benchmark implementation should be modified and I can create a PR just to show the changes that I think should be done in the near future. You can shoot down the specific ideas under that PR :)

bradcray commented 5 years ago

I was imagining that, like other bounds checks, we'd only do it when checking was on. But I wouldn't suggest to people that they should turn off bounds checking if they wanted to write the the fluff; I'd just say that it's an incorrect program that we weren't preventing them from writing.

I'm definitely up for seeing a PR that improves the clarity / purity of the benchmark.

npadmana commented 5 years ago

I just wanted to add in my $0.02.

First, I agree that the benchmark is confusing. I just spent a bunch of time staring at it, and it wasn't immediately obvious what I was doing here. But here's what I recall. What the code is doing here is an optimization on the stencil convolution to minimize the number of accesses to every element. I remember this being non-intuitive the first time I wrote it, and it still is!

Where this gets confusing is that (I think) I'm using the dest array as scratch space here, including the ghost cells. Hence the localAccess, which means that the two lines at the end are just modifying the local arrays/ghost cells.

In this particular case, I do not want the updates to propagate into the full array. In fact, that would result in a race as well, since the loop over k must be done serially (hence the vectorizeOnly step).

Sorry if this just is repeating what's been already said....

npadmana commented 5 years ago

A few more thoughts...

I would push strongly for arr[-1] to be legal for StencilDist with periodic boundary conditions (assuming a domain of {0..N}). I'd similarly argue that arr[N+1] should also be allowed. It significantly simplifies user code which otherwise gets littered with checks and manual wrap-arounds.

I actually also find the ability to write locally to ghost cells very useful, and I would love to have the ability to accumulate the values of the ghosts back onto the real cell. The specific case that I keep running into is mass deposition onto a grid, where a particle gets spread out over multiple grid cells. I can do the writing with localAccess, but the step that always trips me up is how to efficiently accumulate from the ghost cells back. It's easy in the case where the stencil is 1D, but I don't understand the internals of StencilDist well enough to see how to do it in the general case. Is there a way to do this?

e-kayrakli commented 5 years ago

Thanks @npadmana! I think all these confirms what I understand from the implementation. And since I wasn't fully aware of StencilDist's semantics it took me a while to come to that understanding.

As for vectorizeOnly: precisely because the k loop must be done serially, it shouldn't be vectorized (which is a form of parallelism). As @mppf mentioned over gitter, the loop is not vectorized unless you throw in an additional compiler flag, that's why I believe this code is functioning correctly in the nightlies. But haven't checked myself to see if it breaks once vectorization is enabled. But anyways, if I am able to keep my promise and open a PR about this, I can elaborate more on there.

npadmana commented 5 years ago

@e-kayrakli --- agreed that the vectorizeOnly call is a bug.

benharsh commented 5 years ago

I actually also find the ability to write locally to ghost cells very useful, and I would love to have the ability to accumulate the values of the ghosts back onto the real cell. The specific case that I keep running into is mass deposition onto a grid, where a particle gets spread out over multiple grid cells. I can do the writing with localAccess, but the step that always trips me up is how to efficiently accumulate from the ghost cells back. It's easy in the case where the stencil is 1D, but I don't understand the internals of StencilDist well enough to see how to do it in the general case. Is there a way to do this?

I implemented a very simple version of this for an elegant CoMD study version a while ago. I made a variant of the Stencil distribution called "AccumStencilDist", in which updateFluff() sent ghost cells to neighbors and those cells were added to the 'real' data with +=.

If I remember correctly, this was used in the case when atoms moved from a 'real' cell into a ghost cell. When 'updateFluff' was called, this atom was sent to the neighbor locale and appended to the corresponding 'real' cell's list of atoms.

It's been a while since I looked at the code, but I'd be happy to explain it further if you like.

npadmana commented 5 years ago

@benharsh -- how hard would it be to include this as an eg. accumFromFluff method directly in StencilDist? I quickly looked, and it seems like there are a number of internal routines that do not exist in StencilDist. I'm just not sure how far out of sync the two distributions have gone.

benharsh commented 5 years ago

It wouldn't be easy. AccumStencilDist's updateFluff currently exchanges one dimension at a time (maybe to avoid races when accumulating?), and I think StencilDist sends to all of its neighbors at once because there are no overlapping elements. Each probably has different performance consequences.

I think it would be interesting to try and find a way to more easily allow advanced users to write their own updateFluff implementations without having to touch StencilDist.chpl directly.

npadmana commented 5 years ago

Sorry -- I wasn't being clear here. I meant having a separate method accumFromFluff which did the accumulation step. At the end of this, the fluff caches would be out-of-sync, so the user would have to call updateFluff to clean up.

benharsh commented 5 years ago

Ah, I see. If performance wasn't a concern, I think that could be implemented relatively easily. For good performance I think we'd want some kind of buffer to send to, from which elements could be accumulated onto the real data. There are some existing buffers in StencilDist's classes, maybe those could be re-used?

bradcray commented 3 years ago

Just noting that a user bumped into a confusing behavior by modifying ghost cells in their Stencil-distributed arrays in this Discourse thread: https://chapel.discourse.group/t/block-stencil-dists-when-simulating-multi-locale-in-a-single-local-machine/5774

It's a bit odd in that it isn't behaving as the Stencil documentation advertises:

Any write to array data will be applied to the actual element, the same as if you were using a Block-distributed array.

And it's not as complicated as the examples in much of this thread because it doesn't involve the global boundary conditions, just the local ones between locales. I'm taking a look at the code to see why it's happening in this way.

bradcray commented 3 years ago

In the user issue I referenced above, I also found myself wanting Engin's notion of supporting "write-through" ghost cells. Initially, I was thinking that this would be trivial ("We'll just make dsiAccess() write two things rather than one!") but I forgot that dsiAccess() doesn't do the writing, it just returns a reference to the thing that should be written (so would need to somehow return two references to update if we wanted to update two copies of the array element). This is similar to something I've been wanting in another context, so I opened up https://github.com/chapel-lang/chapel/issues/17999 to capture both cases in one place.