Closed by e-kayrakli 11 months ago
Note that I have some refactor/cleanup wishes that are somewhat related to (2). We might consider doing that refactor first. See https://github.com/Cray/chapel-private/issues/5493.
Could another possible approach involve a packed buffer of data sent with a single GET/PUT, with a dedicated kernel to do the packing/unpacking as needed? I think this would extend more easily to N-dimensional arrays, rather than constraining ourselves to the 2D/3D builtins. We'd also need a host implementation of the same thing.
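A minimal CUDA sketch of the packing side of that idea, assuming a 1-D strided slice of doubles; all names here are illustrative rather than actual Chapel runtime symbols:

```cuda
#include <cuda_runtime.h>

// Gather every `stride`-th element of `src` into a contiguous staging
// buffer so the whole slice can be moved with a single copy (the GET/PUT).
__global__ void packStrided(const double *src, double *staging,
                            size_t stride, size_t numElems) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < numElems)
    staging[i] = src[i * stride];   // one global read + one global write
}

// Host-side driver (hypothetical): pack on the device, then copy the
// packed buffer out with a single bulk transfer.
void packedCopyToHost(const double *devSrc, double *hostDst,
                      size_t stride, size_t numElems) {
  double *staging;
  cudaMalloc((void **)&staging, numElems * sizeof(double));

  const int threads = 256;
  const int blocks  = (int)((numElems + threads - 1) / threads);
  packStrided<<<blocks, threads>>>(devSrc, staging, stride, numElems);

  cudaMemcpy(hostDst, staging, numElems * sizeof(double),
             cudaMemcpyDeviceToHost);
  cudaFree(staging);
}
```

The unpacking direction would be the mirror image: one bulk transfer into a staging buffer, followed by a scatter kernel.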
For the time being I plan on just going with a simple loop of memcpys.
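For a 2-D slice, a minimal sketch of that loop might look like the following. The function and parameter names are made up for illustration; the Chapel runtime would go through its own wrappers (e.g. chpl_gpu_memcpy) rather than calling CUDA directly like this.

```cuda
#include <cuda_runtime.h>

// Copy a sliceRows x sliceCols sub-block of a 2-D device array into a
// contiguous host buffer, one contiguous row per memcpy.
void copy2DSliceToHost(double *hostDst,        // contiguous destination
                       const double *devSrc,   // base of the full device array
                       size_t srcRowStride,    // elements per row of the full array
                       size_t sliceRows, size_t sliceCols,
                       size_t firstRow, size_t firstCol) {
  for (size_t r = 0; r < sliceRows; r++) {
    const double *src = devSrc + (firstRow + r) * srcRowStride + firstCol;
    double *dst = hostDst + r * sliceCols;
    cudaMemcpy(dst, src, sliceCols * sizeof(double), cudaMemcpyDeviceToHost);
  }
}
```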
That's interesting. My not-so-well-thought-out initial response is that kernel performance is typically bad when all it is doing is memory access. This approach would mean the kernel will do a global (GPU) memory read followed by a write while unpacking. Given that you'll have to do it elementwise and not in bulk, I am suspicious of its performance. OTOH, AMD optionally uses kernels to move data around between GPUs, at least with older versions of ROCm (see https://github.com/chapel-lang/chapel/issues/23549, but probably too much into the weeds). So, maybe I am exaggerating the potential performance issue.
Another aspect of using a kernel under the hood is that it can result in unpredictable performance with communication/computation overlap. Where you expect an overlap, you may not get it because your communication also involves a kernel launch. Again, probably being overly cautious here.
Though I can totally see how this could be a general approach that covers arrays with rank >3. Maybe we could use it as the fallback for many-dimensional arrays down the road?
There are definitely a bunch of unknowns when it comes to this kind of approach, so some kind of investigation would be worthwhile before diving in. I have to imagine though that we're not the only ones that want to deal with these kinds of operations for rank>3 arrays, so I'm hoping there's some prior work we can build off of out there.
I'm not moving the debate forward, but I'd like to point out that I ran into this issue today while trying to copy a non-contiguous subset of indices to/from the GPU (toy code below).
```chapel
var A: [1..5] int = noinit;
on here.gpus[0] {
  var B: [1..10] int = [0,1,0,1,1,1,0,0,1,0];
  A = B[[2,4,5,6,9]]; // expects: A = 1 1 1 1 1
}
```
I happen to be working on resolving this issue at the moment and expect a fix to be in the upcoming 1.33 release.
That said, I suspect your program is more about Chapel's concept of "promotion" rather than the non-contiguous data transfer problem described in this issue. I'll see if I can get a team member with more GPU experience to confirm whether that is the case.
https://github.com/chapel-lang/chapel/pull/23848 may have fixed the original bug, but I think it would be good to confirm whether coral is fixed before closing.
We've confirmed that the original bug blocking coral has been fixed, so I'm going to close this issue.
Guillaume, a team member has confirmed that your sample program involves an issue with promotion rather than the kind of array slicing in this issue. I believe @e-kayrakli will be opening a separate issue to track the promotion bug.
Has two issues:

Note that this currently blocks coral from using multiple GPUs, because it requires a large, 3D array to be sliced up into pieces.

In terms of implementing (2), a straightforward approach of looping over `chpl_gpu_memcpy` calls is definitely a viable implementation. If we want to improve on that, we should look into things like:

- `cuMemcpy2DAsync`: but this looks like it accepts "pitches" only as large as those that can be returned by `cuMemAllocPitch`, which we don't use. I suspect the pitches that can be returned from that are much smaller than the "strides" we need for our purposes here.
- `cuMemcpy2DUnaligned`: doesn't have the same limitation, but can be slower and also does not have an `async` version.

We could also choose between the two strategies depending on the stride we have.
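For reference, a minimal sketch of what the `cuMemcpy2DAsync` route might look like for a 2-D slice, assuming the device-side row stride is expressed as `srcPitch`; names and parameters are illustrative, and the pitch restriction mentioned above would still apply:

```cuda
#include <cuda.h>
#include <string.h>

// Move a height x widthBytes sub-block of a 2-D device array into a
// contiguous host buffer with a single driver-API call.
CUresult copy2DSliceAsync(void *hostDst, CUdeviceptr devSrc,
                          size_t devPitchBytes,  // bytes between rows of the full device array
                          size_t widthBytes,     // bytes per row of the slice
                          size_t height,         // rows in the slice
                          CUstream stream) {
  CUDA_MEMCPY2D copy;
  memset(&copy, 0, sizeof(copy));  // offsets (srcXInBytes, srcY, ...) default to 0

  copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.srcDevice     = devSrc;         // already offset to the slice's first element
  copy.srcPitch      = devPitchBytes;  // the "stride" between consecutive slice rows

  copy.dstMemoryType = CU_MEMORYTYPE_HOST;
  copy.dstHost       = hostDst;
  copy.dstPitch      = widthBytes;     // destination is densely packed

  copy.WidthInBytes  = widthBytes;
  copy.Height        = height;

  return cuMemcpy2DAsync(&copy, stream);
}
```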