Closed by e-kayrakli 11 months ago
Note that I have some refactor/cleanup wishes that are somewhat related to (2). We might consider doing that refactor first. See https://github.com/Cray/chapel-private/issues/5493.
Could another possible approach involve a packed buffer of data sent with a single GET/PUT, with a dedicated kernel to do the packing/unpacking as needed? I think this would extend more easily to N-dimensional arrays, rather than constraining ourselves to the 2D/3D builtins. We'd also need a host implementation of the same thing.
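A minimal CUDA sketch of the packing side of that idea, assuming a 1-D strided slice of doubles; all names here are illustrative rather than actual Chapel runtime symbols:

```cuda
#include <cuda_runtime.h>

// Gather every `stride`-th element of `src` into a contiguous staging
// buffer so the whole slice can be moved with a single copy (the GET/PUT).
__global__ void packStrided(const double *src, double *staging,
                            size_t stride, size_t numElems) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < numElems)
    staging[i] = src[i * stride];   // one global read + one global write
}

// Host-side driver (hypothetical): pack on the device, then copy the
// packed buffer out with a single bulk transfer.
void packedCopyToHost(const double *devSrc, double *hostDst,
                      size_t stride, size_t numElems) {
  double *staging;
  cudaMalloc((void **)&staging, numElems * sizeof(double));

  const int threads = 256;
  const int blocks  = (int)((numElems + threads - 1) / threads);
  packStrided<<<blocks, threads>>>(devSrc, staging, stride, numElems);

  cudaMemcpy(hostDst, staging, numElems * sizeof(double),
             cudaMemcpyDeviceToHost);
  cudaFree(staging);
}
```

The unpacking direction would be the mirror image: one bulk transfer into a staging buffer, followed by a scatter kernel.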
For the time being I plan on just going with a simple loop of memcpys.
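For a 2-D slice, a minimal sketch of that loop might look like the following. The function and parameter names are made up for illustration; the Chapel runtime would go through its own wrappers (e.g. chpl_gpu_memcpy) rather than calling CUDA directly like this.

```cuda
#include <cuda_runtime.h>

// Copy a sliceRows x sliceCols sub-block of a 2-D device array into a
// contiguous host buffer, one contiguous row per memcpy.
void copy2DSliceToHost(double *hostDst,        // contiguous destination
                       const double *devSrc,   // base of the full device array
                       size_t srcRowStride,    // elements per row of the full array
                       size_t sliceRows, size_t sliceCols,
                       size_t firstRow, size_t firstCol) {
  for (size_t r = 0; r < sliceRows; r++) {
    const double *src = devSrc + (firstRow + r) * srcRowStride + firstCol;
    double *dst = hostDst + r * sliceCols;
    cudaMemcpy(dst, src, sliceCols * sizeof(double), cudaMemcpyDeviceToHost);
  }
}
```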
That's interesting. My not-so-well-thought-out initial response is that kernel performance is typically bad when all it is doing is memory access. This approach would mean the kernel will do a global (GPU) memory read followed by a write while unpacking. Given that you'll have to do it elementwise and not in bulk, I am suspicious of its performance. OTOH, AMD optionally uses kernels to move data around between GPUs, at least with older versions of ROCm (see https://github.com/chapel-lang/chapel/issues/23549, but probably too much into the weeds). So, maybe I am exaggerating the potential performance issue.
Another aspect of using a kernel under the hood is that it can result in unpredictable performance with communication/computation overlap. Where you expect an overlap, you may not get it because your communication also involves a kernel launch. Again, probably being overly cautious here.
Though I can totally see how this could be a general approach that covers arrays with rank >3. Maybe we could use it as the fallback for many-dimensional arrays down the road?
There are definitely a bunch of unknowns when it comes to this kind of approach, so some kind of investigation would be worthwhile before diving in. I have to imagine though that we're not the only ones that want to deal with these kinds of operations for rank>3 arrays, so I'm hoping there's some prior work we can build off of out there.
I'm not moving the debate forward, but I'd like to point out that I ran into this issue today while trying to copy a non-contiguous subset of indices to/from the GPU (toy code below).
```chapel
var A: [1..5] int = noinit;
on here.gpus[0] {
  var B: [1..10] int = [0,1,0,1,1,1,0,0,1,0];
  A = B[[2,4,5,6,9]]; // expects: A = 1 1 1 1 1
}
```
I happen to be working on resolving this issue at the moment and expect a fix to be in the upcoming 1.33 release.
That said, I suspect your program is more about Chapel's concept of "promotion" rather than the non-contiguous data transfer problem described in this issue. I'll see if I can get a team member with more GPU experience to confirm whether that is the case.
https://github.com/chapel-lang/chapel/pull/23848 may have fixed the original bug, but I think it would be good to confirm whether coral is fixed before closing.
We've confirmed that the original bug blocking coral has been fixed, so I'm going to close this issue.
Guillaume, a team member has confirmed that your sample program involves an issue with promotion rather than the kind of array slicing in this issue. I believe @e-kayrakli will be opening a separate issue to track the promotion bug.
Has two issues:

Note that this currently blocks coral from using multiple GPUs, because it requires a large, 3D array to be sliced up into pieces.

In terms of implementing (2), a straightforward approach of looping over `chpl_gpu_memcpy` calls is definitely a viable implementation. If we want to improve on that, we should look into things like:

- `cuMemcpy2DAsync`: but this looks like it accepts "pitches" only as large as those that can be returned by `cuMemAllocPitch`, which we don't use. I suspect the pitches that can be returned from that are much smaller than the "strides" we need for our purposes here.
- `cuMemcpy2DUnaligned`: doesn't have the same limitation, but can be slower and also does not have an `async` version.

We could also choose between the two strategies depending on the stride we have.
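For reference, a minimal sketch of what the `cuMemcpy2DAsync` route might look like for a 2-D slice, assuming the device-side row stride is expressed as `srcPitch`; names and parameters are illustrative, and the pitch restriction mentioned above would still apply:

```cuda
#include <cuda.h>
#include <string.h>

// Move a height x widthBytes sub-block of a 2-D device array into a
// contiguous host buffer with a single driver-API call.
CUresult copy2DSliceAsync(void *hostDst, CUdeviceptr devSrc,
                          size_t devPitchBytes,  // bytes between rows of the full device array
                          size_t widthBytes,     // bytes per row of the slice
                          size_t height,         // rows in the slice
                          CUstream stream) {
  CUDA_MEMCPY2D copy;
  memset(&copy, 0, sizeof(copy));  // offsets (srcXInBytes, srcY, ...) default to 0

  copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  copy.srcDevice     = devSrc;         // already offset to the slice's first element
  copy.srcPitch      = devPitchBytes;  // the "stride" between consecutive slice rows

  copy.dstMemoryType = CU_MEMORYTYPE_HOST;
  copy.dstHost       = hostDst;
  copy.dstPitch      = widthBytes;     // destination is densely packed

  copy.WidthInBytes  = widthBytes;
  copy.Height        = height;

  return cuMemcpy2DAsync(&copy, stream);
}
```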