Open zasdfgbnm opened 1 year ago
cc: @mmigdal-nv tracking the thing you were working on as an issue
I have seen some non negligible performance regressions from using the src-size version of cp.async. Let me gather some numbers tomorrow. It could still be better to discard/zero out the unused bytes from the kernel - and keep this cp.async variant for the last cp.async, to not cause OOB accesses... even though in practice they might never cause crashes due to memory allocation alignment
I remember checked your branch at some point, and you were doing a if
branch inside the cpAsyncCa
to dispatch different variants of cp.async. Could that be the reason? I think we should do dispatching at compile time, and the src-size should be computed as something like size - first_index
.
and you were doing a if branch inside the cpAsyncCa to dispatch different variants of cp.async.
Yes, IIRC it was because using src_size=0 or 16
was slower than the version that used the predicate. So I tried using as much of the predicated version as possible
Can nvFuser lower misaligned vectorize to cp async?
Can nvFuser lower misaligned vectorize to cp async?
Not at this moment, but I see no significant technical barrier.
Can nvFuser lower misaligned vectorize to cp async?
No, and I don't think misaligned vectorize makes sense for this. For this case, it is aligned, just not a multiple of vector size on the edge. I think we should limit the parallel type Vectorize
to the case where the number of items is a multiple of vector size, because this has no special handling and fast. We should also limit MisalignedVectorize
to the case where we should handle both the base address and the end. For this case, it is neither Vectorize
nor MisalignedVectorize
, it is another vectorization type.
MisalignedVectorize
was implemented at some point, but got deprioritized later.
Currently, we supports two types of vectorization:
Vectorize
andMisalignedVectorize
.Vectorize
requires both the base address and the number of items to be a multiple of vector size.MisalignedVectorize
requires neither of them to be a multiple of vectorize size, and does shifting on the fly.There can be another type of vectorization, which requires the base address to be a multiple of vector size, but not the number of items. Supporting this vectorization needs to leverage a feature in
cp.async
: https://github.com/NVIDIA/Fuser/blob/main/csrc/ops/arith.h