NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
Other
261 stars 51 forks source link

Add a new vectorization parallel type #981

Open zasdfgbnm opened 1 year ago

zasdfgbnm commented 1 year ago

Currently, we supports two types of vectorization: Vectorize and MisalignedVectorize. Vectorize requires both the base address and the number of items to be a multiple of vector size. MisalignedVectorize requires neither of them to be a multiple of vectorize size, and does shifting on the fly.

There can be another type of vectorization, which requires the base address to be a multiple of vector size, but not the number of items. Supporting this vectorization needs to leverage a feature in cp.async: https://github.com/NVIDIA/Fuser/blob/main/csrc/ops/arith.h

Instruction cp.async allows optionally specifying a 32-bit integer operand src-size. Operand src-size represents the size of the data in bytes to be copied from src to dst and must be less than cp-size. In such case, remaining bytes in destination dst are filled with zeros. Specifying src-size larger than cp-size results in undefined behavior.

zasdfgbnm commented 1 year ago

cc: @mmigdal-nv tracking the thing you were working on as an issue

Michoumichmich commented 1 year ago

I have seen some non negligible performance regressions from using the src-size version of cp.async. Let me gather some numbers tomorrow. It could still be better to discard/zero out the unused bytes from the kernel - and keep this cp.async variant for the last cp.async, to not cause OOB accesses... even though in practice they might never cause crashes due to memory allocation alignment

zasdfgbnm commented 1 year ago

I remember checked your branch at some point, and you were doing a if branch inside the cpAsyncCa to dispatch different variants of cp.async. Could that be the reason? I think we should do dispatching at compile time, and the src-size should be computed as something like size - first_index.

mmigdal-nv commented 1 year ago

and you were doing a if branch inside the cpAsyncCa to dispatch different variants of cp.async.

Yes, IIRC it was because using src_size=0 or 16 was slower than the version that used the predicate. So I tried using as much of the predicated version as possible

mmigdal-nv commented 1 year ago

Can nvFuser lower misaligned vectorize to cp async?

naoyam commented 1 year ago

Can nvFuser lower misaligned vectorize to cp async?

Not at this moment, but I see no significant technical barrier.

zasdfgbnm commented 1 year ago

Can nvFuser lower misaligned vectorize to cp async?

No, and I don't think misaligned vectorize makes sense for this. For this case, it is aligned, just not a multiple of vector size on the edge. I think we should limit the parallel type Vectorize to the case where the number of items is a multiple of vector size, because this has no special handling and fast. We should also limit MisalignedVectorize to the case where we should handle both the base address and the end. For this case, it is neither Vectorize nor MisalignedVectorize, it is another vectorization type.

MisalignedVectorize was implemented at some point, but got deprioritized later.