NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Matmul schedules require vectorization analysis; it is currently hard-coded #2083

Open · kevinstephano opened 6 months ago

kevinstephano commented 6 months ago

#682: Issue related to vectorization

@drzejan2 had a PR: https://github.com/NVIDIA/Fuser/pull/807

zasdfgbnm commented 6 months ago

Also: https://github.com/NVIDIA/Fuser/issues/983

jacobhinkle commented 6 months ago

I want to be sure I understand this issue properly.

We inline loads of C role tensors (e.g. bias) in a loop with the epilogue computation and the writes to the output, and we propagate transforms and parallelization such that they all share a vectorization factor. So the runtime-input analysis we need is to find the maximum supported vectorization factor of the C role inputs, not to exceed the vectorization factor of the output.
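For concreteness, a minimal sketch of that runtime analysis, assuming only pointer alignment and the contiguous innermost extent matter (outer-dimension strides are ignored here; all names are illustrative, not nvFuser's actual API):

```cpp
#include <algorithm>
#include <cstdint>

// Largest power-of-two vectorization factor (up to max_factor) such that
// the factor divides the contiguous innermost extent and every vector
// access stays naturally aligned in global memory.
std::int64_t maxVecFactor(
    const void* data,           // base pointer of the runtime input
    std::int64_t inner_extent,  // extent of the contiguous innermost dim
    std::int64_t dtype_bytes,   // element size, e.g. 2 for half
    std::int64_t max_factor) {  // e.g. 16 bytes / dtype_bytes
  auto addr = reinterpret_cast<std::uintptr_t>(data);
  std::int64_t vec = max_factor;
  while (vec > 1 &&
         (inner_extent % vec != 0 || addr % (vec * dtype_bytes) != 0)) {
    vec /= 2;
  }
  return vec;
}

// A C role input then gets the smaller of its own factor and the output's:
// std::int64_t bias_vec = std::min(maxVecFactor(bias, N, 2, 8), out_vec);
```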

Question: suppose we have an HSH matmul and would like to use vec_size=8 for the output, but the bias is only contiguous with vec_size=4. Would we prefer to set the output to vec_size=4, or vectorize the bias with 4 with an unrolled length-2 outer loop and leave the output with vec_size=8?
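To make the second option concrete, here is a plain-C++ illustration of the two access shapes, with `memcpy` standing in for vectorized loads/stores (names and types are stand-ins, not generated code):

```cpp
#include <cstdint>
#include <cstring>

using half_t = std::uint16_t;  // stand-in for a 16-bit float element

// Option 1: drop the output to vec_size=4 to match the bias.
// Both tensors are accessed as two 8-byte (4-wide) transactions.
void option1(const half_t* bias, half_t* out) {
  for (int u = 0; u < 2; ++u) {  // unrolled outer loop covers 8 elements
    half_t b[4];
    std::memcpy(b, bias + 4 * u, sizeof(b));  // 4-wide load
    std::memcpy(out + 4 * u, b, sizeof(b));   // 4-wide store
  }
}

// Option 2: keep vec_size=8 (16 bytes) for the output, and read the bias
// through a length-2 unrolled outer loop of 4-wide loads.
void option2(const half_t* bias, half_t* out) {
  half_t b[8];
  for (int u = 0; u < 2; ++u) {
    std::memcpy(b + 4 * u, bias + 4 * u, 4 * sizeof(half_t));  // 4-wide loads
  }
  std::memcpy(out, b, sizeof(b));  // single 8-wide (16-byte) store
}
```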

zasdfgbnm commented 6 months ago

> We inline loads of C role tensors (e.g. bias) in a loop with the epilogue computation and the writes to the output, and we propagate transforms and parallelization such that they all share a vectorization factor. So the runtime-input analysis we need is to find the maximum supported vectorization factor of the C role inputs, not to exceed the vectorization factor of the output.

Yes, but I think the issue is more than that. Besides the issue you mentioned, we also have https://github.com/NVIDIA/Fuser/blob/65c5b6ea3fe4589a56dc3d49babe2b7591b55d98/csrc/scheduler/matmul.cpp#L554-L561

which hardcodes the gmem loads of A and B to a vectorization factor of 8. This needs to be modified too.
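A sketch of the direction that implies, with the hard-coded 8 replaced by the analyzed factor. `scheduleOperandLoad` and `supported_vec` are hypothetical (the `maxVecFactor`-style analysis above could feed the latter); `split` and `axis(-1)->parallelize(ParallelType::Vectorize)` are existing nvFuser scheduling calls:

```cpp
#include <algorithm>
#include <cstdint>

// TensorView and ParallelType are nvFuser types (includes omitted here);
// the helper itself is an assumption about how the fix could look.
void scheduleOperandLoad(TensorView* tv, std::int64_t supported_vec) {
  // 16B / 2B (half) = 8 becomes an upper bound instead of a constant.
  std::int64_t vec = std::min<std::int64_t>(supported_vec, 8);
  tv->split(-1, vec);
  tv->axis(-1)->parallelize(ParallelType::Vectorize);
}
```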

zasdfgbnm commented 6 months ago

> Question: suppose we have an HSH matmul and would like to use vec_size=8 for the output, but the bias is only contiguous with vec_size=4. Would we prefer to set the output to vec_size=4, or vectorize the bias with 4 with an unrolled length-2 outer loop and leave the output with vec_size=8?

Good question, and I don't know the answer. My suggestion: how often do these cases happen in the real world? If they are rare, then whatever takes the least amount of effort to make it functionally correct is the way to go. If both options have similar difficulty, then just randomly pick one.

jacobhinkle commented 6 months ago

> which hardcodes the gmem loads of A and B to a vectorization factor of 8. This needs to be modified too.

Ah right, of course. So the point would be to make it safe to run with any shape and alignment of inputs, including operands and bias-like inputs. Makes sense.

jacobhinkle commented 6 months ago

> If both options have similar difficulty, then just randomly pick one.

I would imagine it's easier to just use the same minimum vec size for both the bias and the output. Then again, this estimate of difficulty is itself somewhat random, so this might count as randomly picking an approach :rofl:

jacobhinkle commented 6 months ago

Next question: I am playing in a branch with a test having misaligned rows, e.g. K=513. This leads to a supported vectorization factor of 1 for A and B. In cases like this, when we do not vectorize to 16 bytes, we cannot use cp.async.cg [^1] [^2]. We could use cp.async.ca if there are at least 4 bytes by setting CacheOp::AllLevels instead of CacheOp::Global, but in the case of vec_size=1 we have only 2 bytes. In that case I suppose we might want to disable circular buffering, although we could still double buffer with async_gmem_load_operands = false. What do you think about this scenario @zasdfgbnm?

References:

[^1]: cp.async docs
[^2]: nvfuser validation of cp-size
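For illustration, a hedged sketch of that selection logic; `GmemLoadConfig` and `pickLoadConfig` are hypothetical names, while the cp.async size limits (16 bytes for .cg; 4, 8, or 16 bytes for .ca) match the PTX docs cited above:

```cpp
#include <cstdint>

struct GmemLoadConfig {
  bool use_cp_async = false;     // async copy vs plain global load
  bool circular_buffer = false;  // multi-stage pipelining of gmem loads
  bool all_levels_cache = false; // CacheOp::AllLevels (.ca) vs Global (.cg)
};

GmemLoadConfig pickLoadConfig(std::int64_t vec_bytes) {
  GmemLoadConfig cfg;
  if (vec_bytes == 16) {
    cfg = {true, true, false};  // cp.async.cg requires 16-byte copies
  } else if (vec_bytes >= 4) {
    cfg = {true, true, true};   // cp.async.ca allows 4-, 8-, or 16-byte copies
  } else {
    // e.g. vec_size=1 with half gives only 2 bytes: fall back to plain
    // loads (async_gmem_load_operands = false), disabling circular
    // buffering as discussed above, though double buffering still works.
    cfg = {false, false, false};
  }
  return cfg;
}
```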

zasdfgbnm commented 6 months ago

> Next question: I am playing in a branch with a test having misaligned rows, e.g. K=513. This leads to a supported vectorization factor of 1 for A and B. In cases like this, when we do not vectorize to 16 bytes, we cannot use cp.async.cg. We could use cp.async.ca if there are at least 4 bytes by setting CacheOp::AllLevels instead of CacheOp::Global, but in the case of vec_size=1 we have only 2 bytes. In that case I suppose we might want to disable circular buffering, although we could still double buffer with async_gmem_load_operands = false. What do you think about this scenario @zasdfgbnm?

Makes sense to me. I cannot think of a better solution.

zasdfgbnm commented 5 months ago

Reopening this, because the epilogue is not done yet.