kevinstephano opened 6 months ago
I want to be sure I understand this issue properly.
We inline loads of `C`-role tensors (e.g. bias) in a loop with the epilogue computation and writes to the output, and we propagate transforms and parallelization such that they will share a vectorization factor. So the analysis of runtime inputs that we need is to find the maximum supported vectorization factor of the `C`-role inputs, not to exceed the vectorization factor of the output.
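To make the analysis concrete, here is a minimal sketch of how a "maximum supported vectorization factor" could be computed from a tensor's base address and innermost extent. This is an illustration only, not nvFuser's actual implementation; the function name and signature are made up:

```python
def max_vec_size(base_address: int, innermost_extent: int,
                 dtype_bytes: int, max_vec_bytes: int = 16) -> int:
    """Largest power-of-two vectorization factor (in elements) a contiguous
    tensor supports, limited by pointer alignment and row extent.

    Requiring v to divide the innermost extent also keeps every row start
    aligned once the base pointer is aligned, since consecutive rows are
    laid out back to back.
    """
    v = max_vec_bytes // dtype_bytes  # hardware limit: 16B accesses
    while v > 1:
        if (innermost_extent % v == 0
                and base_address % (v * dtype_bytes) == 0):
            return v
        v //= 2
    return 1
```

For instance, a half-precision tensor with K=513 (as in the misaligned-rows example later in this thread) gets factor 1 because the odd row length breaks alignment on every other row, while a bias whose base pointer is only 8-byte aligned caps out at 4 elements even when its extent would allow 8.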
Question: suppose we have an HSH matmul and would like to use vec_size=8 for the output, but the bias is only contiguous with vec_size=4. Would we prefer to set the output to vec_size=4, or vectorize the bias with 4 with an unrolled length-2 outer loop and leave the output with vec_size=8?
> We inline loads of `C`-role tensors (e.g. bias) in a loop with the epilogue computation and writes to the output, and we propagate transforms and parallelization such that they will share a vectorization factor. So the analysis of runtime inputs that we need is to find the maximum supported vectorization factor of the `C`-role inputs, not to exceed the vectorization factor of the output.
Yes, but I think the issue is more than that. Besides the issue you mentioned, we also have https://github.com/NVIDIA/Fuser/blob/65c5b6ea3fe4589a56dc3d49babe2b7591b55d98/csrc/scheduler/matmul.cpp#L554-L561
which hardcodes the gmem loads of A and B to a vectorization factor of 8. This needs to be modified too.
> Question: suppose we have an HSH matmul and would like to use vec_size=8 for the output, but the bias is only contiguous with vec_size=4. Would we prefer to set the output to vec_size=4, or vectorize the bias with 4 with an unrolled length-2 outer loop and leave the output with vec_size=8?
Good question, and I don't know the answer. My suggestion: how often do these cases happen in the real world? If they are rare, then whatever takes the least amount of effort to make it functionally correct is the way to go. If both options have similar difficulty, then just randomly pick one.
> which hardcodes the gmem loads of A and B to a vectorization factor of 8. This needs to be modified too.
Ah right of course. So the point would be to make it so that we could safely run with any shape and alignment of inputs, including operands and bias-like inputs. Makes sense.
> If both options have similar difficulty, then just randomly pick one.
I would imagine it's easier to just use the same minimum vec size for both bias and output. Then again this estimate of difficulty is somewhat random, so this might count as randomly picking an approach :rofl:
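The two options can be stated as a tiny planning function, just to pin down the trade-off (a toy model with hypothetical names, not anything in the scheduler):

```python
def epilogue_plan(output_vec: int, bias_vec: int, use_shared_min: bool) -> dict:
    """Compare the two options from the discussion above.

    Option A (use_shared_min=True): clamp everything to the minimum of the
    individually supported factors, so bias and output share one vec size.
    Option B: keep the output at its own factor and cover each output
    vector with an unrolled run of smaller bias loads.
    """
    if use_shared_min:
        v = min(output_vec, bias_vec)
        return {"output_vec": v, "bias_vec": v, "bias_unroll": 1}
    # Option B only works when the smaller factor tiles the larger one.
    assert output_vec % bias_vec == 0
    return {"output_vec": output_vec, "bias_vec": bias_vec,
            "bias_unroll": output_vec // bias_vec}
```

In the vec_size=8 output / vec_size=4 bias example, option A drops the output to 4, while option B keeps the output at 8 and issues the bias as two unrolled vec-4 loads.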
Next question: I am playing in a branch with a test having misaligned rows, e.g. K=513. This leads to a supported vectorization of 1 for A and B. In cases like this, when we do not vectorize to 16 bytes, we cannot use `cp.async.cg`[^1][^2]. We could use `cp.async.ca` if there are at least 4 bytes by setting `CacheOp::AllLevels` instead of `CacheOp::Global`, but in the case of `vec_size=1` we have only 2 bytes. In that case I suppose we might want to disable circular buffering, although we could still double buffer with `async_gmem_load_operands = false`. What do you think about this scenario @zasdfgbnm?

References:
[^1]: cp.async docs
[^2]: nvfuser validation of cp-size
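For context on the constraint being discussed: per the PTX ISA, `cp.async` can only copy 4, 8, or 16 bytes, and the `.cg` (L2-only caching) variant supports only 16 bytes, while `.ca` (cache at all levels) supports all three sizes. The resulting decision can be sketched as follows; the function name is made up, and nvFuser's actual validation lives behind the links above:

```python
def async_copy_choice(vec_size: int, dtype_bytes: int) -> str:
    """Pick a gmem->smem copy strategy for one vectorized load.

    PTX constraints: cp.async copy size must be 4, 8, or 16 bytes;
    cp.async.cg (CacheOp::Global) supports only 16 bytes, while
    cp.async.ca (CacheOp::AllLevels) supports 4, 8, and 16.
    """
    nbytes = vec_size * dtype_bytes
    if nbytes == 16:
        return "cp.async.cg"   # full 16B copy: the preferred path
    if nbytes in (4, 8):
        return "cp.async.ca"   # smaller copies must cache at all levels
    # e.g. vec_size=1 with fp16 -> 2 bytes: cp.async is unusable; fall
    # back to a synchronous load path (async_gmem_load_operands = false).
    return "sync-fallback"
```

With half-precision operands this reproduces the cases in the question: vec_size=8 gives `cp.async.cg`, vec_size=2 or 4 gives `cp.async.ca`, and vec_size=1 forces the synchronous fallback.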
> Next question: I am playing in a branch with a test having misaligned rows, e.g. K=513. This leads to a supported vectorization of 1 for A and B. In cases like this, when we do not vectorize to 16 bytes, we cannot use `cp.async.cg`. We could use `cp.async.ca` if there are at least 4 bytes by setting `CacheOp::AllLevels` instead of `CacheOp::Global`, but in the case of `vec_size=1` we have only 2 bytes. In that case I suppose we might want to disable circular buffering, although we could still double buffer with `async_gmem_load_operands = false`. What do you think about this scenario @zasdfgbnm?
Makes sense to me. I cannot think of a better solution.
Reopening this, because the epilogue is not done yet.
#682: Issue related to vectorization
@drzejan2 had a PR: https://github.com/NVIDIA/Fuser/pull/807