Closed naoyam closed 1 year ago
The low performance on the current main branch is expected, as the redundant casts are not removed. I checked the perf using Jie's branch #355. All looks good except for the one with a hidden size of 5140.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/768/manual_time 26.7 us 108 us 519 bytes_per_second=876.955G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 1 / vectorize / factor 8/Launch_Parameters[block(1/1/96)/grid(1/1/8192)/384]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/1024/manual_time 32.8 us 117 us 423 bytes_per_second=952.711G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 1 / vectorize / factor 8/Launch_Parameters[block(1/1/128)/grid(1/1/8192)/512]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/1536/manual_time 45.5 us 130 us 304 bytes_per_second=1030.3G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 2 / vectorize / factor 8/Launch_Parameters[block(1/1/96)/grid(1/1/8192)/384]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/1280/manual_time 44.3 us 127 us 313 bytes_per_second=881.201G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 2 / vectorize / factor 8/Launch_Parameters[block(1/1/96)/grid(1/1/8192)/384]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/1600/manual_time 55.4 us 142 us 251 bytes_per_second=882.336G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 2 / vectorize / factor 8/Launch_Parameters[block(1/1/128)/grid(1/1/8192)/512]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/2048/manual_time 58.0 us 146 us 241 bytes_per_second=1078.75G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 2 / vectorize / factor 8/Launch_Parameters[block(1/1/128)/grid(1/1/8192)/512]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/2560/manual_time 77.5 us 156 us 180 bytes_per_second=1008.31G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 3 / vectorize / factor 8/Launch_Parameters[block(1/1/128)/grid(1/1/8192)/512]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/4096/manual_time 105 us 194 us 133 bytes_per_second=1.1619T/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 4 / vectorize / factor 8/Launch_Parameters[block(1/1/128)/grid(1/1/8192)/512]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/5140/manual_time 304 us 385 us 46 bytes_per_second=515.708G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: unroll / factor 2 // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 5 / vectorize / factor 4/Launch_Parameters[block(1/1/288)/grid(1/1/4096)/1152]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/12288/manual_time 328 us 410 us 42 bytes_per_second=1.11537T/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 6 / vectorize / factor 8/Launch_Parameters[block(1/1/256)/grid(1/1/8192)/1024]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/16384/manual_time 514 us 602 us 27 bytes_per_second=973.484G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 8 / vectorize / factor 8/Launch_Parameters[block(1/1/256)/grid(1/1/8192)/1024]
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/20480/manual_time 672 us 747 us 21 bytes_per_second=929.899G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 10 / vectorize / factor 8/Launch_Parameters[block(1/1/256)/grid(1/1/8192)/1024]
The perf of 5140 is low because:
(1) It is vectorized by 4 while the other cases are vectorized by 8. The kernel uses max_unroll = 16 / max_input_dtype_size = 8, and the additional factor of 2 is put into iter_unroll_factor. This iter_unroll_factor increases the persistent buffer size, and the extra code to process the unroll leads to high register usage; e.g., it needs 155 registers (if I run with PYTORCH_NVFUSER_MAX_REG_COUNT=255) while the estimated register count is 80.
(2) It is not a multiple of the warp size, so bdimx is padded.
If I set iter_unroll_factor = 1, the bandwidth increases from 515.708G/s to 806.959G/s. So we should probably stop using iter_unroll_factor and extend the use of multiple_reds_per_blk to take care of cases where we have a large iteration domain but a small reduction domain. Working on a PR.
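For reference, a rough sketch of how the quoted numbers arise. This is not the actual nvFuser heuristic code; `split_unroll` is an invented name, and the real scheduler considers more constraints. It only illustrates the stated rule: max_unroll = 16 / max_input_dtype_size, and whatever cannot go into vectorization spills into iter_unroll_factor.

```python
def split_unroll(max_input_dtype_size, vectorize_factor):
    # fp16 inputs: max_unroll = 16 / 2 = 8 elements per thread per iteration
    max_unroll = 16 // max_input_dtype_size
    # the part of max_unroll not absorbed by vectorization becomes
    # an unroll factor on the iteration domain
    iter_unroll_factor = max_unroll // vectorize_factor
    return max_unroll, iter_unroll_factor

# hidden size 5140 = 4 * 1285 is only vectorizable by 4, leaving a factor of 2
assert split_unroll(2, 4) == (8, 2)
# the other hidden sizes vectorize by 8, leaving no iteration-domain unroll
assert split_unroll(2, 8) == (8, 1)
```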
Didn't you also find some issue with selection of persistent buffers? Can't understand why the castOp itself could make such a large difference. Have you already filed an issue?
Do you have any idea why the register usage is so much higher than the estimate? Using a different scheduling heuristic may be a way to go, but before making a decision, we should at least try to understand why.
Didn't you also find some issue with selection of persistent buffers? Can't understand why the castOp itself could make such a large difference. Have you already filed an issue?
see https://github.com/NVIDIA/Fuser/issues/343
Do you have any idea why the register usage is so much higher than the estimate? Using a different scheduling heuristic may be a way to go, but before making a decision, we should at least try to understand why.
Because iter_unroll_factor = 2 means there is an additional unrolled loop on top of the vectorization and the persistent batch, the registers to store the buffers, plus the overhead, are doubled if the loop is successfully unrolled. This explains why the used register count is 155 while the estimate is 80.
Doesn't that mean we should also fix how to estimate register usage?
I looked at the generated code of the 5140 case. A couple of things I noticed:
Here's the initial read of the input tensor:
if ((((((i185 * 16) + i546) + 3) < T0.size[0]) && ((1 + i3549) < T1.size[0]))) {
  Array<__half, 40, 4> T42;
  #pragma unroll
  for(nvfuser_index_t i396 = 0; i396 < 2; ++i396) {
    int i487;
    i487 = 20 * i396;
    #pragma unroll
    for(nvfuser_index_t i397 = 0; i397 < 5; ++i397) {
      T42.set(__half(0));
    }
  }
  NVFUSER_UPDATE_MAGIC_ZERO;
  #pragma unroll
  for(nvfuser_index_t i396 = 0; i396 < 2; ++i396) {
    int i551;
    i551 = i549 + (T0.size[0] * i396);
    int i577;
    i577 = 20 * i396;
    #pragma unroll
    for(nvfuser_index_t i397 = 0; i397 < 5; ++i397) {
      loadGlobalToLocal<__half, 4, false>(&T42[(i577 + (4 * i397))], &T1[(i551 + (i552 * (i397 + nvfuser_zero)))]);
    }
  }
  NVFUSER_UPDATE_MAGIC_ZERO;
  // Alias Allocation - register
  auto& T46 = T42;
  #pragma unroll
  for(nvfuser_index_t i435 = 0; i435 < 2; ++i435) {
    int i742;
I wonder why T42 is not inlined with the i435 loop. It seems the kernel reads the whole input tensor once, does the normalization, and then writes the whole output. I suppose T42 is not a persistent buffer; it's merely created due to the computeAt position. It would be interesting to try inlining it with the computation loop.
I don't think this is a scheduling issue, but there seem to be two identical sum reductions of T15, likely because the input fusion has sum and variance. Can this benchmark use Welford instead?
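For context, the textbook single-pass Welford algorithm computes mean and variance together, so a Welford-based reduction would replace the two separate sum-style reductions over T15. This is a minimal reference sketch of the algorithm itself, not of nvFuser's WelfordOp implementation:

```python
def welford(xs):
    # Single-pass Welford update: one running mean and M2 accumulator
    # replace separate reductions for the mean and the variance.
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2 / count  # population variance, as layer norm uses

mean, var = welford([1.0, 2.0, 3.0, 4.0])
assert mean == 2.5 and var == 1.25
```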
Assuming T42 is not a persistent buffer, the other relatively large buffers are:
__half T14[20];
float T15[20];
These are inside the unswitch loop of size 2, so the compiler may double their size, which should still result in only 60 registers, meaning the additional "overhead" is nearly 100 registers, which seems quite high.
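The 60-register figure above follows from simple byte counting, assuming one 32-bit register per 4 bytes of buffer:

```python
BYTES_PER_REG = 4  # one 32-bit register

def regs(elem_bytes, n_elems):
    # registers needed to hold an array of n_elems elements
    return elem_bytes * n_elems // BYTES_PER_REG

# __half T14[20] and float T15[20] from the generated kernel
t14 = regs(2, 20)  # 10 registers
t15 = regs(4, 20)  # 20 registers
# doubled by the size-2 unswitch loop -> 60 registers for the buffers
assert (t14 + t15) * 2 == 60
```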
Doesn't that mean we should also fix how to estimate register usage?
I think we don't need to bother modifying the current register usage estimation to account for iter_unroll_factor. Instead, we can abandon iter_unroll_factor for persistent kernels: it uses too many registers, which leads to low occupancy and slow kernels. It is also not efficient even for small cases not bounded by register usage; see #386.
As a solution, we can use multiple_reds_per_blk, as it uses fewer registers and leads to faster kernels.
Have you actually fixed the estimate and compared the results?
Have you actually fixed the estimate and compared the results?
On the main branch with iter_unroll_factor = 2 generated by the heuristics, run with PYTORCH_NVFUSER_MAX_REG_COUNT=255 to remove register spills:
ptxas info : 307 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 157 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.715776 ms, achieved: 235.365 GB/s
It is slower than the original version with register spills:
ptxas info : 307 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 256 bytes stack frame, 248 bytes spill stores, 296 bytes spill loads
ptxas info : Used 72 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.339968 ms, achieved: 495.543 GB/s
Hmm, I don't get the same result as yours. Which version of the toolkit are you using?
My branch is:
commit ad96bd57f3733800fc1998935caa4adb7180d6bf (HEAD -> main, origin/main, origin/HEAD)
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Date: Tue May 23 14:56:57 2023 -0700
Allocation domain `SchedulerRuntimeInfo` fix (#392)
I'm using CUDA 11.8, and the 5140 case by default gives:
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 208 bytes stack frame, 810 bytes spill stores, 924 bytes spill loads
ptxas info : Used 72 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/5140/manual_time 433 us 507 us 1621 bytes_per_second=362.589G/s Red On Fastest Dim // Persistent Kernel // // Iteration Domain: unroll / factor 2 // Inner Reduction Domain: cross block reduction / pad to warp / persistent batch - 5 / vectorize / factor 4/Launch_Parameters[block(1/1/288)/grid(1/1/4096)/1152]
Using the max reg environment variable results in an error:
PYTORCH_NVFUSER_MAX_REG_COUNT=255 PYTORCH_NVFUSER_DUMP=ptxas_verbose ./bin/nvfuser_bench --benchmark_filter="NvFuserScheduler_LayerNormFusedOp_fp16___GRAPH/NvFuserScheduler_LayerNormFusedOp_fp16/8192/5140/*"
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16___GRAPH/NvFuserScheduler_TIMM_BatchNorm_nhwc_BWD_fp16 will be repeated at least 120 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. NvFuserScheduler_TIMM_LayerNorm_fp16___GRAPH/NvFuserScheduler_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 270 times.
The number of inputs is very large. Baseline_TIMM_LayerNorm_fp16 will be repeated at least 120 times.
2023-05-23T17:28:50-07:00
Running ./bin/nvfuser_bench
Run on (64 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x32)
L1 Instruction 32 KiB (x32)
L2 Unified 512 KiB (x32)
L3 Unified 16384 KiB (x8)
Load Average: 0.08, 0.12, 0.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
__tmp_kernel1.cu(9571): warning #550-D: variable "i487" was set but never used
__tmp_kernel1.cu(9818): warning #550-D: variable "i1827" was set but never used
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 182 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
zsh: abort (core dumped) PYTORCH_NVFUSER_MAX_REG_COUNT=255 PYTORCH_NVFUSER_DUMP=ptxas_verbose
Initially (first post, 4 days ago), I used Jie's branch https://github.com/NVIDIA/Fuser/pull/355.
In yesterday's post, I retested on the main branch using the case FusionLayerNormFusedOpsRedundantCast_CUDA, which needs the batch size and hidden size modified, and the redundant cast manually removed.
Please post what you actually measured.
What I did on nvdl-a112-d001, using our nightly build pulled on 05/22/23:
(1) code base:
commit ad96bd57f3733800fc1998935caa4adb7180d6bf (HEAD -> main, origin/main, origin/HEAD)
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Date: Tue May 23 14:56:57 2023 -0700
Allocation domain `SchedulerRuntimeInfo` fix (#392)
(2) Test case:
TEST_F(NVFuserTest, FusionLayerNormFusedOpsRedundantCast_CUDA) {
  std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
  auto fusion = fusion_ptr.get();
  FusionGuard fg(fusion);
  const float kEps = 1e-5;
  const int batch_size = 8192;
  const int hidden_size = 5140;
  {
    DataType dtype = DataType::Half;
    auto tv0 = makeContigTensor(1, dtype);
    auto tv1 = makeContigTensor(2, dtype);
    auto tv2 = makeContigTensor(1, dtype);
    auto tv3 = makeContigTensor(1, dtype);
    auto tv4 = makeContigTensor(1, dtype);
    fusion->addInput(tv0);
    fusion->addInput(tv1);
    fusion->addInput(tv2);
    fusion->addInput(tv3);
    fusion->addInput(tv4);
    auto tv5 = broadcast(tv0, {true, false});
    auto tv6 = castOp(DataType::Float, tv1);
    auto tv7 = castOp(DataType::Float, tv5);
    auto tv8 = add(tv6, tv7);
    auto tv9 = castOp(DataType::Half, tv8);
    auto tv10 = broadcast(tv2, {true, false});
    auto tv11 = castOp(DataType::Float, tv9);
    auto tv12 = castOp(DataType::Float, tv10);
    auto tv13 = add(tv11, tv12);
    // auto tv14 = castOp(DataType::Half, tv13);
    // auto tv15 = castOp(DataType::Float, tv14);
    auto tv16 = variance(tv13, {1}, false, false);
    auto tv17 = broadcast(tv16, {false, true});
    auto tv18 = sum(tv13, {1}, false);
    auto tv19 = broadcast(tv18, {false, true});
    nvfuser::Val* num_features =
        IrBuilder::create<Double>(1, dtype = DataType::Double);
    num_features = mul(num_features, tv0->getLeafDomain()[0]->extent());
    auto s20 = num_features;
    auto s21 = reciprocal(s20);
    auto tv22 = mul(tv19, s21);
    auto s23 = IrBuilder::create<Double>(kEps, dtype = DataType::Double);
    auto tv24 = add(tv17, s23);
    auto tv25 = rsqrt(tv24);
    auto tv26 = broadcast(tv22, {false, false});
    // auto tv27 = castOp(DataType::Float, tv14);
    auto tv28 = sub(tv13, tv26);
    auto tv29 = broadcast(tv25, {false, false});
    auto tv30 = mul(tv28, tv29);
    auto tv31 = broadcast(tv4, {true, false});
    auto tv32 = castOp(DataType::Float, tv31);
    auto tv33 = mul(tv30, tv32);
    auto tv34 = broadcast(tv3, {true, false});
    auto tv35 = castOp(DataType::Float, tv34);
    auto tv36 = add(tv33, tv35);
    auto tv37 = castOp(DataType::Half, tv36);
    fusion->addOutput(tv37);
  }
  auto options = at::TensorOptions().dtype(at::kHalf).device(at::kCUDA, 0);
  std::vector<c10::IValue> inputs;
  std::vector<at::Tensor> outputs;
  {
    auto t0 = at::randn({hidden_size}, options);
    auto t1 = at::randn({batch_size, hidden_size}, options);
    auto t2 = at::randn({hidden_size}, options);
    auto t3 = at::randn({hidden_size}, options);
    auto t4 = at::randn({hidden_size}, options);
    inputs.emplace_back(t0);
    inputs.emplace_back(t1);
    inputs.emplace_back(t2);
    inputs.emplace_back(t3);
    inputs.emplace_back(t4);
    auto t5 = t0.unsqueeze(0).expand({batch_size, hidden_size});
    auto t6 = t1.to(at::kFloat);
    auto t7 = t5.to(at::kFloat);
    auto t8 = at::add(t6, t7);
    auto t9 = t8.to(at::kHalf);
    auto t10 = t2.unsqueeze(0).expand({batch_size, hidden_size});
    auto t11 = t9.to(at::kFloat);
    auto t12 = t10.to(at::kFloat);
    auto t13 = at::add(t11, t12);
    auto t14 = t13.to(at::kHalf);
    auto aten_outputs = at::native_layer_norm(t14, {hidden_size}, t4, t3, kEps);
    auto t33 = std::get<0>(aten_outputs);
    outputs.emplace_back(t33);
  }
  FusionExecutorCache fec(std::move(fusion_ptr));
  auto cg_outputs = fec.runFusionWithInputs(inputs);
  testValidate(fusion, cg_outputs, inputs, outputs, __LINE__, __FILE__);
}
(3) Result with PYTORCH_NVFUSER_DUMP=ptxas_verbose,dump_eff_bandwidth,launch_param ./nvfuser_tests --gtest_filter=NVFuserTest.FusionLayerNormFusedOpsRedundantCast_CUDA
ptxas info : 307 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 256 bytes stack frame, 248 bytes spill stores, 296 bytes spill loads
ptxas info : Used 72 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.339968 ms, achieved: 495.543 GB/s
(4) Result with PYTORCH_NVFUSER_MAX_REG_COUNT=255 PYTORCH_NVFUSER_DUMP=ptxas_verbose,dump_eff_bandwidth,launch_param ./nvfuser_tests --gtest_filter=NVFuserTest.FusionLayerNormFusedOpsRedundantCast_CUDA
ptxas info : 307 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 157 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.715776 ms, achieved: 235.365 GB/s
I was able to reproduce your results.
Any idea why the 255 case is so much slower than the default case?
I'm looking into the fusion. There's only one persistent buffer:
float T13[20];
But still the register usage looks terrible:
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 248 bytes stack frame, 236 bytes spill stores, 280 bytes spill loads
ptxas info : Used 72 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.321536 ms, achieved: 523.95 GB/s
Looks like it at least requires 130 registers:
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 130 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.726016 ms, achieved: 232.045 GB/s
With iter_unroll_factor disabled, the register usage decreases significantly:
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 45 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 8192, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.200704 ms, achieved: 839.389 GB/s
The last case only uses 45 registers, which seems pretty good considering there's a buffer of float[20]. With iter_unroll_factor == 2, the register count jumps to 130, which seems too high.
I was able to reproduce your results.
Any idea why the 255 case is so much slower than the default case?
Maybe due to low occupancy: the 255 case can only have 1 active block per SM, while the original case with 72 registers can have 3 blocks per SM.
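The occupancy claim can be checked with rough arithmetic, assuming an A100-class SM with a 64K-entry register file and the 288-thread blocks from the launch parameters (ptxas rounds allocations to a granularity, which this sketch ignores):

```python
REGS_PER_SM = 65536       # 32-bit registers per SM on sm_80
THREADS_PER_BLOCK = 288   # BlockDim.x from the launch parameters

def blocks_per_sm(regs_per_thread):
    # upper bound on resident blocks imposed by register usage alone
    return REGS_PER_SM // (regs_per_thread * THREADS_PER_BLOCK)

assert blocks_per_sm(157) == 1  # PYTORCH_NVFUSER_MAX_REG_COUNT=255 build
assert blocks_per_sm(72) == 3   # default build (with spills)
```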
One reason seems to be this unswitch:
https://github.com/NVIDIA/Fuser/blob/main/csrc/scheduler/reduction_utils.cpp#L256
Unswitching the iteration domain means unswitching the whole reduction domain, which doesn't work well unless the extent is completely evenly divisible, which is not the case with this problem size. So, aside from the register usage, it is understandable that the efficiency gets reduced.
After disabling it, here's what I got:
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 40 bytes stack frame, 68 bytes spill stores, 148 bytes spill loads
ptxas info : Used 72 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.207872 ms, achieved: 810.444 GB/s
Interestingly, the register usage is also reduced significantly. Looks like 88 registers are enough for this kernel.
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 88 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.246784 ms, achieved: 682.656 GB/s
It seems to align well with the no-iter-unroll case, which uses 45 registers. With iter_unroll_factor == 2, the persistent buffer size is effectively doubled, adding another 20 registers. Moreover, since we unroll the iteration domain, the input cache is allocated as:
Array<__half, 40, 4> T39;
That's another 20 registers. Overall, 45 + 20 + 20 = 85, which is pretty close to the actual usage, which is good. However, it also means there is almost no benefit to increasing the per-thread work, as the register usage efficiency is virtually unchanged.
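The 45 + 20 + 20 accounting spelled out, again counting one 32-bit register per 4 bytes of buffer:

```python
base = 45                      # registers of the no-iter-unroll kernel
extra_persistent = 4 * 20 // 4 # second copy of the float[20] persistent buffer
input_cache = 2 * 40 // 4      # Array<__half, 40, 4>: 40 halves = 80 bytes

# close to the 88 registers ptxas actually reports for this configuration
assert base + extra_persistent + input_cache == 85
```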
The occupancy is 2 blocks per SM with this configuration, whereas it's 4 with no iter unroll. Since the former has 2x more work per thread than the latter, it seems our generated code is not able to exploit instruction-level parallelism well. Here's what ncu shows about the number of ready warps with iter unroll:
Eligible Warps Per Scheduler [warp] 0.71
Here's the number of ready warps with no iter unroll:
Eligible Warps Per Scheduler [warp] 1.29
This is disappointing. I suspect the input reading and the main computation loop are not overlapped well, whereas in the latter config, the warp scheduler should automatically switch between active warps.
I also tried to disable unrolling the iter domain while still splitting out by a factor of 2 for each thread. That resulted in reduced register usage, but the performance wasn't improved much as the stall stat was still as bad as the unroll case:
ptxas info : 3 bytes gmem
ptxas info : Compiling entry function '_ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_' for 'sm_80'
ptxas info : Function properties for _ZN11CudaCodeGen7kernel1ENS_6TensorINS_6__halfELi1ELi1EEENS0_IS1_Li2ELi2EEES2_S2_S2_S3_
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 68 registers, 16 bytes smem, 464 bytes cmem[0], 8 bytes cmem[2]
Launch Parameters: BlockDim.x = 288, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 4096, GridDim.y = -1, GridDim.z = -1, Smem Size = 1152
kernel1 run in 0.234496 ms, achieved: 718.429 GB/s
Overall, in this case, using iter unroll does not improve the register efficiency and fails to extract ILP, which explains the gap between 682 GB/s and 830 GB/s. As shown above, limiting the register usage to 72 increases the occupancy to 3 blocks, which brings the performance almost on par with the no-unroll case; however, it is clear the no-unroll approach makes the most sense.
So, I think we can say this removal makes sense. Another thing we may want to change is removing this unswitch. I think I found it didn't work well for outer grid normalization, which is why it has !is_outer_grid_persistence. I'll check if it should be globally disabled.
Thanks for the further investigation. What is "completely evenly divisible"? Does it mean elements = iter_unroll_factor^N?
I think the conclusions and actions we need to follow are:
(1) removing iter_unroll_factor in #386 is good
(2) check other schedulers using iter_unroll_factor
(3) if the scheduler needs iter_unroll_factor, check unswitch
Anything else?
What is "completely evenly divisible"? Does it mean elements = iter_unroll_factor^N?
No. The split transformations need to be divisible. Otherwise, the unswitch condition would be false for some threads, and due to the BSP nature, all threads would be blocked by the threads taking the slow path. Look at the actual generated code.
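To make the divisibility concrete, here is a rough illustration with the 5140 case (this mirrors the scheduler's split factors only approximately, not its exact transformation order): after vectorizing by 4, the remaining extent does not divide evenly by the 288 threads, so the unswitch predicate is false for the tail elements and blocks serialize through the predicated slow path.

```python
import math

extent = 5140 // 4  # 1285 elements left per row after vectorization by 4
bdimx = 288         # threads along the reduction dimension

# ceilDiv gives the persistent batch, but the split is not divisible:
batches = math.ceil(extent / bdimx)
assert batches == 5
assert bdimx * batches != extent     # 1440 != 1285
assert extent % bdimx != 0           # -> unswitch predicate fails for the tail
```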
I think the conclusions and actions we need to follow is: (1) remvoe
iter_unroll_factor
in #386 is good (2) Check other schedulers usingiter_unroll_factor
(3) if the scheduler needsiter_unroll_factor
checkunswitch
anything else?
Yes. I'll take care of (2) and (3)
This is on a pytorch-A100 node:
The last four cases seem slow; in particular, LayerNormFusedOp_fp16/8192/16384 is only 287 GB/s. Is this expected? Can we improve the performance by adjusting heuristics? Here's the ptxas info for the case: