Open protonu opened 6 days ago
Observations:
On running getBankConflictInfo
, I see both cases (mma with and without a stmatrix) have zero bank conflicts. The codes I used was:
KernelExecutor ke;
auto launch_constraints = LaunchParams();
ke.compile(
&fusion,
{inputs.first, inputs.second},
launch_constraints,
matmul_cparams);
auto bank_conflict_info =
getBankConflictInfo(ke.kernel(), launch_constraints);
if (bank_conflict_info.empty()) {
debug() << "===== No bank confliction =====" << std::endl;
} else {
debug() << "======= Bank confliction =======" << std::endl;
for (auto info : bank_conflict_info) {
debug() << "Expr: " << info.first->toString() << std::endl;
auto conflict = info.second;
if (conflict.first > 1) {
debug() << "input conflict: " << conflict.first << " way, ";
}
if (conflict.second > 1) {
debug() << "output conflict: " << conflict.second << " way";
}
debug() << std::endl;
}
debug() << "================================" << std::endl;
}
fusion.printKernel();
Performance sees a minor degradation:
Without stmatrix
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
36.8 156830 1 156830.0 156830.0 156830 156830 0.0 <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
23.1 98654 1 98654.0 98654.0 98654 98654 0.0 nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
Where stmatrix
was about 63%
With stmatrix
this falls to about 59-60%
TODO: Use nsight to look at bank conflict data.
This demonstrates the use of stmatrix in a multi-tile hopper matmul.