Open colleeneb opened 2 months ago
Hi,
I'm currently investigating this.
Preliminary benchmarking suggests that occupancy is lower due to high register pressure. Forcing controlled spilling via __launch_bounds__ gives a minor improvement, but throughput is still in the range of 21~22 TFLOP/s.
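As a rough illustration of how register pressure caps occupancy, here's a back-of-the-envelope sketch. The per-thread register count and block size below are hypothetical placeholders, not figures measured from miniBUDE; the 65,536 registers per SM applies to both A100 and H100:

```python
# Back-of-the-envelope occupancy limit from register pressure.
# Assumptions (illustrative only, NOT measured from miniBUDE):
#   - 65,536 32-bit registers per SM (A100 and H100 alike)
#   - a hypothetical kernel using 128 registers per thread
#   - 256 threads per block, 2048 max resident threads per SM

REGS_PER_SM = 65_536
regs_per_thread = 128        # hypothetical; check real usage with nvcc --ptxas-options=-v
threads_per_block = 256
max_threads_per_sm = 2048

regs_per_block = regs_per_thread * threads_per_block   # registers one block needs
blocks_per_sm = REGS_PER_SM // regs_per_block          # blocks that fit per SM
resident_threads = blocks_per_sm * threads_per_block   # threads actually resident
occupancy = resident_threads / max_threads_per_sm

print(f"blocks/SM: {blocks_per_sm}, occupancy: {occupancy:.0%}")
```

Capping registers with __launch_bounds__ trades some of this occupancy loss for spills to local memory, which is why it only buys a minor improvement here.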
Now, my understanding for H100 is that it has a higher SM count but the register file size per SM is the same as A100.
A brief look at the core frequency improvement from A100 to H100 gives 1665 MHz (SXM 96GB) / 1275 MHz (SXM 80GB) = 1.3,
which seems to be in the ballpark if we factor in the additional SMs running at a lower occupancy.
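To make that ballpark concrete, the two scaling factors can be computed directly. The clock figures are the ones quoted above; the SM counts (108 for A100 SXM, 132 for H100 SXM) are assumptions added for illustration:

```python
# Rough scaling factors going from A100 to H100.
# Clocks are the figures quoted in this thread; SM counts are
# assumed values for the SXM parts, added for illustration.
a100_clock_mhz, h100_clock_mhz = 1275, 1665
a100_sms, h100_sms = 108, 132

clock_ratio = h100_clock_mhz / a100_clock_mhz
sm_ratio = h100_sms / a100_sms

print(f"clock ratio: {clock_ratio:.2f}, SM ratio: {sm_ratio:.2f}")
```

If occupancy roughly halves on H100, the extra SMs and higher clock together would land in the observed range rather than at the theoretical peak.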
Memory improvements are negligible, as we're talking about single-digit MBs worth of input and KBs worth of writes at the end.
CC @tomdeakin @addy419
Please do let me know if you're able to get much higher than the original figure.
For reference, see https://resources.nvidia.com/en-us-tensor-core, Table 4 on page 41: "Ratio of SM Registers to FP32 Cores".
Thanks a lot for taking a look!
If I understand correctly, the issue is that the H100 becomes register-bound due to fewer registers per core (Table 4 from the link), so occupancy is lower and we can't hit the same fraction of peak as before.
Yes, NVIDIA doubled the FP32 unit count but the register file size remained the same, which is where I suspect the bottleneck lies. We're currently trying to set something up with NVIDIA to see if they can take a look.
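The Table 4 ratio can be reproduced from the whitepaper figures: both parts have 65,536 32-bit registers per SM, while the FP32 core count per SM doubled from 64 (A100) to 128 (H100). A quick sketch:

```python
# Registers available per FP32 core (the Table 4 ratio referenced above).
REGS_PER_SM = 65_536                     # 32-bit registers per SM, A100 and H100
FP32_CORES = {"A100": 64, "H100": 128}   # FP32 cores per SM; H100 doubled them

regs_per_core = {gpu: REGS_PER_SM // cores for gpu, cores in FP32_CORES.items()}
for gpu, r in regs_per_core.items():
    print(f"{gpu}: {r} registers per FP32 core")
```

Halving the registers per core means a register-heavy kernel can feed only half as many of H100's FP32 units concurrently, which is consistent with the observed shortfall.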
Hello,
We were comparing the percentage of FP32 peak we get on H100 and A100 for miniBUDE. With the big5 input (and similarly for the bm_long input) we've been seeing results like:

We were expecting a similar percentage of peak on both A100 and H100 if the input size was big enough to saturate the GPU. Our best guess right now is that the big5/bm_long inputs aren't big enough to saturate the H100, as it's much larger than the A100. Does this match your understanding? If so, are there any suggestions for a bigger input?

Thanks!
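One way to sanity-check the saturation hypothesis is to compare the number of independent work items in the input against how many threads each GPU can keep resident at once. The SM counts and the 2048-threads-per-SM limit below are assumptions for the SXM parts, not figures from this thread, and at reduced occupancy the H100 would need correspondingly more work (ideally several full waves) to stay busy:

```python
# Resident-thread capacity as a crude saturation threshold.
# SM counts and the per-SM thread limit are assumptions for the
# SXM parts, used only to size the comparison.
MAX_THREADS_PER_SM = 2048
SM_COUNT = {"A100": 108, "H100": 132}

resident = {gpu: sms * MAX_THREADS_PER_SM for gpu, sms in SM_COUNT.items()}
for gpu, t in resident.items():
    print(f"{gpu}: up to {t} resident threads")
```

If the input exposes fewer work items than the H100 figure (times a few waves), a larger deck/pose count would be needed before the two GPUs can be compared at the same fraction of peak.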