Open colleeneb opened 2 months ago
Hi,
I'm currently investigating this.
Preliminary benchmarking suggests that occupancy is lower due to high register pressure. Forcing controlled spilling via __launch_bounds__ gives a minor improvement, but throughput is still in the range of 21~22 TFLOP/s.
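As a rough illustration of how register pressure caps occupancy, here's a back-of-the-envelope sketch. The per-thread register count and block size below are hypothetical placeholders, not figures measured from miniBUDE; the 65,536 registers per SM applies to both A100 and H100:

```python
# Back-of-the-envelope occupancy limit from register pressure.
# Assumptions (illustrative only, NOT measured from miniBUDE):
#   - 65,536 32-bit registers per SM (A100 and H100 alike)
#   - a hypothetical kernel using 128 registers per thread
#   - 256 threads per block, 2048 max resident threads per SM

REGS_PER_SM = 65_536
regs_per_thread = 128        # hypothetical; check real usage with nvcc --ptxas-options=-v
threads_per_block = 256
max_threads_per_sm = 2048

regs_per_block = regs_per_thread * threads_per_block   # registers one block needs
blocks_per_sm = REGS_PER_SM // regs_per_block          # blocks that fit per SM
resident_threads = blocks_per_sm * threads_per_block   # threads actually resident
occupancy = resident_threads / max_threads_per_sm

print(f"blocks/SM: {blocks_per_sm}, occupancy: {occupancy:.0%}")
```

Capping registers with __launch_bounds__ trades some of this occupancy loss for spills to local memory, which is why it only buys a minor improvement here.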
Now, my understanding for H100 is that it has a higher SM count but the register file size per SM is the same as A100.
A brief look at the core frequency improvement from A100 to H100 gives 1665 MHz (SXM 96GB) / 1275 MHz (SXM 80GB) = 1.3,
which seems to be in the ballpark if we factor in the additional SMs running at a lower occupancy.
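To make that ballpark concrete, the two scaling factors can be computed directly. The clock figures are the ones quoted above; the SM counts (108 for A100 SXM, 132 for H100 SXM) are assumptions added for illustration:

```python
# Rough scaling factors going from A100 to H100.
# Clocks are the figures quoted in this thread; SM counts are
# assumed values for the SXM parts, added for illustration.
a100_clock_mhz, h100_clock_mhz = 1275, 1665
a100_sms, h100_sms = 108, 132

clock_ratio = h100_clock_mhz / a100_clock_mhz
sm_ratio = h100_sms / a100_sms

print(f"clock ratio: {clock_ratio:.2f}, SM ratio: {sm_ratio:.2f}")
```

If occupancy roughly halves on H100, the extra SMs and higher clock together would land in the observed range rather than at the theoretical peak.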
Memory improvements are negligible, as we're talking about single-digit MBs worth of input and KBs worth of writes at the end.
CC @tomdeakin @addy419
Please do let me know if you're able to get much higher than the original figure.
For reference, see https://resources.nvidia.com/en-us-tensor-core, Table 4 on page 41: "Ratio of SM Registers to FP32 Cores".
Thanks a lot for taking a look!
If I understand correctly, the issue is that the H100 becomes register-bound due to fewer registers per core (Table 4 from the link), so occupancy is lower and we can't hit the same fraction of peak as before.
Yes, NVIDIA doubled the FP32 unit count but the register file size remained the same, which is where I suspect the bottleneck lies. We're currently trying to set something up with NVIDIA to see if they can take a look.
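The Table 4 ratio can be reproduced from the whitepaper figures: both parts have 65,536 32-bit registers per SM, while the FP32 core count per SM doubled from 64 (A100) to 128 (H100). A quick sketch:

```python
# Registers available per FP32 core (the Table 4 ratio referenced above).
REGS_PER_SM = 65_536                     # 32-bit registers per SM, A100 and H100
FP32_CORES = {"A100": 64, "H100": 128}   # FP32 cores per SM; H100 doubled them

regs_per_core = {gpu: REGS_PER_SM // cores for gpu, cores in FP32_CORES.items()}
for gpu, r in regs_per_core.items():
    print(f"{gpu}: {r} registers per FP32 core")
```

Halving the registers per core means a register-heavy kernel can feed only half as many of H100's FP32 units concurrently, which is consistent with the observed shortfall.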
Hello,
We were comparing the percentage of FP32 peak we get on H100 and A100 for miniBUDE. With the big5 input (and similarly for the bm_long input) we've been seeing results like:

We were expecting a similar percentage of peak on both A100 and H100 if the input size was big enough to saturate the GPU. Our best guess right now is that the big5/bm_long inputs aren't big enough to saturate the H100, as it's much larger than the A100. Does this match your understanding? If so, are there any suggestions for a bigger input?

Thanks!
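One way to sanity-check the saturation hypothesis is to compare the number of independent work items in the input against how many threads each GPU can keep resident at once. The SM counts and the 2048-threads-per-SM limit below are assumptions for the SXM parts, not figures from this thread, and at reduced occupancy the H100 would need correspondingly more work (ideally several full waves) to stay busy:

```python
# Resident-thread capacity as a crude saturation threshold.
# SM counts and the per-SM thread limit are assumptions for the
# SXM parts, used only to size the comparison.
MAX_THREADS_PER_SM = 2048
SM_COUNT = {"A100": 108, "H100": 132}

resident = {gpu: sms * MAX_THREADS_PER_SM for gpu, sms in SM_COUNT.items()}
for gpu, t in resident.items():
    print(f"{gpu}: up to {t} resident threads")
```

If the input exposes fewer work items than the H100 figure (times a few waves), a larger deck/pose count would be needed before the two GPUs can be compared at the same fraction of peak.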