SovereignCloudStack / standards

SCS standards in a machine readable format
https://scs.community/
Creative Commons Attribution Share Alike 4.0 International

GPU-flavor-naming-refinement #546

Open cah-patrickthiem opened 3 months ago

cah-patrickthiem commented 3 months ago

This PR refines the GPU flavor naming. It clarifies the current naming convention and its description, and resolves several possible inconsistencies in both. To that end, it updates the document scs-0100-v3-flavor-naming.md. For reference, see issue 366 (GPU naming convention needs further refinements).

Note: The initial commit just added the flavor naming document in version 4.

cah-patrickthiem commented 1 month ago

After some research (see here and below), I came to the following conclusions: the GPU flavor naming should be changed so that it is clearer what to expect and, most importantly, we unfortunately cannot use a general performance indicator of the kind introduced in the prior flavor naming standard. Overall, the current standard has four major problems that I want to tackle here.

Derivation:

current standard:

Problem 1 - transparency

What GPU exactly is a GNa-14h?
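To make the question concrete, here is a minimal sketch of what such a name actually encodes. The field layout (pass-through/virtual marker, vendor letter, generation letter, compute unit count, optional `h`) is my reading of the current standard and should be treated as an assumption; `parse_gpu_suffix` and `GpuSuffix` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional


class GpuSuffix(NamedTuple):
    passthrough: bool            # assumed: "G" = pass-through, "g" = virtual GPU
    vendor: str                  # assumed: N = Nvidia, A = AMD, I = Intel
    generation: str              # generation letter(s), kept as written
    compute_units: Optional[int] # SMs (Nvidia) or CUs (AMD), if given
    high_performance: bool       # the optional "h" suffix


# Assumed grammar: [G|g] vendor generation [-units[h]]
_GPU_RE = re.compile(
    r"^(?P<kind>[Gg])(?P<vendor>[NAI])(?P<gen>[a-z0-9]+)"
    r"(?:-(?P<units>\d+)(?P<h>h)?)?$"
)

_VENDORS = {"N": "Nvidia", "A": "AMD", "I": "Intel"}


def parse_gpu_suffix(suffix: str) -> GpuSuffix:
    """Split a GPU flavor suffix such as 'GNa-14h' into its fields."""
    match = _GPU_RE.match(suffix)
    if match is None:
        raise ValueError(f"not a GPU flavor suffix: {suffix!r}")
    units = match.group("units")
    return GpuSuffix(
        passthrough=match.group("kind") == "G",
        vendor=_VENDORS[match.group("vendor")],
        generation=match.group("gen"),
        compute_units=int(units) if units is not None else None,
        high_performance=match.group("h") is not None,
    )


print(parse_gpu_suffix("GNa-14h"))
```

Every field the name carries is recovered, yet none of them is a model name: vendor, generation, a unit count and an "h" narrow the candidates down but do not tell the user which GPU they are getting.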

Solution of Problem 1

but there is more...

Problem 2 - inconsistency

for Nvidia examples, see here:

for AMD examples, see here:

Different Performance Benchmarks (for more details see here and here):

Conclusion on inconsistency: the H100 has significantly fewer SMs than the MI250(x) has CUs, yet it outperforms its AMD counterparts x times. It is therefore not consistent to assume a roughly linear, or otherwise predictable, relation between SM count, CU count, and performance.

Problem 3 - other factors

Architectural Differences:

Core Counts and Types:

Specialized Units:

Memory Bandwidth and Cache:

Software and Optimization:

Problem 4 - high performance indicator

As stated in the current standard, the "h" is a "high performance indicator", quote: "The optional h suffix to the compute unit count indicates high-performance (e.g. high freq or special high bandwidth gfx memory such as HBM);". This sounds reasonable but has some flaws. For example: which GPUs can come with HBM memory? To name some:

The problem is that "high performance" should mean just what it says, but the H100 and A100 are a lot faster than the V100 or P100. The same applies to the MI100 versus the MI50 and MI60. This can lead to confusion about what "high performance" really means: the lower-end GPUs mentioned are not really comparable to the newer ones when all of them share a single "h". It might help to allow up to three "h" indicators, for example: P100 and V100 get no "h", the A40 gets one "h", the A100 two ("hh"), and the H100 three ("hhh").

But where should the line be drawn? And what happens when new generations are released whose performance is a multiple of the older generation's? Another idea would be to apply the "h", "hh", and "hhh" indicators only within a single GPU generation. For Nvidia Ampere, that would look something like this: A10 no "h", A14 "h", A30 and A40 "hh", A100 "hhh".

This approach is, in my opinion, inconsistent as well, or at the very least confusing for users and for those responsible for billing these flavors.
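The two tiering ideas can be compared directly. The sketch below encodes both example assignments exactly as given in this comment and lists the models that end up with different suffixes depending on the scheme; the names `CROSS_GENERATION_TIERS`, `AMPERE_ONLY_TIERS` and `conflicting_models` are hypothetical and exist only for this illustration.

```python
# Tiers when "h" is ranked across all generations (first idea above).
CROSS_GENERATION_TIERS = {
    "P100": "",
    "V100": "",
    "A40": "h",
    "A100": "hh",
    "H100": "hhh",
}

# Tiers when "h" is ranked only within Nvidia Ampere (second idea above).
AMPERE_ONLY_TIERS = {
    "A10": "",
    "A14": "h",
    "A30": "hh",
    "A40": "hh",
    "A100": "hhh",
}


def conflicting_models(first: dict, second: dict) -> dict:
    """Return models present in both schemes whose suffixes differ."""
    shared = sorted(first.keys() & second.keys())
    return {m: (first[m], second[m]) for m in shared if first[m] != second[m]}


for model, (cross, within) in conflicting_models(
    CROSS_GENERATION_TIERS, AMPERE_ONLY_TIERS
).items():
    print(f"{model}: {cross!r} across generations vs. {within!r} within Ampere")
```

Both models that appear in both examples (A40 and A100) receive a different number of "h" depending on the scheme, which is the kind of ambiguity that makes the indicator hard to rely on for users and for billing.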

Proposals: