GPU-flavor-naming-refinement

After some research (see here and below), I came to the following conclusions: the GPU flavor naming should be changed so that it is clearer what to expect and, most importantly, we unfortunately cannot use a general performance indicator of the kind introduced in the prior flavor naming standard. Overall, the current standard has 4 major problems that I want to tackle here.

Derivation:

current standard:

right now the standard suggests something like this: SCS-16V-64-500s_GNa-14h
let's translate it: this flavor has 16 vCPUs, 64 GB RAM, a 500 GB disk and a passthrough (G) Nvidia (N) GPU from the "Ampere" (a) generation with 14 "streaming multiprocessors" (SMs), plus an "h" for "high performance"
so far so good
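
To make the translation above concrete, here is a minimal sketch of a parser for names shaped like this one example. The regular expression is a simplification written for this issue and is not the full SCS flavor grammar. The vendor letters "A" and "I" are assumptions (only "N" is explained above), and the disk suffix letter is captured but not interpreted.

```python
import re

# Simplified pattern for names shaped like SCS-16V-64-500s_GNa-14h.
# Assumption: this covers only the example shape, not the whole standard.
FLAVOR_RE = re.compile(
    r"^SCS-(?P<vcpus>\d+)V-(?P<ram_gb>\d+)-(?P<disk_gb>\d+)(?P<disk_type>[a-z]?)"
    r"_(?P<gpu_kind>[Gg])(?P<vendor>[NAI])(?P<generation>[a-z])"
    r"-(?P<units>\d+)(?P<high_perf>h?)$"
)

# "N" is taken from the text above; "A" and "I" are assumed vendor letters.
VENDORS = {"N": "Nvidia", "A": "AMD", "I": "Intel"}


def parse_flavor(name: str) -> dict:
    """Split a flavor name of the simplified shape into its components."""
    match = FLAVOR_RE.match(name)
    if match is None:
        raise ValueError(f"not in the simplified format: {name!r}")
    parts = match.groupdict()
    return {
        "vcpus": int(parts["vcpus"]),
        "ram_gb": int(parts["ram_gb"]),
        "disk_gb": int(parts["disk_gb"]),
        "passthrough": parts["gpu_kind"] == "G",  # capital G = passthrough
        "vendor": VENDORS[parts["vendor"]],
        "generation": parts["generation"],
        "units": int(parts["units"]),  # SMs or CUs, depending on vendor
        "high_performance": parts["high_perf"] == "h",
    }


print(parse_flavor("SCS-16V-64-500s_GNa-14h"))
```

Note that nothing in the parsed result identifies the actual GPU model, which is the point of Problem 1 below.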

Problem 1 - transparency

What GPU exactly is a GNa-14h?

according to the corresponding comment in the standard, this flavor should imply 1/4 of an Nvidia A30 GPU, hence the SM count of "14" (besides that, the capital "G" implies a passthrough GPU rather than a virtual GPU, so the "1/4" indicator does not really make sense; this looks like a mistake in the standard)
the user would simply not know that it is an A30 GPU; even if we calculate "14*4=56" to get the real number of SMs for this GPU, the user would still not know what they get

Solution of Problem 1

make a GPU list for SCS with all corresponding specs and maybe even extend it to vGPUs where fractions of the real SM numbers are given
"There we go, fixed!" - you might think

but there is more...

Problem 2 - inconsistency

a "streaming multiprocessor" (SM) for Nvidia and a "compute unit" (CU) for AMD CANNOT be compared, at least not in a clear and logical way... Why?
there are discrepancies between the performance of a GPU and its number of SMs, CUs or similar units
this applies to GPUs from the same vendor as well as across different vendors

for Nvidia examples, see here:

the Nvidia A100 has 108 SMs
the Nvidia H100 has 114 SMs
the Nvidia L40 has 142 SMs
ranked by performance: H100 > A100 > L40
Note: the H100 is reportedly up to 4x faster than the A100 even though it has just 6 more SMs, while the L40 is significantly slower than the A100 but has more SMs

for AMD examples, see here:

AMD MI100 has 120 CUs
AMD MI250 has 208 CUs
AMD MI250x has 220 CUs

Different Performance Benchmarks (for more details see here and here):

FP64 Performance: H100 > MI250(x) > A100 > MI100
Memory Bandwidth: H100 > MI250(x) > A100 > MI100
Tensor Performance (FP16): H100 > MI250(x) > A100 > MI100

Conclusion on inconsistency: the H100 has significantly fewer SMs than the MI250(x) has CUs, yet it outperforms its AMD counterparts in all three benchmarks above. It is therefore not consistent to assume a linear, or even easily understandable, relation between SMs, CUs and performance.
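
The mismatch can be checked mechanically. The following sketch uses only the unit counts and performance orderings quoted above (it does not add benchmark data) and tests whether sorting by unit count reproduces the stated performance ranking. The function names are made up for this illustration.

```python
# Unit counts (SMs for Nvidia, CUs for AMD) as quoted in the lists above.
UNIT_COUNTS = {"H100": 114, "A100": 108, "L40": 142, "MI100": 120, "MI250": 208}

# Performance orderings as stated above, fastest first.
NVIDIA_RANKING = ["H100", "A100", "L40"]
CROSS_VENDOR_RANKING = ["H100", "MI250", "A100", "MI100"]


def rank_by_units(models):
    """Order models by unit count, highest first."""
    return sorted(models, key=lambda model: UNIT_COUNTS[model], reverse=True)


def units_predict_ranking(ranking):
    """True if sorting by unit count reproduces the performance ranking."""
    return rank_by_units(ranking) == ranking


for label, ranking in [("Nvidia", NVIDIA_RANKING),
                       ("cross-vendor", CROSS_VENDOR_RANKING)]:
    print(label, "by performance:", ranking)
    print(label, "by unit count: ", rank_by_units(ranking))
    print(label, "consistent:    ", units_predict_ranking(ranking))
```

In both cases the unit-count ordering differs from the performance ordering, so a bare SM or CU number in the flavor name is not a usable performance hint.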

Problem 3 - other factors

Architectural Differences:

Nvidia SMs: Each streaming multiprocessor (SM) in an Nvidia GPU contains multiple CUDA cores, Tensor cores, memory caches, and other components that handle different types of workloads. The exact configuration and capabilities of an SM can vary significantly between different Nvidia architectures (e.g., Ampere vs. Hopper).
AMD CUs: Each compute unit (CU) in an AMD GPU contains multiple stream processors (SPs), along with texture units, memory caches, and other components. The design and capabilities of CUs also vary between AMD architectures (e.g., CDNA vs. RDNA).

Core Counts and Types:

Nvidia: the number of CUDA cores per SM can vary. For example, within the Ampere architecture the GA100 chip (A100) has 64 FP32 CUDA cores per SM while the GA10x chips (e.g. A40) have 128; Turing also has 64 CUDA cores per SM, but with different performance characteristics
AMD: the number of stream processors per CU can also vary. For instance, RDNA2 architecture has 64 stream processors per CU, but the performance per stream processor can differ based on architectural enhancements

Specialized Units:

both Nvidia and AMD include specialized units in their architectures such as Tensor Cores in Nvidia GPUs for AI tasks or Ray Accelerators in AMD GPUs for ray tracing.
the presence and performance of these units can significantly affect overall GPU performance in specific workloads

Memory Bandwidth and Cache:

the memory architecture, including the type and amount of memory (HBM2, GDDR6, etc.), memory bandwidth, and cache sizes, can greatly influence performance
high memory bandwidth and large caches can improve performance for memory-intensive tasks

Software and Optimization:

the performance also depends on software, drivers, and how well applications are optimized for a specific GPU architecture
certain workloads may run more efficiently on one architecture due to better optimization and support in the software stack
e.g. AMD ROCm vs. the NVIDIA Data Center GPU Driver & CUDA Toolkit

Problem 4 - high performance indicator

As indicated in the current standard, the "h" is a "high performance indicator", quote: "The optional h suffix to the compute unit count indicates high-performance (e.g. high freq or special high bandwidth gfx memory such as HBM);". This sounds reasonable but has some flaws. For example: which GPUs come with HBM memory? To name some:

AMD MI100
AMD MI50
AMD MI60
Nvidia P100
Nvidia V100
Nvidia A100
Nvidia H100

The problem is that "high performance" should indicate just what it says, but the H100 and A100 are a lot faster than the V100 or P100. The same applies to the MI100 vs. the MI50 and MI60. This can lead to confusion about what "high performance" really means: GPUs that differ this much cannot be meaningfully grouped under a single "h". It might help to use up to three "h" indicators, for example: P100 and V100 get no "h", A40 gets one "h", A100 two ("hh") and H100 three ("hhh").

But where to draw the line here? Also, what if new generations are released whose performance is several times that of the older generation? Another idea could be to use the "h", "hh" and "hhh" indicators only within the same GPU generation. For Nvidia Ampere, for example, that would look something like this: A10 no "h", A14 "h", A30 and A40 "hh", A100 "hhh".

This approach is, in my opinion, inconsistent as well, or at least can be confusing for the user and/or the ones responsible for billing those flavors.

Proposals:

get rid of SMs, CUs etc. and include the GPU model in the flavor name
- only accept GPUs that are still supported (not end of life) for those flavors: https://endoflife.date/nvidia-gpu
  - include a list in the flavor naming standard
  - problem: there is currently no such list for AMD or Intel GPUs
  - maybe we can assume that none of their server GPUs are end of life yet, since they entered the market much later than Nvidia
not sure how to handle "h", "hh" or "hhh" indicators for high performance
- proposals:
  - 1. exclude them entirely
  - 2. only assign "h" indicators within one GPU generation
    - e.g. for the Nvidia Ampere generation:
      - A10 gets one "h"
      - A30 and A40 get two ("hh")
      - A100 gets three ("hhh")
      - SCS-16V-64-500s_GN-A100-hhh
  - 3. assign "h" indicators across generations
    - A10 gets no "h"
    - A30 and A40 get one "h"
    - A100 gets two ("hh")
    - H100 gets three ("hhh")
    - SCS-16V-64-500s_GN-A30-h
- I have no strong opinion here, but I tend towards excluding it entirely
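
To compare the three options side by side, here is a small sketch that builds the GPU part of the flavor name under each of them. The "h" counts are copied from the lists above. The function name `gpu_suffix` and the hard-coded "GN" prefix (passthrough Nvidia) are assumptions made for this illustration.

```python
# "h" counts taken from the two marking options listed above.
PER_GENERATION = {"A10": 1, "A30": 2, "A40": 2, "A100": 3}  # option 2 (Ampere)
ACROSS_GENERATIONS = {"A10": 0, "A30": 1, "A40": 1, "A100": 2, "H100": 3}  # option 3


def gpu_suffix(model, scheme=None):
    """Build the GPU part of a flavor name.

    scheme=None corresponds to option 1 (no "h" indicator at all).
    """
    suffix = f"GN-{model}"
    if scheme is None:
        return suffix
    h_count = scheme[model]
    return f"{suffix}-{'h' * h_count}" if h_count else suffix


for label, scheme in [("option 1", None),
                      ("option 2", PER_GENERATION),
                      ("option 3", ACROSS_GENERATIONS)]:
    print(label, "SCS-16V-64-500s_" + gpu_suffix("A100", scheme))
```

The same A100 ends up as "hhh" under option 2 and "hh" under option 3, which shows how much the meaning of the indicator depends on the chosen scheme.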
not sure how to handle vGPUs in the flavor naming, since a GPU can be split into up to 7 fractions, meaning you can slice e.g. an Nvidia A100 into 5, 6 or 7 parts
- proposal:
  - SCS-16V-64-500s_7gN-A100 would mean this flavor is one part out of 7 parts of an A100
  - SCS-16V-64-500s_5gN-A100 would mean this flavor is one part out of 5 parts of an A100
  - concern: for virtualising an A100 GPU, we would need 6-7 flavors:
    - 2gN-A100, 3gN-A100, 4gN-A100, 5gN-A100, 6gN-A100, 7gN-A100
    - _1gN-A100 could also be needed because of virtualized passthrough
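
The concern about the number of flavors can be illustrated by enumerating the names this proposal would produce for one GPU model. This is a minimal sketch: `MAX_SLICES` reflects the limit of 7 quoted above, and the function names are hypothetical.

```python
MAX_SLICES = 7  # upper bound on fractions per GPU, as quoted above


def vgpu_suffix(model, slices):
    """GPU part of the name for one part out of `slices` parts of `model`."""
    if not 1 <= slices <= MAX_SLICES:
        raise ValueError(f"slices must be between 1 and {MAX_SLICES}")
    return f"{slices}gN-{model}"


def all_vgpu_suffixes(model, include_single=True):
    """Every slice variant a provider might have to define for one model."""
    first = 1 if include_single else 2
    return [vgpu_suffix(model, n) for n in range(first, MAX_SLICES + 1)]


for suffix in all_vgpu_suffixes("A100"):
    print("SCS-16V-64-500s_" + suffix)
```

For a single CPU/RAM/disk combination this already yields 7 names per GPU model (6 without the 1gN variant), before vRAM or "h" indicators are added.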
not sure how to handle vRAM, since some models come in more than one configuration, e.g. the A100 with 40 GB vRAM and the A100 with 80 GB vRAM
- proposals:
  - 1. always include vRAM, also for vGPUs
    - SCS-16V-64-500s_GN-A100-40g
    - SCS-16V-64-500s_GN-A100-80g
    - SCS-16V-64-500s_2gN-A100-20g
    - SCS-16V-64-500s_3gN-A100-26.7g <-- not sure here, since e.g. 1/3 of 80 GB vRAM is ugly; maybe just always round down?
      - --> SCS-16V-64-500s_3gN-A100-26g
    - advantage: see what you get
  - 2. always include vRAM, don't split for vGPUs:
    - SCS-16V-64-500s_GN-A100-40g
    - SCS-16V-64-500s_GN-A100-80g
    - SCS-16V-64-500s_2gN-A100-80g
    - SCS-16V-64-500s_2gN-A100-40g
    - SCS-16V-64-500s_GN-A10-24g
  - 3. only include vRAM in non-base-models, don't split for vGPUs:
    - SCS-16V-64-500s_GN-A100-80g
    - SCS-16V-64-500s_2gN-A100
    - SCS-16V-64-500s_2gN-A100-80g
    - advantage: less probability of error in flavor definition
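
The difference between options 1 and 2 comes down to whether the vRAM figure is divided by the number of slices. Below is a minimal sketch, assuming integer division for the "always round down" idea. The function name `vram_suffix` and its parameters are hypothetical, and option 3 (omitting vRAM for base models) is not covered.

```python
def vram_suffix(model, total_vram_gb, slices=1, split=True):
    """GPU part of the name including vRAM.

    split=True  -> option 1: vRAM per slice, rounded down to whole GB
    split=False -> option 2: always the vRAM of the full GPU
    """
    if slices < 1:
        raise ValueError("slices must be at least 1")
    vram_gb = total_vram_gb // slices if split else total_vram_gb
    prefix = "GN" if slices == 1 else f"{slices}gN"
    return f"{prefix}-{model}-{vram_gb}g"


print("SCS-16V-64-500s_" + vram_suffix("A100", 40))
print("SCS-16V-64-500s_" + vram_suffix("A100", 40, slices=2))
print("SCS-16V-64-500s_" + vram_suffix("A100", 80, slices=3))
print("SCS-16V-64-500s_" + vram_suffix("A100", 80, slices=2, split=False))
```

With rounding down, one third of an 80 GB A100 becomes "26g", which avoids fractional values in names at the cost of slightly understating the vRAM.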

SovereignCloudStack / standards

GPU-flavor-naming-refinement #546
