SovereignCloudStack / standards

SCS standards in a machine readable format
https://scs.community/
Creative Commons Attribution Share Alike 4.0 International

[Standardization] GPU naming convention needs further refinements #366

Closed anjastrunk closed 3 weeks ago

anjastrunk commented 1 year ago

GPU naming in the SCS Flavor Naming Standard needs further refinement. The following aspects are described insufficiently:

garloff commented 1 year ago
  • [ ] GPU capabilities are added to the flavor name as an extension. See Complete Proposal for systematic flavor naming. However, the order of extensions is unclear. As there are currently three extensions, it would make sense and facilitate parsing/mapping if extensions had a dedicated order.

The order is and always has been fixed. A sentence to make this clear is added with PR #374.

garloff commented 1 year ago

[ ] GPU naming supports the suffix h, which can be set multiple times to indicate a high-performance GPU. However, "high-performance" is not explained in detail, nor is there a mapping from h, hh, hhh... to a measurable GPU property, as is done for CPU frequency. In favor of comparison and interoperability, the standard for GPU naming SHOULD be very strict and clear here.

Agreed. The wording suggests using it for HBM memory, which I think is well-defined (and meaningful, as it does make a difference). But we don't have frequency criteria listed, mainly because this would create a large table, as the notion of "high" frequency depends very much on the GPU vendor and generation. So with the current wording, we allow a vendor to use the h to differentiate two flavors where one has a higher-frequency GPU than the other (but which are otherwise the same). This is imperfect, as vendors will take different approaches to this unless we define it, so we may need to create this table ...

The other option is to narrow things down and say that h is HBM memory, period.

garloff commented 1 year ago

[ ] There are no examples provided for flavor naming with GPU support, as is done for CPU or memory

True. We could easily add this. Why not use SCS-16V-64-500s_GNa-14h as an example? (This flavor exists on one of our partner clouds.) It denotes a PCI pass-through Nvidia Ampere GPU with 14 SMs and HBM memory. (The h could also have meant an especially high frequency, but I happen to know it's HBM memory.) Want to submit a PR? Want me to do it?
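To make that example concrete, here is a minimal Python sketch that splits the GPU extension of such a flavor name into its parts. It only encodes what the comment above spells out (G for PCI pass-through, N for Nvidia, a for Ampere, the number of SMs, and the h suffix). The helper name `parse_gpu_ext`, the field names, the assumption that the GPU extension comes last in the name, and the reading of a lowercase g as "not pass-through" are illustrative assumptions, not the standard's own tooling.

```python
import re

# Assumed shape of the GPU extension: _[G/g]X[N][-M[h...]]
# Simplification: the h suffix is only recognized after the unit count M.
GPU_EXT = re.compile(
    r"_(?P<kind>[Gg])"        # G = PCI pass-through (per the example above)
    r"(?P<vendor>[A-Z])"      # vendor letter, e.g. N for Nvidia
    r"(?P<gen>[^-_]*)"        # generation, e.g. a for Ampere
    r"(?:-(?P<units>\d+)"     # number of SMs/CUs/EUs
    r"(?P<high>h*))?$"        # zero or more h suffixes
)


def parse_gpu_ext(flavor: str):
    """Return the decoded GPU extension of a flavor name, or None if absent."""
    m = GPU_EXT.search(flavor)
    if m is None:
        return None
    return {
        "passthrough": m["kind"] == "G",  # assumption: g means not pass-through
        "vendor": m["vendor"],
        "generation": m["gen"],
        "units": int(m["units"]) if m["units"] else None,
        "high": len(m["high"] or ""),     # how many times h was given
    }


print(parse_gpu_ext("SCS-16V-64-500s_GNa-14h"))
```

Note that the sketch deliberately does not say whether h means HBM or high frequency, since that is exactly the open question in this thread.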

garloff commented 1 year ago
* [ ]  [Abbreviations SMs (Streaming Multiprocessors) and EUs (Execution Units) are used without glossary/explanation. #375](https://github.com/SovereignCloudStack/standards/issues/375)

That can easily be addressed. I just added it to the PR #374 as it fit nicely.

garloff commented 1 year ago

Want to submit a PR? Want me to do it?

Added it also to PR #374.

garloff commented 1 year ago
* [ ]  According to SCS flavor naming, the GPU generation, such as Ampere or Hopper for Nvidia, can be defined by adding an appropriate suffix to the GPU definition. IMO, stating the generation is not sufficient, as there is a huge performance difference between A40 and A100, both GPUs of generation "Ampere". Hence, we need a further refinement here to point out GPU capabilities more precisely.

The number of SMs, maybe together with the h qualifier (HBM memory), should give you an indication of how much performance to expect: _GNa-84 (A40) vs. _GNa-108h (A100).

garloff commented 1 year ago

Standard does not support definition of number of physical or virtual GPUs

True, that is a real limitation.

Another (and maybe more important) missing piece is that we don't specify the amount of VRAM that is available to the user, which may be a serious limitation. Does my 30b LLM model (in 4bit+ quantization, so it will require ~18GiB) fit or not?

So this would need a real extension: _[Ix][G/g]X[N][-M[h][-O[h]]], with I denoting the number of GPUs and O the amount of memory (in GiB). An h behind M (the number of SMs/CUs/EUs) would denote high frequency; an h behind O (memory) would denote memory with bandwidth > 1 TiB/s. This would be backwards compatible. If we wanted to allow for heterogeneous GPUs, we could allow multiple of these options. Obviously, you may be able to come up with a better proposal.
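The proposed pattern can be sketched as a small parser to check that it really stays backwards compatible. This is only an illustration under assumptions: the function name `parse_proposed` and the field names are made up, an omitted count is taken to mean one GPU, the GPU extension is assumed to be the last part of the flavor name, and the flavor name with a memory field used below is hypothetical.

```python
import re

# Sketch of the proposed extension: _[Ix][G/g]X[N][-M[h][-O[h]]]
PROPOSED = re.compile(
    r"_(?:(?P<count>\d+)x)?"                  # I: number of GPUs (optional)
    r"(?P<kind>[Gg])"                         # G/g as in the existing scheme
    r"(?P<vendor>[A-Z])"                      # X: vendor letter
    r"(?P<gen>[^-_]*)"                        # N: generation (optional)
    r"(?:-(?P<units>\d+)(?P<hfreq>h?)"        # M: SMs/CUs/EUs, h = high freq.
    r"(?:-(?P<mem>\d+)(?P<hbw>h?))?)?$"       # O: memory in GiB, h = > 1 TiB/s
)


def parse_proposed(flavor: str):
    """Decode the proposed GPU extension; returns None if there is none."""
    m = PROPOSED.search(flavor)
    if m is None:
        return None
    return {
        "count": int(m["count"]) if m["count"] else 1,  # assumption: default 1
        "passthrough": m["kind"] == "G",
        "vendor": m["vendor"],
        "generation": m["gen"],
        "units": int(m["units"]) if m["units"] else None,
        "high_freq": m["hfreq"] == "h",
        "mem_gib": int(m["mem"]) if m["mem"] else None,
        "high_bandwidth": m["hbw"] == "h",
    }


# An existing name still parses (backwards compatible) ...
print(parse_proposed("SCS-16V-64-500s_GNa-14h"))
# ... and a hypothetical name carrying GPU count and memory parses too.
print(parse_proposed("SCS-16V-64-500s_2xGNa-108h-80h"))
```

With a `mem_gib` field available, the "does my ~18 GiB model fit" question becomes a simple comparison. Whether O should mean memory per GPU or in total is not settled by the proposal, so the sketch leaves that open.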

anjastrunk commented 1 year ago

Standard does not support definition of number of physical or virtual GPUs

True, that is a real limitation.

Another (and maybe more important) missing piece is that we don't specify the amount of VRAM that is available to the user, which may be a serious limitation. Does my 30b LLM model (in 4bit+ quantization, so it will require ~18GiB) fit or not?

So this would need a real extension: _[Ix][G/g]X[N][-M[h][-O[h]]], with I denoting the number of GPUs and O the amount of memory (in GiB). An h behind M (the number of SMs/CUs/EUs) would denote high frequency; an h behind O (memory) would denote memory with bandwidth > 1 TiB/s. This would be backwards compatible. If we wanted to allow for heterogeneous GPUs, we could allow multiple of these options. Obviously, you may be able to come up with a better proposal.

As I do not feel competent enough to judge this approach, I will forward the improvement of the GPU definition in the SCS flavor standard to our GPU expert. This may take some time.

garloff commented 11 months ago

Any feedback?

cah-patrickthiem commented 7 months ago

Just for the record: I did some research on how hyperscalers name their GPU flavors, to maybe get some inspiration or "common practices". However, none of the big players seems to have a clear naming scheme. My findings are as follows:

Microsoft Azure https://learn.microsoft.com/de-de/azure/virtual-machines/ncads-h100-v5

Google Cloud https://cloud.google.com/compute/docs/gpus?hl=de#a100-gpus

AWS https://aws.amazon.com/de/ec2/instance-types/

garloff commented 1 month ago

Sidenote: The multi-GPU feature is not yet included in #780.

garloff commented 1 month ago

Discussion seems to have continued on https://github.com/SovereignCloudStack/standards/pull/546

mbuechse commented 3 weeks ago

Can this be closed now thanks to #780?

mbuechse commented 3 weeks ago

I'm closing this. Please open a new one with whatever's remaining.