dstackai / dstack

dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

Fix GB/GiB ambiguity #1439

Open jvstme opened 2 months ago

jvstme commented 2 months ago

When displaying instance resources, dstack uses GB as the unit for RAM, VRAM, and disk. However, in many cases the values shown actually represent GiB, not GB. Here are some examples:

  1. g5.xlarge on AWS is actually 16 GiB RAM and 24 GiB VRAM, not 16 GB and 24 GB.
    > dstack run . -b aws --gpu A10G
    [... cut for brevity ...]                               
    #  BACKEND  REGION     INSTANCE   RESOURCES                                   SPOT  PRICE    
    1  aws      us-east-1  g5.xlarge  4xCPU, 16GB, 1xA10G (24GB), 100.0GB (disk)  no    $1.006
  2. g6.xlarge on AWS is actually 24 GB VRAM, not 22 GB.
    > dstack run . -b aws --gpu L4
    [... cut for brevity ...]
    #  BACKEND  REGION     INSTANCE   RESOURCES                                 SPOT  PRICE     
    1  aws      us-east-2  g6.xlarge  4xCPU, 16GB, 1xL4 (22GB), 100.0GB (disk)  no    $0.8048
  3. VM.GPU.A10.1 on OCI is actually 240 GB RAM and 24 GB VRAM, not 236 GB and 22 GB as shown when it is added with dstack pool add-ssh.

    > dstack pool ps                                           
    Pool name  default-pool
    
    INSTANCE        BACKEND  REGION  RESOURCES                                   SPOT  PRICE  STATUS  CREATED   
    tough-kangaroo  ssh      remote  30xCPU, 236GB, 1xA10 (22GB), 33.8GB (disk)  no    $0.0   idle    1 min ago

This ambiguity makes it difficult for users to understand what resources they will actually get, and it may lead to offers being filtered out even though they actually match the users' requirements.
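
For reference, the gap between the two units is roughly 7% at the giga scale, which is enough to shift a value across a filter threshold. A minimal illustration (the helper name is ours, not dstack code):

```python
GIB = 1024 ** 3  # gibibyte (base-2)
GB = 1000 ** 3   # gigabyte (base-10)

def gib_to_gb(gib: float) -> float:
    """Express a base-2 size in base-10 gigabytes."""
    return gib * GIB / GB

# 16 GiB of RAM is ~17.18 GB, so the label "16GB" understates it by ~7%
print(round(gib_to_gb(16), 2))  # 17.18
print(round(gib_to_gb(24), 2))  # 25.77
```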

r4victor commented 2 months ago

In the context of RAM/VRAM, GB (base 10) doesn't make sense because memory sizes are always powers of two. Most vendors use GB to mean GiB; this convention predates the GiB unit. For example, NVIDIA writes that the A10 has 24GB meaning 24GiB, and Linux similarly labels memory sizes with base-10 units while reporting base-2 values.

I think we should continue to use GB everywhere in the context of RAM/VRAM to avoid mismatch with most vendors.

So:

  1. 16 GB and 24 GB is fine.
  2. It seems like AWS returns available VRAM (22GB) instead of total GPU VRAM (24GB)?
  3. The resources reported by the shim are expected to be less than physical RAM/VRAM (some reserved RAM/VRAM may not be reported by /proc/meminfo and nvidia-smi).

For storage, distinguishing GB and GiB is important.

peterschmidt85 commented 2 months ago

This issue is quite problematic: I requested an instance with a 24GB GPU, and dstack created one showing 22GB. When I then try to run a configuration requiring 24GB, it can't use the existing instance.

jvstme commented 2 months ago

> I think we should continue to use GB everywhere in the context of RAM/VRAM to avoid mismatch with most vendors.
>
> For storage, distinguishing GB and GiB is important.

@r4victor, so our current policy is that dstack always means GiB when it says "GB", right? I think we can keep this policy as long as we document it. But then it is important to make sure we always stick to it, e.g. if some provider reports storage sizes in base-10 units, we should convert them to base-2 units.
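
One way to enforce such a policy is to normalize every provider-reported size to GiB at ingestion time. A hypothetical sketch (the function and unit table are illustrative, not dstack's actual code):

```python
# Bytes per unit; providers may report in either base-10 or base-2 units.
_UNIT_BYTES = {
    "GB": 1000 ** 3,
    "GiB": 1024 ** 3,
    "TB": 1000 ** 4,
    "TiB": 1024 ** 4,
}

def to_gib(value: float, unit: str) -> float:
    """Normalize a provider-reported size to GiB (the internal base-2 unit)."""
    return value * _UNIT_BYTES[unit] / 1024 ** 3

# A "100 GB" base-10 disk holds only ~93.1 GiB
print(round(to_gib(100, "GB"), 1))  # 93.1
print(to_gib(16, "GiB"))            # 16.0
```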

jvstme commented 2 months ago

Cases 2 and 3 are apparently not related to how dstack handles GB/GiB conversions, but let me still comment on them here.

> It seems like AWS returns available VRAM (22GB) instead of total GPU VRAM (24GB)?

More like AWS misreports the VRAM for the L4. I compared AWS A10G and L4 instances: both have ~22.5 GiB of VRAM as reported by nvidia-smi, yet the AWS docs and API state that the A10G has 24 GiB and the L4 has 24 GB.

We can either contact AWS or just hardcode 24 GiB for L4.

> The resources reported by the shim are expected to be less than physical RAM/VRAM

Then we could replace the values reported by nvidia-smi with the values from KNOWN_GPUS, as long as they are sufficiently close. That would solve the UX issue @peterschmidt85 mentioned:

> I try to run requiring 24GB and it can't use the existing instance.
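
That replacement could look roughly like this (a sketch only: the table contents, the tolerance value, and the function name are assumptions, and the dict below is merely a stand-in for the real KNOWN_GPUS lookup):

```python
# Hypothetical nominal memory sizes in MiB, standing in for KNOWN_GPUS.
NOMINAL_GPU_MIB = {"A10": 24 * 1024, "A10G": 24 * 1024, "L4": 24 * 1024}

def snap_to_nominal(gpu_name: str, reported_mib: int, tolerance: float = 0.1) -> int:
    """Return the nominal memory size when the value reported by
    nvidia-smi is within `tolerance` of it; otherwise keep the report."""
    nominal = NOMINAL_GPU_MIB.get(gpu_name)
    if nominal and abs(nominal - reported_mib) / nominal <= tolerance:
        return nominal
    return reported_mib

# ~22.5 GiB reported by nvidia-smi snaps to the nominal 24 GiB (24576 MiB)
print(snap_to_nominal("L4", 23028))    # 24576
# An unknown GPU keeps the reported value unchanged
print(snap_to_nominal("H200", 143771))  # 143771
```

A user requesting a 24GB GPU would then match an instance whose driver reports ~22.5 GiB, because both sides compare the same nominal value.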

jvstme commented 1 month ago

Cases 2 and 3 were moved to https://github.com/dstackai/gpuhunt/issues/91 and #1523 respectively.

This issue will remain open to document that dstack uses base-2 units for everything and double-check that it is consistent with cloud providers.