Open jvstme opened 2 months ago
In the context of RAM/VRAM, GB (base 10) doesn't make sense because memory is always sized in base 2. Most vendors use GB to mean GiB. This convention predates the GiB unit, e.g. NVIDIA writes that the A10 has 24GB meaning 24GiB; Linux reports memory in GB.
I think we should continue to use GB everywhere in the context of RAM/VRAM to avoid mismatch with most vendors.
So:
For storage, distinguishing GB and GiB is important.
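For reference, the two units differ by about 7%, which is exactly the kind of mismatch that makes "24GB" ambiguous. A minimal sketch (illustrative names, not dstack's actual code):

```python
# Base-10 gigabyte vs. base-2 gibibyte (illustrative constants).
GB = 10**9
GIB = 2**30

def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return gb * GB / GIB

# A "24 GB" (base-10) figure corresponds to only ~22.35 GiB, which is
# why treating vendor "GB" as GiB for RAM/VRAM avoids the mismatch.
print(round(gb_to_gib(24), 2))  # 22.35
```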
This issue is quite problematic: I requested an instance with a GPU with 24GB, and it created one with 22GB. When I then try to run a job requiring 24GB, it can't use the existing instance.
I think we should continue to use GB everywhere in the context of RAM/VRAM to avoid mismatch with most vendors.
For storage, distinguishing GB and GiB is important.
@r4victor, so our current policy is that dstack always means GiB when it says "GB", right? I think we can keep this policy as long as we document it. But then it is important to make sure we always stick to it, e.g. if some provider reports storage sizes in base-10 units, we should convert them to base-2 units.
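That conversion could look something like the sketch below. The helper name and the round-down choice are assumptions for illustration, not dstack's actual code:

```python
def normalize_storage_gib(size_bytes: int) -> int:
    """Hypothetical helper: convert a provider-reported size in bytes
    to whole GiB, rounding down so we never over-promise capacity."""
    return size_bytes // 2**30

# A disk a provider advertises as "100 GB" (base-10):
print(normalize_storage_gib(100 * 10**9))  # 93
```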
Cases 2 and 3 are apparently not related to how dstack handles GB/GiB conversions, but let me still comment on them here.
It seems like AWS returns available VRAM (22GB) instead of total GPU VRAM (24GB)?
More like AWS misreports the VRAM for L4. I compared AWS A10G and L4 instances and they both have ~22.5 GiB VRAM, as reported by nvidia-smi. Yet AWS docs and API state that the A10G has 24 GiB and the L4 has 24 GB.
We can either contact AWS or just hardcode 24 GiB for L4.
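The hardcoding option could be a small override table consulted before trusting the provider's figure. This is a sketch; the dict name and lookup are assumptions, not dstack's actual KNOWN_GPUS data:

```python
# Hypothetical per-model overrides for GPUs whose VRAM a provider
# misreports; values are nominal totals in GiB.
VRAM_OVERRIDES_GIB = {
    "L4": 24,  # AWS API says "24 GB" (base-10), i.e. ~22.35 GiB
}

def gpu_vram_gib(name: str, reported_gib: float) -> float:
    """Prefer the hardcoded total when we know the provider is wrong."""
    return VRAM_OVERRIDES_GIB.get(name, reported_gib)

print(gpu_vram_gib("L4", 22.35))   # 24
print(gpu_vram_gib("A10G", 24.0))  # 24.0
```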
The resources reported by the shim are expected to be less than the physical RAM/VRAM.
Then we could replace the values reported by nvidia-smi with the values from KNOWN_GPUS, as long as they are approximately similar. It would solve the UX issue @peterschmidt85 mentioned:
I try to run requiring 24GB and it can't use the existing instance.
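One way to sketch that "approximately similar" check (the table, function name, and 10% tolerance are all assumptions for illustration):

```python
# Illustrative known totals in GiB; not dstack's actual KNOWN_GPUS.
KNOWN_VRAM_GIB = {"A10G": 24, "L4": 24}

def effective_vram_gib(gpu_name: str, smi_gib: float,
                       tolerance: float = 0.1) -> float:
    """Snap the nvidia-smi-reported VRAM to the known nominal total
    when the two are within `tolerance`, so a 24 GiB request matches
    a GPU that reports ~22.5 GiB."""
    known = KNOWN_VRAM_GIB.get(gpu_name)
    if known is not None and abs(known - smi_gib) / known <= tolerance:
        return known  # close enough: report the nominal total
    return smi_gib  # unknown GPU or suspicious mismatch: trust nvidia-smi

print(effective_vram_gib("A10G", 22.5))     # 24
print(effective_vram_gib("Unknown", 11.9))  # 11.9
```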
Cases 2 and 3 were moved to https://github.com/dstackai/gpuhunt/issues/91 and #1523 respectively.
This issue will remain open to document that dstack uses base-2 units for everything and double-check that it is consistent with cloud providers.
When displaying instance resources, dstack uses GB as the unit for RAM, VRAM, and disk. However, in many cases the values shown actually represent GiB, not GB. Here are some examples:

- g5.xlarge on AWS is actually 16 GiB RAM and 24 GiB VRAM, not 16 GB and 24 GB.
- g6.xlarge on AWS is actually 24 GB VRAM, not 22 GB.
- VM.GPU.A10.1 on OCI is actually 240 GB RAM and 24 GB VRAM, not 236 GB and 22 GB as shown when it is added with dstack pool add-ssh.

This ambiguity makes it difficult for users to understand what resources they will actually get and may lead to offers being filtered out even though they actually match the users' requirements.