yuezhu1 opened 5 months ago
If MIG is supported, we need to add MIG support for energy consumption as well. We need to think carefully about how MIG energy can be estimated when only one of the many MIG slices on a GPU is in use.
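As a starting point for the discussion, one naive approach is to apportion the GPU's measured energy across MIG instances by their compute-slice share, which can be read off the profile name (e.g. `3g.40gb` uses 3 of an A100's 7 compute slices). This is a sketch only; the function names are illustrative, it is not Kepler's actual method, and it ignores idle power and the activity of other tenants on the same GPU:

```python
# Hypothetical sketch (illustrative, not Kepler's method): split total GPU
# energy among MIG instances in proportion to their compute-slice share.

def slice_fraction(profile: str, total_slices: int = 7) -> float:
    """Fraction of the GPU's compute slices a MIG profile uses.

    Profile names like '3g.40gb' encode the compute-slice count before
    the 'g.'; an A100 has 7 compute slices in total.
    """
    slices = int(profile.split("g.")[0])
    return slices / total_slices

def estimate_mig_energy(gpu_energy_joules: float, profile: str) -> float:
    """Naive proportional estimate; ignores idle power and other tenants."""
    return gpu_energy_joules * slice_fraction(profile)

# Example: a 3g.40gb slice on a GPU that consumed 700 J over the window
print(estimate_mig_energy(700.0, "3g.40gb"))  # ~300 J
```

The hard part, as noted above, is that a fair split really needs to account for what the other MIG instances on the GPU are doing, which this proportional sketch does not.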
Hi @rinana, would you mind if we use this issue to discuss how we should enable energy estimation for MIG? I feel the most difficult question for us to answer is how we can know about the other MIG instances running on the same GPU.
Thanks @rinana. This would be awesome! We can set the default energy metrics for MIG to Kepler's MIG metrics. I wonder if you have time to enable MIG energy with Kepler, since I don't have a Kepler-enabled environment on hand.
Hi Yue (@yuezhu1), keeping energy deployments aside for now, we have been able to leverage fmperf for MIG deployments.
An example test would be:

`kubectl label nodes $NODE nvidia.com/mig.config=all-3g.40gb --overwrite`

`kubectl describe node $NODE`

You should see a label on the node along the following lines:

`"nvidia.com/gpu.product": "A100-SXM4-80GB-MIG-3g.40gb"`

Setting `cluster_gpu_name` here to `A100-SXM4-80GB-MIG-3g.40gb`, one should be all set? Is that not the case? Am I missing something?
@rohanarora
As Yue mentioned in the first post of this issue, we need to replace `"nvidia.com/gpu": str(model.num_gpus)` with `"nvidia.com/mig-3g.40gb": "1"` to use a 3g.40gb MIG slice (see here). This value depends on the MIG profile size.
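The substitution above could be wrapped in a small helper so the profile is not hard-coded. This is a hypothetical sketch (`gpu_resource_request` is not an existing function in the codebase); it just builds the Kubernetes resource-request dict, using NVIDIA's mixed-strategy resource name `nvidia.com/mig-<profile>` when a MIG profile is given:

```python
from typing import Optional

def gpu_resource_request(num_gpus: int, mig_profile: Optional[str] = None) -> dict:
    """Build a k8s resource request for whole GPUs or a MIG profile.

    With no profile, request full GPUs; with a profile such as '3g.40gb',
    request the corresponding MIG resource instead.
    """
    if mig_profile is None:
        return {"nvidia.com/gpu": str(num_gpus)}
    return {f"nvidia.com/mig-{mig_profile}": str(num_gpus)}

print(gpu_resource_request(2))             # {'nvidia.com/gpu': '2'}
print(gpu_resource_request(1, "3g.40gb"))  # {'nvidia.com/mig-3g.40gb': '1'}
```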
I see. Thanks for pointing that out.
For simple cases where there is homogeneity, just setting `"nvidia.com/gpu": str(model.num_gpus)` to `1` and having the nodeSelector as shown here seems sufficient to leverage a MIG slice.
PS: This is requesting a MIG using a nodeSelector (under the single strategy) vs. requesting a MIG using a resource limit (under the mixed strategy).
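For reference, the two request styles look roughly like the following in a pod spec; this is a sketch with illustrative values, and the label/resource names assume NVIDIA's device plugin conventions for the single and mixed MIG strategies:

```yaml
# (1) single strategy: request a generic GPU, pin to MIG-labelled nodes
resources:
  limits:
    nvidia.com/gpu: "1"
nodeSelector:
  nvidia.com/gpu.product: A100-SXM4-80GB-MIG-3g.40gb
---
# (2) mixed strategy: request the MIG profile as a named resource
resources:
  limits:
    nvidia.com/mig-3g.40gb: "1"
```

Under the single strategy every slice on a node has the same profile, so the nodeSelector alone is enough; the mixed strategy lets different profiles coexist on one node but requires the profile-specific resource name.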
The current Cluster deployment only allows inference servers to be deployed on GPUs (see here). We also want to support servers deployed on MIG slices (and possibly other accelerators later).