NVIDIA / gpu-admin-tools

GPU Admin Tools. Includes Confidential Computing controls for H100, and other functionality

Support for Hopper (H100) in MIG queries? #2

Closed Banshee1221 closed 2 weeks ago

Banshee1221 commented 2 months ago

Hi there.

I love the project. Super useful tools.

I was wondering if it was on the road map to support running MIG operations against the H100? I see in the code that there is currently a check to make sure the device is Ampere: https://github.com/NVIDIA/gpu-admin-tools/blob/main/nvidia_gpu_tools.py#L3697

Unfortunately I'm not familiar enough with low-level/hardware programming to contribute. Thanks!

pjaroszynski commented 4 weeks ago

Hi,

MIG on Hopper is purely software-driven state in the driver, so it shouldn't require any additional management. On Ampere it was persistent state on the GPU, and modifying it required a GPU reset. That's why support for it was also added outside of the driver itself. Does this make sense?

What's your use case for MIG toggling on Hopper?

Banshee1221 commented 2 weeks ago

This does make sense. We've had users (mistakenly) enable MIG on some of the GPUs that my org offers (via VFIO passthrough). These settings persisted across reboots. I need to double check whether any H100s were involved, but you have me doubting that now.

Are you saying that if MIG was enabled in a VM that a client was using, that when the H100 PCIe GPU is reset (let's say the VM was destroyed and a new one was created), that the MIG status would also be reset by default?

For context, the goal is to ensure that MIG is disabled on any A100 and H100 PCIe GPUs after a VM instance is deleted.

pjaroszynski commented 2 weeks ago

> Are you saying that if MIG was enabled in a VM that a client was using, that when the H100 PCIe GPU is reset (let's say the VM was destroyed and a new one was created), that the MIG status would also be reset by default?

That's right for H100. On H100, MIG enablement is controlled by the driver, and it's all gone by the time the VM is gone and the GPU resets. I will close this issue based on your comments, but feel free to reopen if anything is unclear.
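
For anyone who wants to check the A100/H100 distinction discussed above after a VM teardown, here is a minimal Python sketch. It is not part of gpu-admin-tools. It assumes the NVIDIA driver is loaded wherever it runs (nvidia-smi will not see a GPU that is still bound to vfio-pci), and it relies on the nvidia-smi query field `mig.mode.current`. The helper names `gpus_with_mig_enabled`, `needs_explicit_disable`, and `read_mig_modes` are hypothetical, and matching on "A100" in the product name is only a rough heuristic for "MIG state persists on the GPU".

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, name, current MIG mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,mig.mode.current",
    "--format=csv,noheader",
]


def read_mig_modes():
    """Run nvidia-smi and return its raw CSV output (requires the NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def gpus_with_mig_enabled(csv_text):
    """Return [(index, name), ...] for GPUs whose current MIG mode is 'Enabled'."""
    enabled = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not match the expected three fields
        index, name, mig_mode = parts
        if mig_mode == "Enabled":
            enabled.append((int(index), name))
    return enabled


def needs_explicit_disable(name):
    """Heuristic (assumption): per the thread, A100 keeps MIG state on the GPU,
    while on H100 it is driver state that goes away with the VM and GPU reset."""
    return "A100" in name
```

A caller could pass `read_mig_modes()` to `gpus_with_mig_enabled` and, for each GPU where `needs_explicit_disable` returns true, disable MIG with `nvidia-smi -i <index> -mig 0` followed by a GPU reset.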