NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

H100 MIG Instances failed to destroy after usage #95

Closed Sipondo closed 1 year ago

Sipondo commented 1 year ago

Hey!

Still toying around with DCGMI, MIG, and our new H100 GPU. I know this issue isn't strictly about DCGMI, but I was hoping someone might be able to help me out. :) After using the nightly PyTorch 2.1.0 (required for H100 support) to train on two MIG compute instances at the same time, the compute instances can no longer be destroyed:

unable to destroy compute instance id  0 from gpu  0 gpu instance id  3: in use by another client
failed to destroy compute instances: in use by another client
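
For reference, the teardown that fails looks roughly like this (a sketch; the GPU, GPU instance, and compute instance IDs below are simply the ones from the error message above):

# destroy the compute instance on GPU 0, GPU instance 3 (IDs as reported in the error above)
sudo nvidia-smi mig -dci -i 0 -gi 3 -ci 0
# then destroy the parent GPU instance
sudo nvidia-smi mig -dgi -i 0 -gi 3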

Interestingly, this issue doesn't occur when I train on only one of the compute instances at a time. The configuration of the instances did not influence the behaviour; this particular example was with a 2g.20gb and a 3g.40gb instance.

Thank you for your time!

dbeer commented 1 year ago

Sipondo - in order to destroy a compute instance, the instance cannot be in use. My guess is that the PyTorch training you ran left a process that is either using the compute instance(s) or has an open handle to them.
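
A quick sketch of what I would check (not exhaustive; the exact grep pattern depends on how you launch training):

# list processes that still hold a CUDA context on any GPU or MIG device
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
# look for leftover PyTorch workers that might not show up above
ps aux | grep -iE 'python|torch' | grep -v grep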

Sipondo commented 1 year ago

Hey dbeer, thank you for your reply! Before attempting the deletions I check for running processes via nvidia-smi, but it does not list any. Additionally, our A100 machines don't have this issue. Is there any command I am missing to check for processes/handles?

nikkon-dev commented 1 year ago

@Sipondo,

You could try running lsof | grep /dev/nvidia to see which processes keep handles open. nvidia-smi will not show a process if it does not hold a GPU context.
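
Something along these lines (a rough sketch; the /dev/nvidia-caps nodes only exist when MIG is in use, and fuser may need to be installed separately):

# show processes holding file handles on the NVIDIA device nodes
sudo lsof /dev/nvidia* /dev/nvidia-caps/* 2>/dev/null
# an alternative view of the same information
sudo fuser -v /dev/nvidia* 2>/dev/null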

There is a difference between A100 and H100 in how MIG instances can be created: the A100 requires any MIG reconfiguration to be done on a detached device, while the H100 allows reconfiguring a 'live' device. However, you cannot remove a MIG instance while some application is using it.
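
For example, on H100 something like this works without detaching the device (a sketch; available profiles differ per GPU model, so check nvidia-smi mig -lgip first):

# list the current GPU instances and compute instances on GPU 0
nvidia-smi mig -lgi -i 0
nvidia-smi mig -lci -i 0
# live reconfiguration example: create a GPU instance and its compute instance in one step
# (the 2g.20gb profile is used purely as an illustration)
sudo nvidia-smi mig -cgi 2g.20gb -C -i 0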

Sipondo commented 1 year ago

Thank you for your comment, @nikkon-dev! Interestingly, lsof | grep /dev/nvidia does not yield any results.

Just for completeness' sake, running lsof | grep nvidia shows four processes: nvidia, nvidia-modeset/kthread_q, nvidia-modeset/deferred_close_kthread_q, and nvidia-persiste

When the instances are stuck this way (they can't be destroyed, but I can still run workloads on them), I cannot reset the GPU either; nvidia-smi --gpu-reset results in:

The following GPUs could not be reset:
  GPU 00000000:17:00.0: In use by another client

1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
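
Since the reset message points at Fabric Manager and monitoring applications, the obvious candidates on my side are the NVIDIA daemons. A sketch of what I'm checking (assuming the standard systemd unit names, which may differ per distro):

# check the daemons the reset error mentions
systemctl status nvidia-fabricmanager nvidia-persistenced
# DCGM's host engine also keeps GPU handles open while it is running
systemctl status nvidia-dcgm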

Update: I've been able to test a non-nightly PyTorch workload that runs on the H100. The issue also occurs when running this workload, which suggests that the problem isn't on PyTorch's end.
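
For context, the jobs are pinned to the MIG compute instances in the usual way, roughly like this (the UUIDs are placeholders and train.py stands in for the actual training script):

# list the MIG device UUIDs on the H100
nvidia-smi -L
# run one training job per compute instance by pinning CUDA_VISIBLE_DEVICES to a MIG UUID
CUDA_VISIBLE_DEVICES=MIG-11111111-2222-3333-4444-555555555555 python train.py &
CUDA_VISIBLE_DEVICES=MIG-66666666-7777-8888-9999-000000000000 python train.py &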

dbeer commented 1 year ago

@Sipondo, at this point I'm not sure why you can't destroy the compute instance. I recommend making sure you're using a Tesla Recommended Driver, and if that doesn't help you probably need to seek NVML / nvidia-smi support. Someone who works directly on those tools should be able to assist you better if the issue persists.

bstollenvidia commented 1 year ago

Here's where the nvml/nvidia-smi forums are: https://forums.developer.nvidia.com/c/developer-tools/other-tools/system-management-and-monitoring-nvml/128

Sipondo commented 1 year ago

Thanks! Continuing the discussion over at https://forums.developer.nvidia.com/t/unable-to-destroy-compute-instances-after-running-collocated-jobs-with-mig-on-h100/262786