Closed ruddradev closed 2 years ago
I'm not that familiar with MIG, but Nvidia's documentation implies that it only supports CUDA: "The new Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization." (emphasis added) https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#introduction
Hello Erik,
Thanks for the response. One observation: when I tried to copy a PyTorch tensor from the CPU to cuda:0, it threw an error inside a MIG instance, possibly because the device ID there is something other than 0. Copying it from the CPU to the device "cuda" (with no index) runs successfully.
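For anyone hitting the same thing, here is a minimal sketch of the pattern that worked, assuming PyTorch is installed. The helper name `pick_device` is made up for illustration; the point is only that a plain "cuda" refers to the current CUDA device, so no ordinal is hard-coded.

```python
from typing import Optional

try:
    import torch
except ImportError:
    torch = None  # the helper below is still usable without PyTorch


def pick_device(cuda_available: bool, index: Optional[int] = None) -> str:
    """Build a device string without hard-coding a GPU ordinal.

    `pick_device` is a hypothetical helper, not part of PyTorch or Habitat.
    """
    if not cuda_available:
        return "cpu"
    # Plain "cuda" means "the current CUDA device", whatever its ordinal is.
    return "cuda" if index is None else f"cuda:{index}"


if torch is not None:
    device = torch.device(pick_device(torch.cuda.is_available()))
    x = torch.zeros(4).to(device)  # copy from the CPU to the chosen device
    print(x.device)
```

This only affects where PyTorch places tensors; it does not change which device the simulator tries to render on.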
When loading a scene, I see that the config has a field specifying the target GPU ID. Is there a way to use the first available GPU instead? It can probably be done via system calls, but I wanted to check whether Habitat has a built-in method.
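To make the "system calls" idea concrete, here is a rough sketch that asks `nvidia-smi -L` for the list of GPUs and takes the first index. The helper names are made up, and it rests on two assumptions I have not verified: that the listing has lines of the form `GPU <n>: ...`, and that this numbering matches the CUDA device index the config field expects (which may require `CUDA_DEVICE_ORDER=PCI_BUS_ID`).

```python
import re
import subprocess
from typing import List

# Top-level GPU lines start with "GPU <n>:"; indented MIG lines do not match.
GPU_LINE = re.compile(r"^GPU (\d+):")


def parse_gpu_indices(listing: str) -> List[int]:
    """Extract GPU indices from text shaped like `nvidia-smi -L` output."""
    indices = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line)
        if match:
            indices.append(int(match.group(1)))
    return indices


def first_gpu_index(default: int = 0) -> int:
    """Return the first GPU index reported by nvidia-smi, or `default`."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return default  # nvidia-smi is missing or failed
    indices = parse_gpu_indices(result.stdout)
    return indices[0] if indices else default


print(first_gpu_index())
```

The returned value could then be written into the GPU ID field of the config before the simulator is created. This only picks an index; it would not get around an EGL failure on the device itself.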
Based on the logging, it did select the first EGL device:
found 2 EGL devices, choosing EGL device 0 for CUDA device 0
Since it is finding an EGL device but giving EGL_BAD_ACCESS when trying to initialize the context, my guess is that the implication from the docs that MIG only supports CUDA is unfortunately correct. So this may just be a limitation of MIG and we won't be able to get it to work :/
If you'd like to change the device selection logic, that lives here: https://github.com/mosra/magnum/blob/49bcbed2f4799e7b341975a5dde98d4ba4d288d8/src/Magnum/Platform/WindowlessEglApplication.cpp#L175-L223
We found a workaround by running it on a full V100 GPU. Thanks for the pointers. Hopefully it will help someone else looking to run it on an MIG. Closing the issue.
Habitat-Sim version
v0.2.1
Habitat is under active development, and we advise users to restrict themselves to stable releases. Are you using the latest release version of Habitat-Sim? Your question may already be addressed in the latest version. We may also not be able to help with problems in earlier versions because they sometimes lack the more verbose logging needed for debugging.
Main branch contains 'bleeding edge' code and should be used at your own risk.
Docs and Tutorials
Did you read the docs? https://aihabitat.org/docs/habitat-sim/
I have checked the docs, and reviewed open issues.
Did you check out the tutorials? https://aihabitat.org/tutorial/2020/
I have checked the tutorials
Perhaps your question is answered there. If not, carry on!
❓ Questions and Help
Hello Team,
I need your help understanding where I am going wrong. I had earlier run Habitat Sim and Habitat Lab on a DGX 100 with a single dedicated GPU, but now that I am trying to load the scene in a MIG instance, it fails to load.
I am using:
Image - nvidia/cudagl:11.1.1-devel-ubuntu18.04
Habitat Sim - 0.2.1
Habitat Lab - 0.2.1
nvidia-smi returns
When running the default config, last few lines of the error -
When I try to comment out the error assertion in WindowlessContext.cpp, I get -
When I try to change the GPU device to 1 instead of 0 in the config.yaml that loads the scene using Habitat Lab -
When I built Habitat without the --with-cuda flag -
When I built without the --with-cuda flag and with the fatal error assertion in WindowlessContext.cpp commented out, and then ran -