facebookresearch / habitat-sim

A flexible, high-performance 3D simulator for Embodied AI research.
https://aihabitat.org/
MIT License

Trouble loading scene on DGX A100 MIG Instance #1744

Closed ruddradev closed 2 years ago

ruddradev commented 2 years ago

Habitat-Sim version

v0.2.1

Habitat is under active development, and we advise users to restrict themselves to stable releases. Are you using the latest release version of Habitat-Sim? Your question may already be addressed in the latest version. We may also not be able to help with problems in earlier versions because they sometimes lack the more verbose logging needed for debugging.

Main branch contains 'bleeding edge' code and should be used at your own risk.

Docs and Tutorials

Did you read the docs? https://aihabitat.org/docs/habitat-sim/

I have checked the docs, and reviewed open issues.

Did you check out the tutorials? https://aihabitat.org/tutorial/2020/

I have checked the tutorials

Perhaps your question is answered there. If not, carry on!

❓ Questions and Help

Hello Team,

I need your help understanding where I am going wrong. I had earlier run Habitat Sim and Habitat Lab on a DGX A100 on a single dedicated GPU, but now, when I try to load a scene in a MIG instance, the scene fails to load.

I am using:

Image: nvidia/cudagl:11.1.1-devel-ubuntu18.04
Habitat Sim: 0.2.1
Habitat Lab: 0.2.1

nvidia-smi returns

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                   On |
| N/A   48C    P0   191W / 400W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    4   0   0  |      6MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
|                  |      2MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |

When running the default config, the last few lines of the output are:

I0429 12:03:53.571182  8778 AssetAttributesManager.cpp:120] Asset attributes (cylinderWireframe : cylinderWireframe_rings_1_segments_32_halfLen_1) created and registered.
I0429 12:03:53.571211  8778 AssetAttributesManager.cpp:120] Asset attributes (icosphereSolid : icosphereSolid_subdivs_1) created and registered.
I0429 12:03:53.571236  8778 AssetAttributesManager.cpp:120] Asset attributes (icosphereWireframe : icosphereWireframe_subdivs_1) created and registered.
I0429 12:03:53.571270  8778 AssetAttributesManager.cpp:120] Asset attributes (uvSphereSolid : uvSphereSolid_rings_8_segments_16_useTexCoords_false_useTangents_false) created and registered.
I0429 12:03:53.571301  8778 AssetAttributesManager.cpp:120] Asset attributes (uvSphereWireframe : uvSphereWireframe_rings_16_segments_32) created and registered.
I0429 12:03:53.571316  8778 AssetAttributesManager.cpp:108] ::constructor : Built default primitive asset templates : 12
I0429 12:03:53.575173  8778 SceneDatasetAttributesManager.cpp:36] File (default) not found, so new default dataset attributes created and registered.
I0429 12:03:53.575217  8778 MetadataMediator.cpp:127] ::createSceneDataset : Dataset default successfully created.
I0429 12:03:53.575595  8778 AttributesManagerBase.h:365] <Physics Manager>::createFromJsonOrDefaultInternal : Proposing JSON name : ./data/default.physics_config.json from original name : ./data/default.physics_config.json | This file  does not exist.
I0429 12:03:53.575654  8778 PhysicsAttributesManager.cpp:26] File (./data/default.physics_config.json) not found, so new default physics manager attributes created and registered.
I0429 12:03:53.575721  8778 MetadataMediator.cpp:212] ::setActiveSceneDatasetName : Previous active dataset  changed to default successfully.
I0429 12:03:53.575727  8778 MetadataMediator.cpp:183] ::setCurrPhysicsAttributesHandle : Old physics manager attributes  changed to ./data/default.physics_config.json successfully.
I0429 12:03:53.575737  8778 MetadataMediator.cpp:68] ::setSimulatorConfiguration : Set new simulator config for scene/stage : dataset/Gibson/gibson/Adrian.glb and dataset : default which is currently active dataset.
Platform::WindowlessEglApplication: eglQueryDeviceStringEXT(EGLDevice=0): EGL_NV_device_cuda EGL_EXT_device_drm EGL_EXT_device_query_name
Platform::WindowlessEglApplication: found 2 EGL devices, choosing EGL device 0 for CUDA device 0
Platform::WindowlessEglApplication::tryCreateContext(): cannot initialize EGL: EGL_BAD_ACCESS
WindowlessContext: Unable to create windowless context
eglInitialize(): EGL_BAD_ACCESS error: In eglInitialize: EGLDisplay (0x62f8230): Backend failed to allocate context

When I comment out the error assertion in WindowlessContext.cpp, I get:

Created DrawableGroup: 
Platform::WindowlessEglApplication: eglQueryDeviceStringEXT(EGLDevice=0): EGL_NV_device_cuda EGL_EXT_device_drm EGL_EXT_device_query_name
Platform::WindowlessEglApplication: found 2 EGL devices, choosing EGL device 0 for CUDA device 0
eglInitialize(): EGL_BAD_ACCESS error: In eglInitialize: EGLDisplay (0x68dc550): Backend failed to allocate context
Platform::WindowlessEglApplication::tryCreateContext(): cannot initialize EGL: EGL_BAD_ACCESS
WindowlessContext: Unable to create windowless context

When I change the GPU device from 0 to 1 in the config.yaml that loads the scene through Habitat Lab, I get:

Platform::WindowlessEglApplication: eglQueryDeviceStringEXT(EGLDevice=0): EGL_NV_device_cuda EGL_EXT_device_drm EGL_EXT_device_query_name
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication: eglQueryDeviceStringEXT(EGLDevice=1): EGL_MESA_device_software
eglQueryDeviceAttribEXT(): eglQueryDeviceStringEXT

When I built Habitat without the --with-cuda flag, with the WindowlessContext fatal error also disabled, and ran it:

Platform::WindowlessEglApplication: eglQueryDeviceStringEXT(EGLDevice=0): EGL_NV_device_cuda EGL_EXT_device_drm EGL_EXT_device_query_name
Platform::WindowlessEglApplication: found 2 EGL devices, choosing EGL device 0 for CUDA device 0
Platform::WindowlessEglApplication::tryCreateContext(): cannot initialize EGL: EGL_BAD_ACCESS
eglInitialize(): EGL_BAD_ACCESS error: In eglInitialize: EGLDisplay (0x66749f0): Backend failed to allocate context
eglQueryString(): Invalid enum 0x3053 without a display

erikwijmans commented 2 years ago

I'm not that familiar with MIG, but Nvidia's documentation implies that it only supports CUDA: "The new Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization." (emphasis added) https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#introduction

ruddradev commented 2 years ago

Hello Erik,

Thanks for the response. One observation: when I tried to copy a PyTorch tensor from cpu to cuda:0, it threw an error when running in a MIG instance, possibly because the device ID is something other than 0. But when I copy it from cpu to device: cuda, it runs successfully.

When loading a scene, I see that the config has a field that specifies the target GPU ID. Is there a way to use the first available GPU? There may be a way to do it via system calls, but I wanted to check whether Habitat has a built-in method.

erikwijmans commented 2 years ago

Based on the logging, it did select the first EGL device:

found 2 EGL devices, choosing EGL device 0 for CUDA device 0

Since it is finding an EGL device but giving EGL_BAD_ACCESS when trying to initialize the context, my guess is that the implication from the docs that MIG only supports CUDA is unfortunately correct. So this may just be a limitation of MIG and we won't be able to get it to work :/

If you'd like to change the device selection logic, that lives here: https://github.com/mosra/magnum/blob/49bcbed2f4799e7b341975a5dde98d4ba4d288d8/src/Magnum/Platform/WindowlessEglApplication.cpp#L175-L223
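For anyone triaging similar output, the reasoning above (an EGL device is found for the CUDA device, but initializing the context fails) can be expressed as a small log check. This is only a minimal sketch and not part of Habitat-Sim or Magnum: the function name diagnose_egl_log and the keys of the returned dictionary are hypothetical, and the patterns match only the message formats quoted in this thread.

```python
import re

# Patterns for the three messages quoted in this thread (assumed stable only
# for the Magnum version that produced them).
CHOSEN = re.compile(
    r"found (\d+) EGL devices, choosing EGL device (\d+) for CUDA device (\d+)"
)
NOT_FOUND = re.compile(r"unable to find EGL device for CUDA device (\d+)")
INIT_FAILED = re.compile(r"cannot initialize EGL: (\w+)")


def diagnose_egl_log(log_text):
    """Summarize EGL device selection and initialization from log text."""
    result = {
        "egl_device": None,             # EGL device index that was chosen
        "cuda_device": None,            # CUDA device it was matched to
        "unmatched_cuda_device": None,  # CUDA device with no EGL counterpart
        "init_error": None,             # EGL error name if eglInitialize failed
    }
    for line in log_text.splitlines():
        match = CHOSEN.search(line)
        if match:
            result["egl_device"] = int(match.group(2))
            result["cuda_device"] = int(match.group(3))
        match = NOT_FOUND.search(line)
        if match:
            result["unmatched_cuda_device"] = int(match.group(1))
        match = INIT_FAILED.search(line)
        if match:
            result["init_error"] = match.group(1)
    return result


# Two lines copied from the output posted above.
sample = (
    "Platform::WindowlessEglApplication: found 2 EGL devices, "
    "choosing EGL device 0 for CUDA device 0\n"
    "Platform::WindowlessEglApplication::tryCreateContext(): "
    "cannot initialize EGL: EGL_BAD_ACCESS\n"
)
summary = diagnose_egl_log(sample)
print(summary)
```

If egl_device is set and init_error is also set, device selection worked and the failure is in context creation, which is the case described in this issue.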

ruddradev commented 2 years ago

We found a workaround by running it on a full V100 GPU. Thanks for the pointers. Hopefully it will help someone else looking to run it on an MIG. Closing the issue.
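One way to catch this setup before a scene load fails is to check whether the visible GPU is a MIG slice. The following is a minimal sketch, assuming that `nvidia-smi -L` lists MIG devices on indented lines of the form `MIG <profile> Device <n>: (UUID: ...)`; the exact format may vary across driver versions. The helper names, the sample listing, and its UUIDs are made up for illustration.

```python
import re
import shutil
import subprocess

# Assumed line format: "  MIG 2g.10gb Device 0: (UUID: MIG-...)"
MIG_LINE = re.compile(r"^\s*MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s*(\S+?)\)")


def parse_mig_devices(listing):
    """Return one dict per MIG device line found in `nvidia-smi -L` output."""
    devices = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            devices.append(
                {
                    "profile": match.group(1),
                    "index": int(match.group(2)),
                    "uuid": match.group(3),
                }
            )
    return devices


def list_mig_devices():
    """Run `nvidia-smi -L` if it is on PATH and parse its output."""
    if shutil.which("nvidia-smi") is None:
        return []
    completed = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return parse_mig_devices(completed.stdout)


# Illustrative listing with placeholder UUIDs, not taken from a real machine.
sample_listing = (
    "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-example)\n"
    "  MIG 2g.10gb Device 0: (UUID: MIG-00000000-example)\n"
)
mig_devices = parse_mig_devices(sample_listing)
if mig_devices:
    print("MIG devices detected; EGL rendering may fail as described above.")
```

A launcher script could call list_mig_devices() at startup and print a warning, or fall back to a non-MIG GPU, before creating the simulator.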