Open FlimFlamm opened 4 months ago
I don't think this is an installation / import order problem. Maybe you can step through the function _get_cuda_device and see what values are around. How does what you know about what GPUs you have match with EGL's data?
There should be one 4090 available in the system (it's a remote server, so the lack of a display might be related?)
Added comments to describe the results:
def _get_cuda_device(requested_device_id: int):
    """
    Find an EGL device with a given CUDA device ID.

    Args:
        requested_device_id: The desired CUDA device ID, e.g. "1" for "cuda:1".

    Returns:
        EGL device with the desired CUDA ID.
    """
    # `requested_device == 1` <--------------------
    num_devices = egl.EGLint()
    # num_devices.value == 0 <--------------------
    if (
        # pyre-ignore Undefined attribute [16]
        not egl.eglQueryDevicesEXT(0, None, ctypes.pointer(num_devices))
        or num_devices.value < 1
    ):
        raise RuntimeError("EGL requires a system that supports at least one device.")
    # num_devices.value == 1 (not sure why it changes before and after this if statement)
    devices = (egl.EGLDeviceEXT * num_devices.value)()  # array of size num_devices
    # len(devices) == 1 <--------------------
    if (
        # pyre-ignore Undefined attribute [16]
        not egl.eglQueryDevicesEXT(
            num_devices.value, devices, ctypes.pointer(num_devices)
        )
        or num_devices.value < 1
    ):
        raise RuntimeError("EGL sees no available devices.")
    if len(devices) < requested_device_id + 1:
        raise ValueError(
            f"Device {requested_device_id} not available. Found only {len(devices)} devices."
        )
    # num_devices.value == 1 <--------------------

    # Iterate over all the EGL devices, and check if their CUDA ID matches the request.
    for device in devices:
        available_device_id = egl.EGLAttrib(ctypes.c_int(-1))
        # available_device_id.contents.value == -1 <--------------------
        # pyre-ignore Undefined attribute [16]
        egl.eglQueryDeviceAttribEXT(device, EGL_CUDA_DEVICE_NV, available_device_id)
        if available_device_id.contents.value == requested_device_id:
            return device
    raise ValueError(
        f"Found {len(devices)} CUDA devices, but none with CUDA id {requested_device_id}."
    )
It's finding a device, somehow, but its index is -1...
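Two of the annotated values can probably be explained by how results come back through pointers. The sketch below does not call EGL at all: `fake_query_devices` and `fake_query_attrib` are hypothetical stand-ins built with `ctypes`, and they only mimic the calling convention. It shows that a count passed via `ctypes.pointer(...)` is overwritten by the call inside the `if` condition, and that a value initialised to -1 stays at -1 when the query reports failure without writing anything. Whether the real `eglQueryDeviceAttribEXT` call is failing on this server is an assumption that would need checking against its return value.

```python
import ctypes

# Hypothetical stand-ins for the EGL entry points. They are NOT the real
# functions; they only reproduce the "result via pointer" convention.
QueryCount = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ctypes.c_int))
QueryAttrib = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_int)
)


def _count_impl(out_count):
    out_count[0] = 1  # report one device through the out-parameter
    return 1  # non-zero means success


def _attrib_impl(has_cuda_id, out_value):
    if not has_cuda_id:
        return 0  # failure: out_value is left untouched
    out_value[0] = 0  # pretend the device maps to CUDA id 0
    return 1


fake_query_devices = QueryCount(_count_impl)
fake_query_attrib = QueryAttrib(_attrib_impl)

# 1) The count changes because the callee writes through the pointer.
num_devices = ctypes.c_int()
count_before = num_devices.value
fake_query_devices(ctypes.pointer(num_devices))
count_after = num_devices.value


# 2) A sentinel of -1 survives a failed query.
def query_cuda_id(has_cuda_id):
    value = ctypes.c_int(-1)  # same sentinel as in _get_cuda_device
    ok = fake_query_attrib(int(has_cuda_id), ctypes.pointer(value))
    return bool(ok), value.value


print(count_before, count_after)  # 0 1
print(query_cuda_id(True))  # (True, 0)
print(query_cuda_id(False))  # (False, -1)
```

Applied to the real function, the equivalent check would be to store the return value of the attribute query and print it next to `available_device_id.contents.value`, reading the value after the call rather than before it.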
I'll try this on my local workstation as soon as I get a chance, to confirm or rule out the missing display as the cause.
EDIT: Finding some sources claiming EGL requires a display. Fingers crossed they're wrong XD
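Before comparing the remote server with a local machine, it may help to record the environment each one runs in. The helper below is a hypothetical sketch (the name `describe_gl_environment` is made up for illustration) that only reads three environment variables: `DISPLAY`, `PYOPENGL_PLATFORM` (which PyOpenGL uses to pick its backend), and `CUDA_VISIBLE_DEVICES`. It does not prove whether EGL needs a display; it just makes the two setups easier to diff.

```python
import os


def describe_gl_environment(env=None):
    """Collect environment variables that commonly affect headless rendering."""
    env = os.environ if env is None else env
    return {
        # True when an X display is advertised to this process.
        "has_display": bool(env.get("DISPLAY")),
        # Backend PyOpenGL will select, e.g. "egl"; None if unset.
        "pyopengl_platform": env.get("PYOPENGL_PLATFORM"),
        # Restricts and renumbers the GPUs CUDA can see; None if unset.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }


print(describe_gl_environment())
```

Running it on both machines and comparing the output would show whether the only difference is the display, or whether the backend or GPU visibility also differs.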
❓ How to properly use MeshRasterizerOpenGL
I'm looking for help/guidance (or a pointer to sample usage code!) regarding the MeshRasterizerOpenGL...
I'm multiview-rendering a large number of meshes (many of them are very large), and I was hoping to speed up the process with the OpenGL rasterizer, which is said to be faster for large meshes and for multi-render scenarios.
The error I'm currently stuck on is rather confusing: it can't find a CUDA device with index 0 (the index it does find is apparently -1).
The problem seems to be EGL related, which is where I'm hoping for some guidance. According to the docs, I shouldn't need to create any EGL contexts manually, and I should be able to just hot-swap the MeshRasterizer for the MeshRasterizerOpenGL. I suspect that my code is misusing one or more classes, that I have a version compatibility issue somewhere, or that I have run afoul of some subtle issue (like importing OpenGL and pytorch3d in the wrong order?)
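The import-order suspicion can be tested cheaply. PyOpenGL chooses its platform backend when it is first imported, based on `PYOPENGL_PLATFORM`, so setting the variable afterwards has no effect on a module that is already loaded. The sketch below is a hypothetical helper (`prepare_egl_platform` is not part of pytorch3d or PyOpenGL) meant to run at the very top of a script: it reports whether `OpenGL` was already imported and, if not, requests the EGL backend. That pytorch3d needs this exact setup is an assumption here, not something confirmed in this thread.

```python
import os
import sys


def prepare_egl_platform(modules=None, env=None):
    """Request PyOpenGL's EGL backend if OpenGL has not been imported yet.

    Returns False when OpenGL is already loaded, in which case changing
    the environment variable would come too late to matter.
    """
    modules = sys.modules if modules is None else modules
    env = os.environ if env is None else env
    if "OpenGL" in modules:
        return False
    # Keep any value the user set explicitly; default to EGL otherwise.
    env.setdefault("PYOPENGL_PLATFORM", "egl")
    return True


if not prepare_egl_platform():
    print("OpenGL was imported before this check; backend already chosen.")
```

If the function returns False in the failing script, moving the pytorch3d and OpenGL imports below this call would be a quick way to confirm or eliminate import order as the cause.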
Here are the relevant versions
conda:
pip:
Here is some sample code to show how I'm using the MeshRasterizerOpenGL class:
Any pointers at all would be much appreciated; if info relevant to the issue is missing, please don't hesitate to ask and I'll provide it ASAP.