NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Parallel Rendering through PyOpenGL #104

Closed nishadgothoskar closed 1 year ago

nishadgothoskar commented 1 year ago

I am attempting to use glMultiDrawElementsIndirect to implement parallel batched rendering using PyOpenGL and am running into a variety of issues, mostly due to my inexperience with OpenGL. I am following your implementation in nvdiffrast/common/rasterize_gl.cpp since it's the only readable usage of glMultiDrawElementsIndirect that I've found thus far. Any help with the following issues would be much appreciated 🙏:

I am using a multilayered image buffer (width × height × batch_size):

color_tex = glGenTextures(1)
glBindTexture(GL_TEXTURE_2D_ARRAY, color_tex)
glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, color_tex, 0)
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA32F, width, height, batch_size, 0,
                GL_RGBA, GL_UNSIGNED_BYTE, None)

and then doing rendering into it using:

indirect = np.array([
    [indices.shape[0]*3, batch_size, 0, 0, 0, 1]
    for _ in range(batch_size)
    ], dtype=np.uint32)
glMultiDrawElementsIndirect(GL_TRIANGLES,
    GL_UNSIGNED_INT,
    indirect,
    batch_size,
    indirect.dtype.itemsize * indirect.shape[1]
)
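For reference, here is a hedged sketch (not from this thread) of the canonical indirect-draw record that glMultiDrawElementsIndirect reads: the OpenGL spec's DrawElementsIndirectCommand is five tightly packed GLuint fields, so a padding-free array has a 20-byte stride. A NumPy structured dtype makes the stride arithmetic explicit:

```python
import numpy as np

# Canonical DrawElementsIndirectCommand from the OpenGL spec:
# five tightly packed GLuint fields per draw record.
draw_cmd = np.dtype([
    ("count",         np.uint32),  # number of indices to draw
    ("instanceCount", np.uint32),  # instances per draw record
    ("firstIndex",    np.uint32),  # offset into the index buffer
    ("baseVertex",    np.uint32),  # added to each fetched index
    ("baseInstance",  np.uint32),  # first instance ID (often used to route layers)
])

def make_indirect(num_triangles, batch_size):
    """Build one draw record per batch image (hypothetical helper)."""
    cmds = np.zeros(batch_size, dtype=draw_cmd)
    cmds["count"] = num_triangles * 3              # three indices per triangle
    cmds["instanceCount"] = 1                      # one instance per record
    cmds["baseInstance"] = np.arange(batch_size)   # distinguish the draws
    return cmds

cmds = make_indirect(num_triangles=100, batch_size=8)
# Tightly packed records: stride is 5 * 4 = 20 bytes.
assert cmds.dtype.itemsize == 20
```

Whatever layout is used, the stride argument passed to glMultiDrawElementsIndirect must match the actual record size in the buffer; records wider than five fields are legal, and the trailing entries are simply skipped.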

It seems to work, but the rendered images look like this:

[image]

when they are actually supposed to look like:

[image]

I can't track down the cause of this mismatch.

What's even more odd is that when I add this depth framebuffer:

glBindTexture(GL_TEXTURE_2D_ARRAY, depth_tex)
glFramebufferTexture(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, depth_tex, 0)
glTexImage3D.wrappedOperation(
    GL_TEXTURE_2D_ARRAY, 0, GL_DEPTH24_STENCIL8, width, height, batch_size, 0, 
    GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, None
);

the image now comes out correctly. But it only writes to the first layer of the multilayer framebuffer.

Any ideas on where to look next would be much appreciated!

Probably unrelated, but I saw this in the code:

    // Enable depth modification workaround on A100 and later.
    int capMajor = 0;
    NVDR_CHECK_CUDA_ERROR(cudaDeviceGetAttribute(&capMajor, cudaDevAttrComputeCapabilityMajor, cudaDeviceIdx));
    s.enableZModify = (capMajor >= 8);

I am running on an A100. Might that be related to this?

nishadgothoskar commented 1 year ago

I found in the OpenGL docs that "if no depth image is attached, Depth Testing will be disabled when rendering to this FBO," so I guess this explains the mismatch. However, the question remains: how can I enable depth testing and still properly render multiple images?

s-laine commented 1 year ago

Did it render to all layers before you added the depth buffer? Do you clear the depth buffer in the right place? When setting the multilayer depth buffer, you're doing glFramebufferTexture(..., GL_DEPTH_ATTACHMENT, ...) which should probably have GL_DEPTH_STENCIL_ATTACHMENT instead, because that's what you set with glTexImage3D. Are you sure you're copying back all the layers of the framebuffer and not just the first one? I have no experience with PyOpenGL, so it could be that it treats input/output buffers in some way you don't expect.

The A100 thing would almost certainly look different (see issue #62).

There's also the remote possibility of a bug in the PyOpenGL wrapper: as you have noticed, these are somewhat less-used API functions, and an obscure bug might not be found for a long while. If you get stuck and can't find anything wrong with your code, you may have to try replicating your test in C/C++.

nishadgothoskar commented 1 year ago

Thanks for the fast response! OK, I will fix the GL_DEPTH_STENCIL_ATTACHMENT inconsistency and give it a shot. You're right, maybe it's worth implementing this in C++ just to ensure that I am not hitting some bug in PyOpenGL. I will get back to you soon.

nishadgothoskar commented 1 year ago

OK, after some digging, I ended up getting it to work through PyOpenGL. There were some issues with the ordering of the OpenGL calls, so I matched the ordering from your code and it ended up working. However, I realized that the overhead of copying thousands of images back from GPU to CPU dominates any speedup I got from using this code in the first place.
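The readback overhead is easy to estimate with back-of-the-envelope arithmetic. The image size below (128×128 RGBA32F) is a hypothetical assumption, not a figure from this thread:

```python
def readback_bytes(width, height, channels, bytes_per_channel, batch_size):
    """Total bytes transferred GPU -> CPU per batch (simple arithmetic)."""
    return width * height * channels * bytes_per_channel * batch_size

# Hypothetical example: 1024 images of 128x128 RGBA32F
# (4 channels, 4 bytes per channel).
total = readback_bytes(128, 128, 4, 4, 1024)
assert total == 256 * 1024 * 1024  # 256 MiB per batch
```

At typical effective PCIe transfer rates of a few GB/s to ~10 GB/s, that is tens of milliseconds per batch, which can easily swamp the rendering time itself and motivates keeping the downstream computation on the GPU.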

So, where I am going next is implementing the computations I want (a probabilistic likelihood calculation) directly in CUDA kernels. I have got this working with just some simple extensions to your code, but I am now facing some odd OpenGL issues:

/home/nishadgothoskar/jax3dp3/jax3dp3/nvdiffrast/common/rasterize_gl.cpp:436:9: error: ‘glUniform4fv’ was not declared in this scope; did you mean ‘glUniform2f’?
  436 |         glUniform4fv(0, 1, projPtr);
      |         ^~~~~~~~~~~~
      |         glUniform2f

It's surprising to me that the function glUniform2f exists but glUniform4fv does not.

When I print glGetString(GL_VERSION) I get 4.6.0 NVIDIA 515.86.01. From looking at OpenGL 4.6 documentation it seems that the function glUniform4fv should exist.

Any ideas why I can't use that function?

s-laine commented 1 year ago

You're running into the problem of loading OpenGL functions. All modern features and function calls in OpenGL are provided as dynamic extensions rather than part of the "base" library, and these function pointers must be queried via an OS-specific mechanism. There are libraries such as GLEW for doing all of this automatically, and it is strongly recommended to use one of those. Nvdiffrast used to use GLEW prior to v0.2.5, but it was replaced with a custom solution due to certain compatibility and portability issues in specific kinds of cluster environments.

So, against all advice, nvdiffrast currently imports the needed OpenGL functions manually using code in glutil.cpp, glutil.h and glutil_extlist.h, where the last one contains a list of functions to import depending on what the OS-provided GL/gl.h has already brought in. If you want to keep using this mechanism, you'll have to manage the list of extension functions and their prototypes in glutil_extlist.h as well as any missing OpenGL constants in glutil.h. However, I'd recommend moving over to something like GLEW in the long run.

nishadgothoskar commented 1 year ago

OK, this worked for me! I am getting very close to having the parallel rendering and scoring completed.

I'd like to be able to pass an array of pose matrices to the shader so that I can specify the pose of the object I want rendered in each of the images. I initially tried this with a plain uniform array, but realized that there are limits on the number of uniform components, and I couldn't pass the 1024 × 4 × 4 floats I needed to specify 1024 poses. Now I am thinking that I will need to give the shader access to these poses by putting them in a texture. Does that sound like the right approach?
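For what it's worth, one way to sketch the packing (a hypothetical layout, not taken from this thread): store each 4×4 matrix as four RGBA32F texels, one matrix per row of a 4-texel-wide floating-point texture, and reassemble the matrix in the shader with texelFetch. On the CPU side the packing is just a contiguous flatten:

```python
import numpy as np

def pack_poses(poses):
    """Pack (N, 4, 4) float32 pose matrices into flat RGBA32F texel data.

    Each matrix row becomes one RGBA texel, so pose i occupies the four
    texels of texture row i in a 4-wide, N-tall GL_RGBA32F texture.
    Upload with e.g.:
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, 4, N, 0,
                     GL_RGBA, GL_FLOAT, data)
    and fetch row j of pose i in GLSL as texelFetch(posesTex, ivec2(j, i), 0).
    (Hypothetical layout; any packing works as long as the shader-side
    indexing matches it.)
    """
    poses = np.ascontiguousarray(poses, dtype=np.float32)
    assert poses.ndim == 3 and poses.shape[1:] == (4, 4)
    return poses.ravel()

n = 1024
rng = np.random.default_rng(0)
poses = rng.standard_normal((n, 4, 4)).astype(np.float32)
texels = pack_poses(poses)
assert texels.shape == (n * 16,)
# Texel j of texture row i holds row j of pose i:
assert np.array_equal(texels.reshape(n, 4, 4)[7, 2], poses[7, 2])
```

A shader storage buffer (SSBO) holding an array of mat4 is another common way to get past the uniform-component limit, if your GL version supports it.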

s-laine commented 1 year ago

Yes, that sounds like a good approach.