I think this has to do with the TCNN_CUDA_ARCHITECTURES variable I used for building the image to use in AWS ECS. Will report back shortly.
Hello,
We use OptiX 7.3 to trace rays in nvdiffrecmc to evaluate shading. Is that supported on your GPU instance? If it is not, shading will be black (and/or the code could crash). There is more discussion here: https://github.com/NVlabs/nvdiffrecmc/issues/13. Our older code, nvdiffrec (https://github.com/NVlabs/nvdiffrec), does not require OptiX.
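If you want a quick sanity check, something like this (just a rough sketch, not code from the repo) prints the driver version and GPU name on the instance; OptiX 7.3 needs a fairly recent NVIDIA driver (R465 or newer, if I recall correctly):

import subprocess

# Print the NVIDIA driver version and GPU name to verify the instance has a
# driver recent enough for OptiX (these are standard nvidia-smi query flags).
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
    capture_output=True, text=True).stdout)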
The error you are seeing comes from nvdiffrast trying to set up an OpenGL context, which may not be supported on all server GPU instances. They recently released a CUDA backend (https://nvlabs.github.io/nvdiffrast/#rasterizing-with-cuda-vs-opengl-new), so you could try replacing this line (https://github.com/NVlabs/nvdiffrecmc/blob/main/train.py#L584):
glctx = dr.RasterizeGLContext() # Context for training
with
glctx = dr.RasterizeCudaContext()
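If it is more convenient, here is a small fallback sketch (my own suggestion, not code from the repo) that tries the OpenGL rasterizer first and falls back to the CUDA one on instances where no OpenGL context can be created:

import nvdiffrast.torch as dr

def make_rasterizer_context(prefer_opengl=True):
    # OpenGL context creation can fail on headless server GPUs; in that case
    # fall back to the CUDA rasterizer, which does not need a GL driver.
    if prefer_opengl:
        try:
            return dr.RasterizeGLContext()
        except Exception:
            pass
    return dr.RasterizeCudaContext()

glctx = make_rasterizer_context()  # Context for training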
Thanks for helping out here @jmunkberg! I've since been able to get bob rendered on an EC2 A10G instance, so the GPU itself seems fine running OptiX. However, I'm still struggling to get it running on an A10G instance on AWS Batch, which is strange. I'll try swapping out the OpenGL context for a CUDA one and report back.
Yay! Swapping to a CUDA context for rendering seems to have done the trick, thank you @jmunkberg. Still strange that it works on EC2 but not Batch; my guess is that it has something to do with how the Batch service launches the container.
Hmmm, looks like I spoke too soon. It's now running with the CUDA rendering context on an AWS Batch A10G instance, but it returns blank textures and a pretty broken mesh. I'm going to run the same dataset on an EC2 instance and see what happens, since my previous test there was just with bob. Perhaps my dataset is part of the issue. Will report back on what I find.
Okay, I can confirm that this works on an A10G instance on AWS EC2, but not on AWS Batch. No idea why this is, but I'll continue digging. Thank you again for your help on this @jmunkberg!
Hi there,
Your work is amazing! After success with nvdiffrec, I tried to get nvdiffrecmc running on an A10G (24 GB) on an AWS g5.2xlarge instance. As with nvdiffrec, I've been running nvdiffrecmc in the Docker container built with the Dockerfile supplied in the repo. Unfortunately, it fails with the following error:
The full log is as follows:
I initially thought it might be a VRAM issue, which is why the batch size is so low. However, after logging memory usage, that does not appear to be the problem. Memory usage before and after the crash:
Could this possibly be a compatibility issue with the A10G?
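For completeness, the memory logging I mentioned was along these lines (a rough sketch, not my exact code):

import torch

def log_gpu_memory(tag):
    # Report allocated and reserved CUDA memory in GiB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

log_gpu_memory("before")
# ... the training step that raises the error ...
log_gpu_memory("after")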
For additional context, I've also tried running nvdiffrecmc in the nvdiffrec Docker container that I've previously had success with. The only difference is the base image used: nvdiffrecmc uses nvcr.io/nvidia/pytorch:22.10-py3, while nvdiffrec uses nvcr.io/nvidia/pytorch:22.07-py3. When running nvdiffrecmc on the pytorch:22.07-py3 image, it doesn't give this error. However, the textures are all blank and the geometry is pretty broken, even at a batch size of 8: