RuntimeError: Cuda error on A10G 24Gb

constantm commented 1 year ago

Hi there,

Your work is amazing! After success with nvdiffrec, I tried to get nvdiffrecmc running on an A10G 24GB (AWS g5.2xlarge instance). As with nvdiffrec, I've been running nvdiffrecmc in the Docker container, built with the Dockerfile supplied in the repo. Unfortunately, it fails with the following error:

RuntimeError: Cuda error: 1[cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly);]

The full log is as follows:

Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu118/optixutils_plugin...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=optixutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/root/nvdiffrec/render/optixutils/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++14 -c /root/nvdiffrec/render/optixutils/c_src/denoising.cu -o denoising.cuda.o
[2/4] c++ -MMD -MF optix_wrapper.o.d -DTORCH_EXTENSION_NAME=optixutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/root/nvdiffrec/render/optixutils/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -DNVDR_TORCH -c /root/nvdiffrec/render/optixutils/c_src/optix_wrapper.cpp -o optix_wrapper.o
[3/4] c++ -MMD -MF torch_bindings.o.d -DTORCH_EXTENSION_NAME=optixutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/root/nvdiffrec/render/optixutils/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -DNVDR_TORCH -c /root/nvdiffrec/render/optixutils/c_src/torch_bindings.cpp -o torch_bindings.o
[4/4] c++ denoising.cuda.o optix_wrapper.o torch_bindings.o -shared -lcuda -lnvrtc -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o optixutils_plugin.so
Loading extension module optixutils_plugin...
Config / Flags:
---------
iter 1000
batch 2
spp 1
layers 1
train_res [800, 800]
display_res [800, 800]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/out
config /root/nvdiffrec/nerf_dataset/config.json
ref_mesh nerf_dataset
base_mesh None
validate False
n_samples 2
bsdf pbr
denoiser bilateral
denoiser_demodulate True
mtl_override None
dmtet_grid 128
mesh_scale 2.4
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display None
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000.0
pre_load True
no_perturbed_nrm False
decorrelated False
kd_min [0.0, 0.0, 0.0, 0.0]
kd_max [1.0, 1.0, 1.0, 1.0]
ks_min [0.0, 0.08, 0.0]
ks_max [0.0, 1.0, 1.0]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.0
lambda_diffuse 0.15
lambda_specular 0.0025
random_textures True
envmap data/irrmaps/aerodynamics_workshop_2k.hdr
---------
DatasetNERF: 100 images with shape [800, 800]
DatasetNERF: 100 images with shape [800, 800]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:484: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2984.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
End of OptiXStateWrapper
Encoder output: 32 dims
Using /root/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu118/renderutils_plugin/build.ninja...
Building extension module renderutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF common.o.d -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -DNVDR_TORCH -c /root/nvdiffrec/render/renderutils/c_src/common.cpp -o common.o
[2/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++14 -c /root/nvdiffrec/render/renderutils/c_src/mesh.cu -o mesh.cuda.o
[3/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++14 -c /root/nvdiffrec/render/renderutils/c_src/normal.cu -o normal.cuda.o
[4/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++14 -c /root/nvdiffrec/render/renderutils/c_src/loss.cu -o loss.cuda.o
[5/7] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -DNVDR_TORCH -std=c++14 -c /root/nvdiffrec/render/renderutils/c_src/bsdf.cu -o bsdf.cuda.o
[6/7] c++ -MMD -MF torch_bindings.o.d -DTORCH_EXTENSION_NAME=renderutils_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -DNVDR_TORCH -c /root/nvdiffrec/render/renderutils/c_src/torch_bindings.cpp -o torch_bindings.o
[7/7] c++ mesh.cuda.o loss.cuda.o bsdf.cuda.o normal.cuda.o common.o torch_bindings.o -shared -lcuda -lnvrtc -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o renderutils_plugin.so
Loading extension module renderutils_plugin...
Traceback (most recent call last):
  File "train.py", line 639, in <module>
    geometry, mat = optimize_mesh(denoiser, glctx, glctx_display, geometry, mat, lgt, dataset_train, dataset_validate, FLAGS, pass_idx=0, pass_name="dmtet_pass1",
  File "train.py", line 398, in optimize_mesh
    result_image, result_dict = validate_itr(glctx_display, prepare_batch(next(v_it), FLAGS.train_res, FLAGS.background),
  File "train.py", line 210, in validate_itr
    buffers = render.render_mesh(FLAGS, glctx, opt_mesh, target['mvp'], target['campos'], target['light'] if lgt is None else lgt, target['resolution'],
  File "/root/nvdiffrec/render/render.py", line 310, in render_mesh
    rast, rast_db = peeler.rasterize_next_layer()
  File "/opt/conda/lib/python3.8/site-packages/nvdiffrast/torch/ops.py", line 378, in rasterize_next_layer
    result = _rasterize_func.apply(self.raster_ctx, self.pos, self.tri, self.resolution, self.ranges, self.grad_db, self.peeling_idx)
  File "/opt/conda/lib/python3.8/site-packages/nvdiffrast/torch/ops.py", line 246, in forward
    out, out_db = _get_plugin(gl=True).rasterize_fwd_gl(raster_ctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 1[cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly);]
OptiXStateWrapper destructor

I initially thought it might be a VRAM issue, which is why the batch size is so low. However, logging GPU memory suggests this is not the issue. Memory usage before and after the crash:

56.17 W, 0 %, 21, 855 MiB, 21876 MiB
56.14 W, 0 %, 21, 855 MiB, 21876 MiB
56.53 W, 0 %, 21, 792 MiB, 21939 MiB
56.37 W, 0 %, 21, 0 MiB, 22731 MiB
56.40 W, 0 %, 21, 0 MiB, 22731 MiB
22.29 W, 0 %, 19, 0 MiB, 22731 MiB
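
A log like the one above can also be checked programmatically. This is only a sketch: it assumes the columns are power draw, GPU utilization, temperature, memory used, and memory free (the kind of row produced by `nvidia-smi --query-gpu=power.draw,utilization.gpu,temperature.gpu,memory.used,memory.free --format=csv,noheader`). The post does not state the column order, and `parse_smi_line` is a hypothetical helper name.

```python
# Sketch: parse nvidia-smi CSV rows and report peak memory use.
# ASSUMPTION: columns are power, utilization, temperature, memory used, memory free.

LOG = """\
56.17 W, 0 %, 21, 855 MiB, 21876 MiB
56.14 W, 0 %, 21, 855 MiB, 21876 MiB
56.53 W, 0 %, 21, 792 MiB, 21939 MiB
56.37 W, 0 %, 21, 0 MiB, 22731 MiB
56.40 W, 0 %, 21, 0 MiB, 22731 MiB
22.29 W, 0 %, 19, 0 MiB, 22731 MiB
"""


def parse_smi_line(line):
    """Return (used_mib, free_mib) from one CSV row (hypothetical helper)."""
    fields = [f.strip() for f in line.split(",")]
    used_mib = int(fields[3].split()[0])  # e.g. "855 MiB" -> 855
    free_mib = int(fields[4].split()[0])  # e.g. "21876 MiB" -> 21876
    return used_mib, free_mib


rows = [parse_smi_line(line) for line in LOG.splitlines() if line.strip()]
peak_used = max(used for used, _ in rows)
used_plus_free = rows[0][0] + rows[0][1]
print(f"peak used: {peak_used} MiB of {used_plus_free} MiB (used + free)")
```

Under that column assumption, the peak sits far below the card's capacity, which is consistent with the crash not being an out-of-memory condition.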

Could this possibly be a compatibility issue with the A10G?

For additional context, I've also tried running nvdiffrecmc in the nvdiffrec Docker container that I've previously had success with. The only difference is the base image: nvdiffrecmc uses nvcr.io/nvidia/pytorch:22.10-py3, while nvdiffrec uses nvcr.io/nvidia/pytorch:22.07-py3. When running nvdiffrecmc on the pytorch:22.07-py3 image, it doesn't give this error. However, the textures are all blank and the geometry is pretty broken, even with a batch size of 8.

constantm commented 1 year ago

I think this has to do with the TCNN_CUDA_ARCHITECTURES variable I used for building the image to use in AWS ECS. Will report back shortly.
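
A quick way to test that suspicion is to compare the value the image was built with against the GPU's compute capability. The sketch below assumes `TCNN_CUDA_ARCHITECTURES` holds a list of two-digit architectures separated by commas or semicolons; `parse_arch_list` and `arch_matches` are hypothetical helper names. The build log above compiles for `compute_86`, so (8, 6) is used as the example capability.

```python
import os


def parse_arch_list(value):
    """Split a TCNN_CUDA_ARCHITECTURES-style string into ints, e.g. "75;86" -> [75, 86].

    ASSUMPTION: entries are separated by ',' or ';'.
    """
    return [int(x) for x in value.replace(";", ",").split(",") if x.strip()]


def arch_matches(value, capability):
    """Check whether a (major, minor) compute capability is in the build list."""
    major, minor = capability
    return major * 10 + minor in parse_arch_list(value)


# On the target machine the capability could be read with
#   torch.cuda.get_device_capability(0)
# Here it is hard-coded to (8, 6) to match compute_86 in the build log.
built_for = os.environ.get("TCNN_CUDA_ARCHITECTURES", "")
print(f"TCNN_CUDA_ARCHITECTURES={built_for!r}")
print("covers compute_86:", arch_matches(built_for, (8, 6)))
```

If the second line prints `False` inside the container, the image was built for a different architecture than the one it runs on.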

jmunkberg commented 1 year ago

Hello,

We use OptiX 7.3 to trace rays in nvdiffrecmc to evaluate shading. Is that supported on your GPU instance? If it is not supported, shading will be black (and/or the code could crash). More discussion here: https://github.com/NVlabs/nvdiffrecmc/issues/13. Our old code, nvdiffrec (https://github.com/NVlabs/nvdiffrec), does not require OptiX.
The error you are seeing is nvdiffrast trying to set up an OpenGL context, which may not be supported on all server GPU instances. They recently released a CUDA backend (https://nvlabs.github.io/nvdiffrast/#rasterizing-with-cuda-vs-opengl-new), so you could try replacing this line, https://github.com/NVlabs/nvdiffrecmc/blob/main/train.py#L584:
glctx = dr.RasterizeGLContext() # Context for training
with
glctx = dr.RasterizeCudaContext()
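
To make the swap easy to toggle per environment, the choice of backend could be wrapped in a small helper. This is a minimal sketch, not code from the repository: `make_raster_context` and the `RASTER_BACKEND` environment variable are hypothetical names, and the nvdiffrast module is passed in as `dr` so the helper itself has no hard dependency on it. Only `dr.RasterizeGLContext()` and `dr.RasterizeCudaContext()` are taken from the suggestion above.

```python
import os


def make_raster_context(dr, backend=None):
    """Create an nvdiffrast rasterization context (hypothetical helper).

    dr      -- the nvdiffrast.torch module (import nvdiffrast.torch as dr)
    backend -- "cuda" or "gl"; if None, read the hypothetical RASTER_BACKEND
               environment variable and fall back to "cuda"
    """
    backend = (backend or os.environ.get("RASTER_BACKEND", "cuda")).lower()
    if backend == "gl":
        return dr.RasterizeGLContext()
    if backend == "cuda":
        return dr.RasterizeCudaContext()
    raise ValueError(f"unknown rasterizer backend: {backend!r}")


# Assumed usage in train.py:
#   import nvdiffrast.torch as dr
#   glctx = make_raster_context(dr)
```

Note that the traceback above fails inside `validate_itr`, which receives `glctx_display`, so any other place where a `RasterizeGLContext` is created would presumably need the same change.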

constantm commented 1 year ago

Thanks for helping out here @jmunkberg! I've since been able to get bob rendered on an EC2 A10G instance, which tells me the GPU is fine running OptiX. However, I'm still struggling to get it running on an A10G instance on AWS Batch, which is weird. I'll try swapping out the OpenGL context for a CUDA one and report back.

constantm commented 1 year ago

Yay! Swapping out to a CUDA context for rendering seems to have done the trick, thank you @jmunkberg. So strange that it works on EC2 but not Batch. My guess is that it has something to do with how the Batch service initiates the container.
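
One container-level setting that could differ between the two launch paths is `NVIDIA_DRIVER_CAPABILITIES`, which the NVIDIA Container Toolkit uses to decide which driver libraries are exposed inside the container; OpenGL needs the `graphics` capability (or `all`). Whether this is what differs on Batch is not confirmed anywhere in this thread, so the sketch below is only a diagnostic to run in both environments. `has_graphics_capability` is a hypothetical helper name.

```python
import os


def has_graphics_capability(value):
    """True if an NVIDIA_DRIVER_CAPABILITIES value requests "graphics" or "all"."""
    caps = {c.strip().lower() for c in value.split(",") if c.strip()}
    return "all" in caps or "graphics" in caps


# Run inside the container on both EC2 and Batch and compare the output.
value = os.environ.get("NVIDIA_DRIVER_CAPABILITIES", "")
print(f"NVIDIA_DRIVER_CAPABILITIES={value!r}")
print("graphics capability requested:", has_graphics_capability(value))
```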

constantm commented 1 year ago

Hmmm looks like I spoke too soon. It seems like it's now running with the CUDA rendering context on an AWS Batch A10G instance, but returning blank textures and a pretty broken mesh. I'm going to run the same dataset on an EC2 instance and see what happens, since my previous test on there was just with bob. Perhaps my dataset is part of the issue. Will report back on what I find.

constantm commented 1 year ago

Okay, I can confirm that this works on an A10G instance on AWS EC2, but not on AWS Batch. No idea why this is, but I will continue digging. Thank you again for your help on this @jmunkberg!
