NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Abort during Running validation #4

Closed: hzhshok closed this issue 2 years ago

hzhshok commented 2 years ago

Hello, I got an error during the "Running validation" phase. The log is below; thanks to anyone who can help!

I have changed "batch" to 1 to avoid a memory error.

configs/nerf_chair.json:

```json
{
    "ref_mesh": "data/nerf_synthetic/chair",
    "random_textures": true,
    "iter": 5000,
    "save_interval": 100,
    "texture_res": [ 2048, 2048 ],
    "train_res": [800, 800],
    "batch": 1,
    "learning_rate": [0.03, 0.01],
    "ks_min": [0, 0.08, 0.0],
    "dmtet_grid": 128,
    "mesh_scale": 2.1,
    "laplace_scale": 3000,
    "display": [{"latlong": true}, {"bsdf": "kd"}, {"bsdf": "ks"}, {"bsdf": "normal"}],
    "background": "white",
    "out_dir": "nerf_chair"
}
```

Running command: `python3 train.py --config configs/nerf_chair.json`

Hardware: Quadro RTX 3000 (6 GB). System: Ubuntu 20.04, 64 GB RAM.

Error log:

```
Running validation
MSE,      PSNR
0.00113444, 29.651
Traceback (most recent call last):
  File "train.py", line 594, in <module>
    base_mesh = xatlas_uvmap(glctx, geometry, mat, FLAGS)
  File "/home/jinshui/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 115, in xatlas_uvmap
    mask, kd, ks, normal = render.render_uv(glctx, new_mesh, FLAGS.texture_res, eval_mesh.material['kd_ks_normal'])
  File "/home/jinshui/workshop/prj/3d/nvdiffrec/render/render.py", line 266, in render_uv
    rast, _ = dr.rasterize(ctx, uv_clip4, mesh.t_tex_idx.int(), resolution)
  File "/home/jinshui/.local/lib/python3.8/site-packages/nvdiffrast-0.2.7-py3.8.egg/nvdiffrast/torch/ops.py", line 250, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/jinshui/.local/lib/python3.8/site-packages/nvdiffrast-0.2.7-py3.8.egg/nvdiffrast/torch/ops.py", line 184, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 801[cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly);]
terminate called after throwing an instance of 'c10::Error'
  what():  Cuda error: 709[cudaGraphicsUnregisterResource(s.cudaColorBuffer[i]);]
Exception raised from rasterizeReleaseBuffers at /home/jinshui/.local/lib/python3.8/site-packages/nvdiffrast-0.2.7-py3.8.egg/nvdiffrast/common/rasterize.cpp:581 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f45a986f7d2 in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f45a986be6b in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: rasterizeReleaseBuffers(int, RasterizeGLState&) + 0x25e (0x7f44d8e610e9 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #3: RasterizeGLStateWrapper::~RasterizeGLStateWrapper() + 0x37 (0x7f44d8ebcf85 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #4: std::default_delete<RasterizeGLStateWrapper>::operator()(RasterizeGLStateWrapper*) const + 0x26 (0x7f44d8ea29a6 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #5: std::unique_ptr<RasterizeGLStateWrapper, std::default_delete<RasterizeGLStateWrapper> >::~unique_ptr() + 0x56 (0x7f44d8e97392 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #6: <unknown function> + 0xc4b2c (0x7f44d8e90b2c in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #7: <unknown function> + 0x1f5b20 (0x7f45b871ab20 in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x1f6cce (0x7f45b871bcce in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: python3() [0x5d1ea8]
frame #10: python3() [0x5a958d]
frame #12: python3() [0x6aa1ba]
frame #13: python3() [0x4ef8d8]
frame #19: __libc_start_main + 0xf3 (0x7f45bfa250b3 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```

Regards
jmunkberg commented 2 years ago

Sorry to hear you are having issues. Perhaps try changing this line in the config from `"texture_res": [ 2048, 2048 ],` to `"texture_res": [ 512, 512 ],`.

I haven't tried on a GPU with 6 GB of memory, but the problem seems to be happening in render.render_uv, where we convert from volumetric textures to 2D textures by rasterizing in texture space. The resolution we rasterize at is determined by texture_res, so a lower value may work if it is a memory issue you are seeing.
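For intuition, here is a minimal sketch of what rasterizing in texture space looks like with nvdiffrast. It reuses the names from the traceback (`uv_clip4`, `t_tex_idx`) but is only an illustration under those assumptions, not the repo's exact render_uv code:

```python
# Hedged sketch of UV-space rasterization (illustration, not the repo's code).
import torch
import nvdiffrast.torch as dr

def rasterize_uv_space(glctx, uvs, t_tex_idx, texture_res):
    # Treat the UV layout itself as geometry: map UVs from [0, 1]
    # to clip space [-1, 1] and append z = 0, w = 1.
    uv_clip = uvs[None, ...] * 2.0 - 1.0
    uv_clip4 = torch.cat((uv_clip,
                          torch.zeros_like(uv_clip[..., 0:1]),
                          torch.ones_like(uv_clip[..., 0:1])), dim=-1)
    # Output buffers scale with texture_res (one pixel per texel),
    # which is why a 2048x2048 bake can exhaust a 6 GB GPU
    # while 512x512 may fit.
    rast, _ = dr.rasterize(glctx, uv_clip4, t_tex_idx.int(), texture_res)
    return rast
```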

Also, to verify, could you run the simpler examples, say bob.json or spot.json (with reduced batch size) without issues?

Unfortunately, I think 6 GB is not sufficient to get high-quality results from the current code base. We have only tested the code on GPUs with 12 GB or more.

hzhshok commented 2 years ago

Thanks @jmunkberg for your response. I am not sure what caused the abort once I got this feature to run, but my guess is that I did not update the torch/CUDA environment after downgrading to CUDA 11.3 (including cuDNN) to match torch.

I still have a memory issue with nerf_ship this time, but I can share some experience from getting the module to run.

a. The CUDA (and cuDNN) version must match the torch build, e.g. CUDA 11.3 with a torch wheel built for CUDA 11.3. This is probably why the abort happened; I may have forgotten to update the CUDA symlink and environment after the downgrade.
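A quick way to verify the match (standard PyTorch APIs, run in the same Python environment as training, then compare against `nvcc --version` / `nvidia-smi`):

```python
# Check that the installed torch build matches the system CUDA toolkit.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)        # e.g. '11.3'
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```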

b. GPU memory: I use a Quadro RTX 3000 (6 GB) and had to set batch to 1 to make the system run at all; of course, different GPUs (GeForce vs. Quadro RTX) behave differently. If that still does not solve the memory issue, jmunkberg's suggestion above is the option. A small check like the sketch below can help when tuning.
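This sketch uses standard torch.cuda APIs to measure peak memory for one iteration; the iteration itself is a placeholder. Note it only tracks PyTorch allocations, not nvdiffrast's OpenGL buffers:

```python
# Measure peak GPU memory for one iteration to guide batch-size tuning.
# Note: tracks only PyTorch's allocator, not nvdiffrast's OpenGL buffers.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training or validation iteration here (placeholder) ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory allocated: {peak_gib:.2f} GiB")
```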

Regards