NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Abort during Running validation #4

Closed: hzhshok closed this issue 2 years ago

hzhshok commented 2 years ago

Hello, I got an error during the "Running validation" phase. The log is below; thanks to anyone who can help!

I have changed "batch" to 1 to avoid a memory error.

configs/nerf_chair.json:

```json
{
    "ref_mesh": "data/nerf_synthetic/chair",
    "random_textures": true,
    "iter": 5000,
    "save_interval": 100,
    "texture_res": [ 2048, 2048 ],
    "train_res": [800, 800],
    "batch": 1,
    "learning_rate": [0.03, 0.01],
    "ks_min": [0, 0.08, 0.0],
    "dmtet_grid": 128,
    "mesh_scale": 2.1,
    "laplace_scale": 3000,
    "display": [{"latlong": true}, {"bsdf": "kd"}, {"bsdf": "ks"}, {"bsdf": "normal"}],
    "background": "white",
    "out_dir": "nerf_chair"
}
```

Running command: `python3 train.py --config configs/nerf_chair.json`

Hardware: Quadro RTX 3000 (6 GB). System: Ubuntu 20.04, 64 GB RAM.

Error log:

```
Running validation
MSE,      PSNR
0.00113444, 29.651
Traceback (most recent call last):
  File "train.py", line 594, in <module>
    base_mesh = xatlas_uvmap(glctx, geometry, mat, FLAGS)
  File "/home/jinshui/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 115, in xatlas_uvmap
    mask, kd, ks, normal = render.render_uv(glctx, new_mesh, FLAGS.texture_res, eval_mesh.material['kd_ks_normal'])
  File "/home/jinshui/workshop/prj/3d/nvdiffrec/render/render.py", line 266, in render_uv
    rast, _ = dr.rasterize(ctx, uv_clip4, mesh.t_tex_idx.int(), resolution)
  File "/home/jinshui/.local/lib/python3.8/site-packages/nvdiffrast-0.2.7-py3.8.egg/nvdiffrast/torch/ops.py", line 250, in rasterize
    return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
  File "/home/jinshui/.local/lib/python3.8/site-packages/nvdiffrast-0.2.7-py3.8.egg/nvdiffrast/torch/ops.py", line 184, in forward
    out, out_db = _get_plugin().rasterize_fwd(glctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
RuntimeError: Cuda error: 801[cudaGraphicsGLRegisterImage(&s.cudaColorBuffer[i], s.glColorBuffer[i], GL_TEXTURE_3D, cudaGraphicsRegisterFlagsReadOnly);]
terminate called after throwing an instance of 'c10::Error'
  what():  Cuda error: 709[cudaGraphicsUnregisterResource(s.cudaColorBuffer[i]);]
Exception raised from rasterizeReleaseBuffers at /home/jinshui/.local/lib/python3.8/site-packages/nvdiffrast-0.2.7-py3.8.egg/nvdiffrast/common/rasterize.cpp:581 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f45a986f7d2 in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f45a986be6b in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: rasterizeReleaseBuffers(int, RasterizeGLState&) + 0x25e (0x7f44d8e610e9 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #3: RasterizeGLStateWrapper::~RasterizeGLStateWrapper() + 0x37 (0x7f44d8ebcf85 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #4: std::default_delete<RasterizeGLStateWrapper>::operator()(RasterizeGLStateWrapper*) const + 0x26 (0x7f44d8ea29a6 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #5: std::unique_ptr<RasterizeGLStateWrapper, std::default_delete<RasterizeGLStateWrapper> >::~unique_ptr() + 0x56 (0x7f44d8e97392 in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #6: <unknown function> + 0xc4b2c (0x7f44d8e90b2c in /home/jinshui/.cache/torch_extensions/py38_cu113/nvdiffrast_plugin/nvdiffrast_plugin.so)
frame #7: <unknown function> + 0x1f5b20 (0x7f45b871ab20 in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x1f6cce (0x7f45b871bcce in /home/jinshui/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: python3() [0x5d1ea8]
frame #10: python3() [0x5a958d]
frame #12: python3() [0x6aa1ba]
frame #13: python3() [0x4ef8d8]
frame #19: __libc_start_main + 0xf3 (0x7f45bfa250b3 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```

Regards
jmunkberg commented 2 years ago

Sorry to hear you are having issues. Perhaps try changing this line in the config from `"texture_res": [ 2048, 2048 ],` to `"texture_res": [ 512, 512 ],`.

I haven't tried on a GPU with 6 GB of memory, but the problem seems to be happening in render.render_uv, where we convert from volumetric textures to 2D textures by rasterizing in texture space. The resolution we rasterize at is determined by texture_res, so a lower value may work if it is a memory issue you are seeing.
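For intuition, here is a minimal sketch of what rasterizing in texture space looks like with nvdiffrast. It reuses the names from the traceback (`uv_clip4`, `t_tex_idx`) but is only an illustration under those assumptions, not the repo's exact render_uv code:

```python
# Hedged sketch of UV-space rasterization (illustration, not the repo's code).
import torch
import nvdiffrast.torch as dr

def rasterize_uv_space(glctx, uvs, t_tex_idx, texture_res):
    # Treat the UV layout itself as geometry: map UVs from [0, 1]
    # to clip space [-1, 1] and append z = 0, w = 1.
    uv_clip = uvs[None, ...] * 2.0 - 1.0
    uv_clip4 = torch.cat((uv_clip,
                          torch.zeros_like(uv_clip[..., 0:1]),
                          torch.ones_like(uv_clip[..., 0:1])), dim=-1)
    # Output buffers scale with texture_res (one pixel per texel),
    # which is why a 2048x2048 bake can exhaust a 6 GB GPU
    # while 512x512 may fit.
    rast, _ = dr.rasterize(glctx, uv_clip4, t_tex_idx.int(), texture_res)
    return rast
```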

Also, to verify, could you run the simpler examples, say bob.json or spot.json (with reduced batch size) without issues?

Unfortunately, I think 6 GB is not sufficient to get high-quality results from the current code base. We have only tested the code on GPUs with 12 GB or more.

hzhshok commented 2 years ago

Thanks @jmunkberg for your response. I am not sure what caused the abort once I got this feature to run, but my guess is that I did not update the torch/CUDA environment after downgrading to CUDA 11.3 (including cuDNN) to match torch.

I still have a memory issue with nerf_ship this time, but I can share some experience from getting the module to run.

a. The CUDA (and cuDNN) version must match the torch build, e.g. CUDA 11.3 with a torch wheel built for CUDA 11.3. This is probably why the abort happened; I may have forgotten to update the CUDA symlink and environment after the downgrade.
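A quick way to verify the match (standard PyTorch APIs, run in the same Python environment as training, then compare against `nvcc --version` / `nvidia-smi`):

```python
# Check that the installed torch build matches the system CUDA toolkit.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)        # e.g. '11.3'
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```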

b. GPU memory: I use a Quadro RTX 3000 (6 GB) and had to set batch to 1 to make the system run at all; of course, different GPUs (GeForce vs. Quadro RTX) behave differently. If that still does not solve the memory issue, jmunkberg's suggestion above is the option. A small check like the sketch below can help when tuning.
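This sketch uses standard torch.cuda APIs to measure peak memory for one iteration; the iteration itself is a placeholder. Note it only tracks PyTorch allocations, not nvdiffrast's OpenGL buffers:

```python
# Measure peak GPU memory for one iteration to guide batch-size tuning.
# Note: tracks only PyTorch's allocator, not nvdiffrast's OpenGL buffers.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training or validation iteration here (placeholder) ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory allocated: {peak_gib:.2f} GiB")
```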

Regards