NVlabs / nvdiffrecmc

Official code for the NeurIPS 2022 paper "Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising".

Segmentation fault happened while calling DMTetGeometry #13

Open mush881212 opened 1 year ago

mush881212 commented 1 year ago

Hi Team,

Thanks for your amazing work! I tried to run the program, but I get a segmentation fault when calling DMTetGeometry(FLAGS.dmtet_grid, FLAGS.mesh_scale, FLAGS) in train.py. I tracked the error down to the call to ou.OptiXContext() in dmtet.py, which in turn appears to fail inside _plugin.OptiXStateWrapper(os.path.dirname(__file__), torch.utils.cpp_extension.CUDA_HOME) in ops.py, but I don't know how to fix it.

I tried reducing the batch size from 8 to 1 and the training resolution from 512x512 to 128x128, but the problem persists. Can you give some advice on how to solve this?

GPU Hardware: Nvidia A100 (32G) on a server

Console error:

```
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module optixutils_plugin...
Config / Flags:
iter 5000
batch 8
spp 1
layers 1
train_res [512, 512]
display_res [512, 512]
texture_res [1024, 1024]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/nerd_gold
config configs/nerd_gold.json
ref_mesh data/nerd/moldGoldCape_rescaled
base_mesh None
validate True
n_samples 12
bsdf pbr
denoiser bilateral
denoiser_demodulate True
mtl_override None
dmtet_grid 128
mesh_scale 2.5
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000.0
pre_load True
no_perturbed_nrm False
decorrelated False
kd_min [0.03, 0.03, 0.03]
kd_max [0.8, 0.8, 0.8]
ks_min [0, 0.08, 0.0]
ks_max [0, 1.0, 1.0]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.025
lambda_diffuse 0.15
lambda_specular 0.0025
random_textures True
DatasetLLFF: 119 images with shape [512, 512]
DatasetLLFF: auto-centering at [-0.04492672  1.3252479  1.1068335 ]
DatasetLLFF: 119 images with shape [512, 512]
DatasetLLFF: auto-centering at [-0.04492672  1.3252479  1.1068335 ]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Cuda path /usr/local/cuda
Segmentation fault (core dumped)
```
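For reference, here is a minimal repro sketch that isolates the crashing call, so it can be tested without loading any dataset or config. It assumes a checkout of the nvdiffrecmc repo as the working directory and the render.optixutils import path used by dmtet.py:

```python
# Minimal repro sketch (assumption: run from the nvdiffrecmc repo root,
# so that render.optixutils resolves as it does in dmtet.py).
import render.optixutils as ou

# This triggers the JIT build/load of optixutils_plugin and then
# _plugin.OptiXStateWrapper(...). On a driver without OptiX 7.3 support,
# the process segfaults here instead of raising a Python exception.
ctx = ou.OptiXContext()
print("OptiXContext created successfully")
```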

jmunkberg commented 1 year ago

Thanks @mush881212,

I suspect OptiX, which is quite sensitive to the driver installed on the machine. Our code uses OptiX 7.3, which requires an Nvidia display driver numbered 465 or higher. Perhaps verify that some standalone OptiX example from the OptiX 7.3 SDK runs fine on that machine.
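As a quick way to verify the driver requirement before digging further, here is a small sketch (not part of the repo; it assumes nvidia-smi is on the PATH) that compares the installed display driver against the 465 minimum for OptiX 7.3:

```python
import subprocess

# Ask the driver for its version via nvidia-smi.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
)
driver = out.strip().splitlines()[0]

# OptiX 7.3 requires an Nvidia display driver numbered 465 or higher.
major = int(driver.split(".")[0])
if major >= 465:
    print(f"Driver {driver}: new enough for OptiX 7.3")
else:
    print(f"Driver {driver}: too old, need >= 465")
```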

One alternative may be to use our old code base, https://github.com/NVlabs/nvdiffrec, which is a similar reconstruction pipeline, but without the OptiX ray tracing step.

jmunkberg commented 1 year ago

Note also that the A100 GPU does not have any RT Cores (for ray tracing acceleration), so the ray tracing performance will be lower than what we reported in the paper (we measured on an A6000 RTX GPU).

mush881212 commented 1 year ago

Hi @jmunkberg,

I think the problem is the driver: my version is too old to support OptiX 7.3. I will try another device and upgrade the driver. Thanks for your help!

Sheldonmao commented 1 year ago

Hi, has this problem been solved? I have the same issue with driver version 520.61.05 on a V100 and wonder how to fix it.

mush881212 commented 1 year ago

Hi @Sheldonmao,

I solved this issue by updating the driver and switching to an RTX 3090:

Driver version: 465.19.01
CUDA version: 11.3

You could try these settings.
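To compare an environment against these settings, here is a small sanity-check sketch using plain PyTorch calls (nothing specific to this repo):

```python
import torch

# Print the versions that mattered in this thread: the CUDA toolkit that
# PyTorch was built against and the visible GPU.
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible")
```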

sungmin9939 commented 5 months ago

Could a relatively new GPU or driver be the problem? My GPU is an RTX 3090 with driver version 535.146.02, and I'm getting the same segmentation fault as the original author.

jmunkberg commented 5 months ago

Newer GPUs and drivers shouldn't be an issue, I hope. It has been a while since we released this code, but I just tested on two setups without issues.

Setup 1: Windows desktop, RTX 6000 Ada Generation, driver 545.84, PyTorch 2.0.0+cu117

Setup 2: Linux server, V100, driver 515.86.01, using the Dockerfile from the nvdiffrecmc repo: https://github.com/NVlabs/nvdiffrecmc/blob/main/docker/Dockerfile