NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Core dump occurs when training with fixed topology in pass 2 #17

Closed LZleejean closed 2 years ago

LZleejean commented 2 years ago

Thanks a lot for your contribution to the community.

I use the Docker environment and the GPU is a V100.

Running the script train.py, I get a "Floating point exception (core dumped)" error when training with fixed topology in pass 2.

jmunkberg commented 2 years ago

Hello @LZleejean ,

Can you provide some more information about this issue, like the full command line and the config you were trying to run?

LZleejean commented 2 years ago

The config is:

    iter 5000  batch 8  spp 1  layers 1
    train_res [1024, 1024]  display_res [1024, 1024]  texture_res [2048, 2048]
    display_interval 0  save_interval 1000  learning_rate [0.03, 0.03]
    min_roughness 0.08  custom_mip False  random_textures True
    background white  loss logl1
    out_dir out/hushoushuang  ref_mesh data/ours/hushoushuang_rescaled  base_mesh None
    validate False  mtl_override None  dmtet_grid 128  mesh_scale 2.5
    env_scale 1.0  envmap None
    display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
    camera_space_light False  lock_light False  lock_pos False
    sdf_regularizer 0.2  laplace relative  laplace_scale 10000.0
    pre_load True
    kd_min [0.03, 0.03, 0.03]  kd_max [0.8, 0.8, 0.8]
    ks_min [0, 0.08, 0.0]  ks_max [0, 1.0, 1.0]
    nrm_min [-1.0, -1.0, 0.0]  nrm_max [1.0, 1.0, 1.0]
    cam_near_far [0.1, 1000.0]  learn_light True
    local_rank 0  multi_gpu False

The error is:

    iter= 5000, img_loss=0.006174, reg_loss=0.014355, lr=0.00300, time=598.2 ms, rem=0.00 s
    Base mesh has 73626 triangles and 36699 vertices.
    Writing mesh: out/hushoushuang/dmtet_mesh/mesh.obj
        writing 36699 vertices
        writing 75832 texcoords
        writing 36699 normals
        writing 73626 faces
    Writing material: out/hushoushuang/dmtet_mesh/mesh.mtl
    Done exporting mesh
    start sencond optimization with fixed topology!
    Segmentation fault (core dumped)

jmunkberg commented 2 years ago

Thanks,

I suspect it may be a memory issue. Training at 1k x 1k with batch 8 is close to or above the limit of what a V100 GPU supports. In the NeRD examples, we used a resolution of 800 x 800 with batch size 8, and IIRC, that was close to the limit.
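
To double-check the memory theory, one generic way (standard NVIDIA tooling, not something specific to this repo) is to watch GPU memory in a second terminal while training runs:

    # refreshes the NVIDIA GPU status, including memory usage, every second
    watch -n 1 nvidia-smi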

Try reducing the batch size to four and the texture resolution to 1024 x 1024:

    "texture_res": [ 1024, 1024 ],
    "batch": 4,

Just to verify, do the small examples, say python train.py --config configs/bob.json, run on your machine?

LZleejean commented 2 years ago

The used GPU memory is only <10 GB (out of 32 GB total), and we tried configs/spot.json and got the same error.

Importantly, all experiments work when I set validate to true. I don't know what happened.
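
For anyone wanting to reproduce the workaround: assuming the flag is set in the JSON config like the other keys in the dump above, it would look like

    "validate": true,

though this only sidesteps the crash; it doesn't explain it.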

luojin commented 1 year ago

@LZleejean Hello, how did you solve the "Segmentation fault" error? I encountered the same problem. Thanks again.
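
A generic way to narrow down a crash like this (standard Python tooling, not something specific to nvdiffrec) is to enable Python's built-in faulthandler so the interpreter prints a traceback when it receives the fatal signal:

    # dumps the Python-level stack on SIGSEGV before the process dies
    python -X faulthandler train.py --config configs/spot.json

If the crash is inside native code, running under gdb (gdb --args python train.py --config configs/spot.json, then run, and bt after the crash) shows which C/CUDA frame faulted.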