NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Core dump occurs when training with fixed topology in pass 2 #17

Closed LZleejean closed 2 years ago

LZleejean commented 2 years ago

Thanks a lot for your contribution to the community.

I use the Docker environment and the GPU is a V100.

Running the script train.py, I get a "Floating point exception (core dumped)" error when training with fixed topology in pass 2.

jmunkberg commented 2 years ago

Hello @LZleejean ,

Can you provide some more information about this issue, like the full command line and the config you were trying to run?

LZleejean commented 2 years ago

The config is:

    iter 5000  batch 8  spp 1  layers 1
    train_res [1024, 1024]  display_res [1024, 1024]  texture_res [2048, 2048]
    display_interval 0  save_interval 1000  learning_rate [0.03, 0.03]
    min_roughness 0.08  custom_mip False  random_textures True
    background white  loss logl1
    out_dir out/hushoushuang  ref_mesh data/ours/hushoushuang_rescaled  base_mesh None
    validate False  mtl_override None  dmtet_grid 128  mesh_scale 2.5
    env_scale 1.0  envmap None
    display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
    camera_space_light False  lock_light False  lock_pos False
    sdf_regularizer 0.2  laplace relative  laplace_scale 10000.0
    pre_load True
    kd_min [0.03, 0.03, 0.03]  kd_max [0.8, 0.8, 0.8]
    ks_min [0, 0.08, 0.0]  ks_max [0, 1.0, 1.0]
    nrm_min [-1.0, -1.0, 0.0]  nrm_max [1.0, 1.0, 1.0]
    cam_near_far [0.1, 1000.0]  learn_light True
    local_rank 0  multi_gpu False

The error is:

    iter= 5000, img_loss=0.006174, reg_loss=0.014355, lr=0.00300, time=598.2 ms, rem=0.00 s
    Base mesh has 73626 triangles and 36699 vertices.
    Writing mesh: out/hushoushuang/dmtet_mesh/mesh.obj
        writing 36699 vertices
        writing 75832 texcoords
        writing 36699 normals
        writing 73626 faces
    Writing material: out/hushoushuang/dmtet_mesh/mesh.mtl
    Done exporting mesh
    start sencond optimization with fixed topology!
    Segmentation fault (core dumped)

jmunkberg commented 2 years ago

Thanks,

I suspect it may be a memory issue. Training at 1k x 1k with batch 8 is close to or above the limit of what a V100 GPU supports. In the NeRD examples, we used a resolution of 800 x 800 with batch size 8, and IIRC, that was close to the limit.
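
To double-check the memory theory, one generic way (standard NVIDIA tooling, not something specific to this repo) is to watch GPU memory in a second terminal while training runs:

    # refreshes the NVIDIA GPU status, including memory usage, every second
    watch -n 1 nvidia-smi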

Try reducing the batch size to four and the texture resolution to 1024 x 1024:

    "texture_res": [ 1024, 1024 ],
    "batch": 4,

Just to verify, do the small examples, say python train.py --config configs/bob.json, run on your machine?

LZleejean commented 2 years ago

The used GPU memory is only <10 GB (out of 32 GB total), and we tried configs/spot.json and got the same error.

Importantly, all experiments work when I set validate to true. I don't know what happened.
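
For anyone wanting to reproduce the workaround: assuming the flag is set in the JSON config like the other keys in the dump above, it would look like

    "validate": true,

though this only sidesteps the crash; it doesn't explain it.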

luojin commented 1 year ago

@LZleejean Hello, how did you solve the "Segmentation fault" error? I encountered the same problem. Thanks again.
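
A generic way to narrow down a crash like this (standard Python tooling, not something specific to nvdiffrec) is to enable Python's built-in faulthandler so the interpreter prints a traceback when it receives the fatal signal:

    # dumps the Python-level stack on SIGSEGV before the process dies
    python -X faulthandler train.py --config configs/spot.json

If the crash is inside native code, running under gdb (gdb --args python train.py --config configs/spot.json, then run, and bt after the crash) shows which C/CUDA frame faulted.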