gengshan-y / viser

ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.
https://viser-shape.github.io/
Apache License 2.0

Failure during optimization #5

Closed · ecmjohnson closed this issue 2 years ago

ecmjohnson commented 2 years ago

Hello, I'm trying to run ViSER on some of my own datasets. Out of my 5 datasets, 2 succeed and 3 fail, all with the same failure case:

> /HPS/articulated_nerf/work/viser/nnutils/mesh_net.py(809)forward()
-> self.match_loss = (csm_pred - csm_gt).norm(2,1)[mask].mean() * 0.1
(Pdb) 
Traceback (most recent call last):
  File "optimize.py", line 59, in <module>
    app.run(main)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "optimize.py", line 56, in main
    trainer.train()
  File "/HPS/articulated_nerf/work/viser/nnutils/train_utils.py", line 339, in train
    total_loss,aux_output = self.model(input_batch)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/HPS/articulated_nerf/work/viser/nnutils/mesh_net.py", line 809, in forward
    self.match_loss = (csm_pred - csm_gt).norm(2,1)[mask].mean() * 0.1
  File "/HPS/articulated_nerf/work/viser/nnutils/mesh_net.py", line 809, in forward
    self.match_loss = (csm_pred - csm_gt).norm(2,1)[mask].mean() * 0.1
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
Traceback (most recent call last):
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/HPS/articulated_nerf/work/miniconda3/envs/viser/bin/python', '-u', 'optimize.py', '--local_rank=0', '--name=cactus_full-1003-0', '--checkpoint_dir', 'log', '--n_bones', '21', '--num_epochs', '20', '--dataname', 'cactus_full-init', '--ngpu', '1', '--batch_size', '4', '--seed', '1003']' returned non-zero exit status 1.
Killing subprocess 6097

Full error log 1 Full error log 2 Full error log 3

I would tend to assume this is a division by zero in the identified line. Have you encountered this issue before?
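
For reference, here is a minimal sketch of what I suspect is happening at that line (purely hypothetical shapes; the point is that if the boolean mask selects no pixels, the mean reduces over an empty tensor and comes out as NaN, effectively a 0/0):

```python
import torch

# Hypothetical shapes, just to illustrate the suspected failure mode:
# when the boolean mask selects no elements, .mean() reduces over an
# empty tensor and returns NaN.
csm_pred = torch.randn(4, 3, 1024)
csm_gt = torch.randn(4, 3, 1024)
mask = torch.zeros(4, 1024, dtype=torch.bool)  # no valid pixels

match_loss = (csm_pred - csm_gt).norm(2, 1)[mask].mean() * 0.1
print(match_loss)  # tensor(nan)
```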

I have tried multiple values of init_frame and end_frame for the initial optimization on a subset of frames (which is where the failure occurs), as well as different seed values. I haven't found any choice of these parameters that lets these datasets avoid the failure.

Any help or insight you can provide would be appreciated.

gengshan-y commented 2 years ago

Hello, this seems to be an initialization issue. The rendered mask might not overlap with the observed mask when the principal point is not initialized properly.
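
As a quick sanity check (a hypothetical snippet, not code from this repo), you can verify whether the rendered silhouette overlaps the observed one at all:

```python
import torch

def silhouette_overlap(rendered: torch.Tensor, observed: torch.Tensor) -> float:
    """Hypothetical helper: fraction of observed-mask pixels also covered by
    the rendered mask (0.0 means the two silhouettes are disjoint)."""
    rendered = rendered.bool()
    observed = observed.bool()
    inter = (rendered & observed).float().sum()
    return (inter / observed.float().sum().clamp(min=1)).item()
```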

Does this solve the problem? https://github.com/gengshan-y/viser-release/issues/4#issuecomment-1064636328

ecmjohnson commented 2 years ago

Ah, let me clarify my understanding: the ppx and ppy pixel coordinates are not necessarily the principal point of the camera projection (i.e. typically half the width and half the height, respectively), and I should instead adjust them to be centered on the object in the start_idx frame of the init optimization. Is that correct?

I had already set ppx and ppy to half the width and half the height, respectively, for my datasets, but it is possible that this point did not overlap the masks in the datasets that failed.
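
If so, I suppose I could initialize ppx and ppy from the centroid of the object mask in the start_idx frame instead; a rough sketch with a placeholder path (not code from the repo):

```python
import cv2
import numpy as np

# Placeholder path: binary segmentation mask of the start_idx frame.
mask = cv2.imread("path/to/start_idx_mask.png", cv2.IMREAD_GRAYSCALE)
ys, xs = np.nonzero(mask)
ppx, ppy = float(xs.mean()), float(ys.mean())  # object centroid in pixels
print(ppx, ppy)  # use these as the initial principal point in the config
```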

gengshan-y commented 2 years ago

Your understanding is correct. Let me give a bit more explanation in case you are interested: ppx and ppy are supposed to be the principal point of the camera, but if we initialize them to the correct values, the renderings may not overlap the observed masks due to the inaccurate initial root translation estimate, which is what causes the problem.

I would suggest using the following to avoid tedious manual initialization of ppx, ppy:

Besides passing principal points in the config file, another option is to pass --cnnpp to optimize.py, which optimizes an image CNN to predict the principal points. In this case, there is a mechanism here to ensure the rendered silhouette and the ground-truth mask overlap.
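
For example, appending the flag to the command shown in your log (everything else unchanged):

```
python -u optimize.py --local_rank=0 --name=cactus_full-1003-0 --checkpoint_dir log \
  --n_bones 21 --num_epochs 20 --dataname cactus_full-init --ngpu 1 --batch_size 4 \
  --seed 1003 --cnnpp
```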

ecmjohnson commented 2 years ago

Ah, excellent! That solves the failure during optimization.

Thanks!