NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

Mesh extraction issue with outdoor scene #122

Open Ryan-ZL-Lin opened 11 months ago

Ryan-ZL-Lin commented 11 months ago

Hi, after successfully visualizing the Lego example, which produced a great-looking mesh, I decided to try an outdoor scene (SCENE_TYPE = outdoor) with more images. When running the mesh extraction command, I encountered an issue, and I'm not sure whether it's a GPU memory problem.

Here is the command I used:

torchrun --nproc_per_node=${GPUS} projects/neuralangelo/scripts/extract_mesh.py --config=${CONFIG} --checkpoint=${CHECKPOINT} --output_file=${OUTPUT_MESH} --resolution=${RESOLUTION} --block_res=${BLOCK_RES} --textured --keep_lcc

And here is the error log:

(Setting affinity with NVML failed, skipping...)
Running mesh extraction with 1 GPUs.
Setup trainer.
Using random seed 0
/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/tinycudann/modules.py:53: UserWarning: tinycudann was built for lower compute capability (86) than the system's (89). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Loading checkpoint (local): logs/MBC_group/MBC50_R1/epoch_00311_iteration_000500000_checkpoint.pt
- Loading the model...
Done with loading the checkpoint.
Extracting surface at resolution 1536 931 1323
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 7223) of binary: /home/ryan_lin/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
projects/neuralangelo/scripts/extract_mesh.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-23_10:43:10
  host      : RyanLegionPro7i.
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 7223)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 7223
=====================================================

I tried adjusting parameters such as RESOLUTION and BLOCK_RES in the command to see whether it made any difference. The only set that succeeded was RESOLUTION=512 and BLOCK_RES=32, but the quality was extremely bad (the output PLY file is 90 MB, while the Lego example's PLY file is 172 MB). Is there any way I could extract a mesh with better quality?
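A note on reading the log above: exit code -9 means the worker was terminated by signal 9 (SIGKILL), which a process cannot catch or handle. On Linux this is commonly what the kernel's out-of-memory killer sends when host RAM runs out, whereas running out of GPU memory in PyTorch normally surfaces as a Python exception with a traceback from the worker. That interpretation is an assumption; the log itself does not say who sent the signal, so checking `dmesg` or the system journal for an OOM-killer entry would be the way to confirm it. A minimal sketch for decoding such exit codes (the helper name `describe_exitcode` is hypothetical, not part of torchrun or Neuralangelo):

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description.

    Negative codes follow the Python subprocess convention:
    -N means the child was terminated by signal N.
    """
    if exitcode < 0:
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {-exitcode} ({name})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with error status {exitcode}"


# Exit code reported by torchrun in the log above (Linux).
print(describe_exitcode(-9))
```

If the kernel log does show the process being killed for memory, the limit being hit is system RAM rather than VRAM.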

chenhsuanlin commented 11 months ago

Hi @Ryan-ZL-Lin, you could set a higher RESOLUTION while keeping the same BLOCK_RES for the GPU memory budget.
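To illustrate why the two parameters can be tuned independently, here is a rough sketch. It assumes the extractor tiles the voxel grid into cubic blocks of BLOCK_RES voxels per side and evaluates one block at a time, so the per-block work depends on BLOCK_RES while the number of blocks depends on RESOLUTION. That tiling model is an assumption for illustration and is not taken from extract_mesh.py; the function names are hypothetical.

```python
import math


def count_blocks(grid_shape, block_res):
    """Blocks per axis and in total when tiling a voxel grid with cubes."""
    per_axis = [math.ceil(n / block_res) for n in grid_shape]
    return per_axis, math.prod(per_axis)


def voxels_per_block(block_res):
    """Number of sample points evaluated in one cubic block."""
    return block_res ** 3


# Grid shape printed in the log above ("resolution 1536 931 1323").
per_axis, total = count_blocks((1536, 931, 1323), 32)
print(per_axis, total)        # [48, 30, 42] 60480
print(voxels_per_block(32))   # 32768

# Hypothetical cubic grid at RESOLUTION=4096 with the same block size:
# the per-block cost is unchanged, but there are far more blocks.
print(count_blocks((4096, 4096, 4096), 32)[1])  # 2097152
```

Under this model, raising RESOLUTION at a fixed BLOCK_RES mainly increases run time and the size of the final mesh, not the memory needed to evaluate a single block.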

Ryan-ZL-Lin commented 11 months ago

Thanks @chenhsuanlin. Is there a recommended range for RESOLUTION? For example, any number from 2048 to 8192, as long as it's a multiple of 2?

Ryan-ZL-Lin commented 11 months ago

@chenhsuanlin
I tried your suggestion and set RESOLUTION=4096 and BLOCK_RES=32 to extract the surface for a 40-second video. Initially, the estimated time to completion was around 4 hours (~300 iterations per second), and it ran smoothly. However, after about 1 hour, progress slowed down considerably. Here are the screenshots for your reference.

Issue: although the surface extraction process didn't stop, the estimated time rose to 1120 hours.

image

I checked GPU and VRAM utilization, and neither appeared to be properly utilized.

image

The progress then became even worse: the estimated time to completion changed to 17272 hours...

image
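For a sense of scale, the ETA figures quoted above can be turned back into an approximate throughput. This is a back-of-the-envelope sketch under two assumptions that the thread does not confirm: the progress bar computes ETA as remaining iterations divided by the current rate, and the first hour ran at the full ~300 iterations per second. The function names are hypothetical.

```python
def eta_hours(remaining_iters: float, iters_per_sec: float) -> float:
    """Estimated hours left at a constant processing rate."""
    return remaining_iters / iters_per_sec / 3600.0


def implied_rate(remaining_iters: float, eta_h: float) -> float:
    """Iterations per second implied by a reported ETA in hours."""
    return remaining_iters / (eta_h * 3600.0)


# Total work implied by "~4 hours at ~300 iterations per second".
total_iters = 300 * 4 * 3600
# Work assumed finished after the first hour at full speed.
done_iters = 300 * 1 * 3600
remaining = total_iters - done_iters

print(eta_hours(total_iters, 300))            # initial ETA in hours
print(round(implied_rate(remaining, 1120), 2))   # rate at the 1120-hour ETA
print(round(implied_rate(remaining, 17272), 2))  # rate at the 17272-hour ETA
```

Under these assumptions the reported ETAs correspond to roughly 0.8 and then 0.05 iterations per second, a slowdown of several hundred to several thousand times relative to the initial rate.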