Mesh extraction issue with outdoor scene

Ryan-ZL-Lin commented 11 months ago

Hi, after successfully visualizing the Lego example with a great-looking mesh, I decided to try an outdoor scene (SCENE_TYPE = outdoor) with more images. When running the mesh extraction command, I encountered an issue, and I'm not sure whether it's a GPU memory problem or not.

Here is the command I used:
torchrun --nproc_per_node=${GPUS} projects/neuralangelo/scripts/extract_mesh.py --config=${CONFIG} --checkpoint=${CHECKPOINT} --output_file=${OUTPUT_MESH} --resolution=${RESOLUTION} --block_res=${BLOCK_RES} --textured --keep_lcc

And here is the error log:

(Setting affinity with NVML failed, skipping...)
Running mesh extraction with 1 GPUs.
Setup trainer.
Using random seed 0
/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/tinycudann/modules.py:53: UserWarning: tinycudann was built for lower compute capability (86) than the system's (89). Performance may be suboptimal.
  warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Loading checkpoint (local): logs/MBC_group/MBC50_R1/epoch_00311_iteration_000500000_checkpoint.pt
- Loading the model...
Done with loading the checkpoint.
Extracting surface at resolution 1536 931 1323
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 7223) of binary: /home/ryan_lin/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
projects/neuralangelo/scripts/extract_mesh.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-23_10:43:10
  host      : RyanLegionPro7i.
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 7223)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 7223
=====================================================

I tried adjusting some of the parameters used in the command, such as RESOLUTION and BLOCK_RES, to see whether it made any difference. The only parameter set that succeeded was RESOLUTION=512 and BLOCK_RES=32, and the quality was extremely bad (the output PLY file is 90 MB, while the Lego example PLY file is 172 MB). Is there any way I could successfully extract the mesh with better output quality?
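
For reference, the `exitcode: -9` in the log above means the worker process was terminated by signal 9 (SIGKILL), as the traceback line also states. On Linux this frequently comes from the kernel's out-of-memory killer reacting to exhausted host RAM, whereas running out of GPU memory in PyTorch normally surfaces as a Python exception. That reading is an assumption here; the log itself does not say who sent the signal. A minimal sketch for decoding such exit codes (the helper name `describe_exit_code` is hypothetical):

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # A negative code means the process was terminated by a signal
        # whose number is the absolute value of the code.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


# Exit code reported in the error log above.
print(describe_exit_code(-9))
```

If the kernel's OOM killer was responsible, the system log (for example via `dmesg`) would usually contain a matching "Out of memory" entry for the same PID.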

chenhsuanlin commented 11 months ago

Hi @Ryan-ZL-Lin, you could set a higher RESOLUTION while keeping the same BLOCK_RES for the GPU memory budget.
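
To illustrate how the two parameters interact, here is a rough sketch. It assumes the extraction script evaluates the grid in cubic blocks of BLOCK_RES samples per side, which is consistent with the remark above that BLOCK_RES governs the GPU memory budget but is not confirmed in this thread. The function name `block_layout` is hypothetical, and the grid dimensions are taken from the log line "Extracting surface at resolution 1536 931 1323".

```python
import math


def block_layout(grid_dims, block_res):
    """Count blocks and per-block samples for a block-wise grid traversal.

    Assumption: the grid is split into cubic blocks of `block_res` samples
    per side, with partial blocks at the boundary rounded up.
    """
    blocks_per_axis = [math.ceil(d / block_res) for d in grid_dims]
    num_blocks = math.prod(blocks_per_axis)
    samples_per_block = block_res ** 3
    return blocks_per_axis, num_blocks, samples_per_block


grid_dims = (1536, 931, 1323)  # from the extraction log
for block_res in (32, 128):
    axes, num_blocks, samples = block_layout(grid_dims, block_res)
    print(f"block_res={block_res}: blocks per axis={axes}, "
          f"total blocks={num_blocks}, samples per block={samples}")
```

Under this assumption, the per-block sample count (and so the per-step GPU footprint) depends only on BLOCK_RES, while the number of blocks, and with it the total runtime, grows with RESOLUTION.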

Ryan-ZL-Lin commented 11 months ago

Thanks @chenhsuanlin. Is there any recommended range for RESOLUTION? For example, would any number from 2048 to 8192 work as long as it's a multiple of 2?

Ryan-ZL-Lin commented 11 months ago

@chenhsuanlin
I tried out your suggestion and set RESOLUTION=4096 and BLOCK_RES=32 to extract the surface for a 40-second video. Initially, the estimated time to complete was around 4 hours (~300 iterations per second), and it ran smoothly. However, after about 1 hour, the progress slowed down considerably. Here are the screenshots for your reference.

Issue: Although the surface extraction process didn't stop, the estimated time grew to 1120 hours.

I checked the GPU and VRAM utilization, and it turned out that neither was being utilized properly.

The progress then got worse: the estimated time to complete rose to 17272 hours...
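
One way to narrow down a slowdown like this, where the GPU sits mostly idle, is to watch host memory and swap while the extraction runs. The sketch below is only a diagnostic aid under the assumption that the machine runs Linux and exposes `/proc/meminfo`; it does not establish that host memory is the cause here. The function names `parse_meminfo` and `memory_summary` are hypothetical.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(path="/proc/meminfo"):
    """Return the available-memory and swap fields from the given meminfo file."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {key: info.get(key) for key in ("MemAvailable", "SwapTotal", "SwapFree")}


# Only attempt the read on systems that provide /proc/meminfo (Linux).
if os.path.exists("/proc/meminfo"):
    print(memory_summary())
```

If `MemAvailable` approaches zero and `SwapFree` keeps shrinking while the iteration rate falls, the process is likely spending its time swapping on the host rather than computing on the GPU.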

NVlabs / neuralangelo

Mesh extraction issue with outdoor scene #122