NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

Unable to run Neuralangelo; NVML not supported #15

Closed: mitdave95 closed this issue 1 year ago

mitdave95 commented 1 year ago

I'm getting the error below when running the following command:

torchrun --nproc_per_node=1 train.py --logdir=logs/sample/toy_example --config=projects/neuralangelo/configs/custom/toy_example.yaml --show_pbar
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 46, in main
    set_affinity(args.local_rank)
  File "/data/imaginaire/utils/gpu_affinity.py", line 74, in set_affinity
    os.sched_setaffinity(0, dev.get_cpu_affinity())
  File "/data/imaginaire/utils/gpu_affinity.py", line 50, in get_cpu_affinity
    for j in pynvml.nvmlDeviceGetCpuAffinity(self.handle, Device._nvml_affinity_elements):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1745, in nvmlDeviceGetCpuAffinity
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 442) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_16:29:36
  host      : c7c816135a1c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 442)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Running on Windows 11 with an RTX 4090, from WSL Ubuntu 22.04.02, with the --gpus all flag.

chenhsuanlin commented 1 year ago

Hi @mitdave95, could you try commenting out this line? This is an optional function that sets the processor affinity. If this resolves your issue, I can push a hotfix. Thanks!
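For reference, a minimal sketch of how the affinity call could be guarded instead of removed, so it degrades gracefully on setups where NVML affinity queries are unsupported (such as WSL). This is an assumption about a possible workaround, not the actual imaginaire implementation or the eventual hotfix; the function name set_affinity_safe and the 64-bit word unpacking are illustrative only.

import os
import pynvml


def set_affinity_safe(local_rank):
    """Pin the process to the CPUs nearest its GPU, but skip silently when
    NVML does not support affinity queries (e.g. under WSL)."""
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(local_rank)
        # NVML returns the CPU affinity mask packed into 64-bit words.
        num_words = (os.cpu_count() + 63) // 64
        affinity_words = pynvml.nvmlDeviceGetCpuAffinity(handle, num_words)
        cpus = [
            word_idx * 64 + bit
            for word_idx, word in enumerate(affinity_words)
            for bit in range(64)
            if word >> bit & 1
        ]
        if cpus:
            os.sched_setaffinity(0, cpus)
    except pynvml.NVMLError:
        # Affinity queries are not supported on this platform; keep the
        # default scheduler affinity rather than crashing the training run.
        pass

With a guard like this, the call in train.py (and in extract_mesh.py) would simply become a no-op on platforms where NVML reports "Not Supported", rather than requiring users to comment it out.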

mitdave95 commented 1 year ago

@chenhsuanlin it worked, thanks! Also, the same line needs to be commented out in extract_mesh.py.

chenhsuanlin commented 1 year ago

Fixed in 3b1b95f! Please feel free to reopen if the issue persists.