Hello,
I am running the code on an A10 GPU. Training for Neuralangelo completes fine, but when I run the mesh extraction I keep getting the error below.
(FYI: I tried with resolutions of 1028 and 2048, but got the same error both times.)
- Loading the model...
Done with loading the checkpoint.
Extracting surface at resolution 1035 1522 2048
0%| | 0/1728 [00:00<?, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/usr/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2913) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "projects/neuralangelo/scripts/extract_mesh.py", line 106, in <module>
main()
File "projects/neuralangelo/scripts/extract_mesh.py", line 88, in main
mesh = extract_mesh(sdf_func=sdf_func, bounds=bounds,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/neuralangelo/projects/neuralangelo/utils/mesh.py", line 31, in extract_mesh
for it, data in enumerate(data_loader):
File "/usr/local/lib/python3.8/dist-packages/tqdm/std.py", line 1182, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 2913) exited unexpectedly
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2596) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
projects/neuralangelo/scripts/extract_mesh.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-22_17:34:17
host : 8c8b262fd3ca
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2596)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I looked at the OOM section of the FAQ, but I don't see how to apply it to the mesh extraction step, since the error complains about shared memory rather than CUDA memory. I have a feeling this is related to the Docker container; I looked up ways to resolve it, but I am not sure which route to take. Is there a recommended fix for this?
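From what I found, the two routes seem to be either restarting the Docker container with a larger shared-memory allocation (e.g. the --shm-size or --ipc=host flags on docker run), or avoiding /dev/shm entirely by running the DataLoader without worker subprocesses. Here is a minimal sketch of what I mean by the second option; it is only illustrative, assuming the loader built in projects/neuralangelo/utils/mesh.py is a standard torch.utils.data.DataLoader (the dataset below is a dummy one):

```python
import shutil

import torch
from torch.utils.data import DataLoader, TensorDataset

# Sanity check: DataLoader worker processes exchange tensors through /dev/shm,
# and Docker's default allocation is only 64 MB, which would match the bus
# errors above. Print what the container actually has available.
total, _, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB total")

# Illustrative loader over dummy data: with num_workers=0 all batches are
# produced in the main process, so /dev/shm is never touched. This would be
# the in-code workaround if the container cannot be restarted with a larger
# --shm-size (or with --ipc=host).
dataset = TensorDataset(torch.arange(1024).float())
loader = DataLoader(dataset, batch_size=64, num_workers=0)
for (batch,) in loader:
    pass  # iterate once just to confirm the loader runs without worker processes
```

Would the num_workers=0 route noticeably slow down extraction, or is raising the container's shared-memory size the recommended fix?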