Hi, from the error:
error: #error The version of CUB in your include path is not compatible with this release of Thrust. CUB is now included in the CUDA Toolkit, so you no longer need to use your own checkout of CUB. Define THRUST_IGNORE_CUB_VERSION_CHECK to ignore this.
Maybe you could run it like:
THRUST_IGNORE_CUB_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --nnodes=1 --node_rank=0 training/exp_runner.py --conf confs/dtu_mlp_3views.conf --scan_id 65
Thanks very much! Sadly, I still got the error message shown below:
(nf22) rayne@phil-OMEN-by-HP-45L-Gaming-Desktop-GT22-0xxx:~/code/monosdf/code$ THRUST_IGNORE_CUB_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --nnodes=1 --node_rank=0 training/exp_runner.py --conf confs/dtu_mlp_3views.conf --scan_id 65
/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
RANK and WORLD_SIZE in environ: 0/1
opt.local_rank 0
shell command : training/exp_runner.py --local_rank=0 --conf confs/dtu_mlp_3views.conf --scan_id 65
Loading data ...
Finish loading data. Data-set size: 49
Detected CUDA files, patching ldflags
Emitting ninja build file ./tmp_build/build.ninja...
Building extension module _hash_encoder...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] :/usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/THC -isystem :/usr/local/cuda-11.3/include -isystem /home/rayne/anaconda3/envs/nf22/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -std=c++14 -allow-unsupported-compiler -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -c /home/rayne/code/monosdf/code/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o
FAILED: hashencoder.cuda.o
:/usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/THC -isystem :/usr/local/cuda-11.3/include -isystem /home/rayne/anaconda3/envs/nf22/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -std=c++14 -allow-unsupported-compiler -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -c /home/rayne/code/monosdf/code/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o
/bin/sh: 1: :/usr/local/cuda-11.3/bin/nvcc: not found
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
subprocess.run(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "training/exp_runner.py", line 58, in <module>
trainrunner = MonoSDFTrainRunner(conf=opt.conf,
File "/home/rayne/code/monosdf/code/../code/training/monosdf_train.py", line 107, in __init__
self.model = utils.get_class(self.conf.get_string('train.model_class'))(conf=conf_model)
File "/home/rayne/code/monosdf/code/../code/utils/general.py", line 17, in get_class
m = __import__(module)
File "/home/rayne/code/monosdf/code/../code/model/network.py", line 140, in <module>
from hashencoder.hashgrid import _hash_encode, HashEncoder
File "/home/rayne/code/monosdf/code/../code/hashencoder/__init__.py", line 1, in <module>
from .hashgrid import HashEncoder
File "/home/rayne/code/monosdf/code/../code/hashencoder/hashgrid.py", line 12, in <module>
from .backend import _backend
File "/home/rayne/code/monosdf/code/../code/hashencoder/backend.py", line 10, in <module>
_backend = load(name='_hash_encoder',
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1144, in load
return _jit_compile(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension '_hash_encoder'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1367592) of binary: /home/rayne/anaconda3/envs/nf22/bin/python
Traceback (most recent call last):
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training/exp_runner.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-11-20_16:44:50
host : phil-OMEN-by-HP-45L-Gaming-Desktop-GT22-0xxx
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1367592)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Maybe I should try a different way to "Define THRUST_IGNORE_CUB_VERSION_CHECK"?
Could you remove the temp folder tmp_build and try again?
I removed tmp_build under hashencoder and ran:
THRUST_IGNORE_CUB_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --nnodes=1 --node_rank=0 training/exp_runner.py --conf confs/dtu_mlp_3views.conf --scan_id 65
The error remains the same😟
I think you could try to modify it here: https://github.com/autonomousvision/monosdf/blob/main/code/hashencoder/backend.py#L14
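For reference, here is roughly what that part of backend.py looks like (just a sketch; the exact flag list and paths in the repo may differ). The extension is JIT-built with torch.utils.cpp_extension.load, and extra nvcc options go into the extra_cuda_cflags list:

# Sketch only -- hashencoder/backend.py in the repo may differ in details.
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

nvcc_flags = [
    '-O3', '-std=c++14', '-allow-unsupported-compiler',
    '-U__CUDA_NO_HALF_OPERATORS__',
    '-U__CUDA_NO_HALF_CONVERSIONS__',
    '-U__CUDA_NO_HALF2_OPERATORS__',
    # extra nvcc options (e.g. the Thrust/CUB check define) can be appended here
]

_backend = load(
    name='_hash_encoder',
    sources=[os.path.join(_src_path, 'src', f) for f in ['hashencoder.cu', 'bindings.cpp']],
    extra_cflags=['-O3', '-std=c++14'],
    extra_cuda_cflags=nvcc_flags,      # forwarded verbatim to nvcc
    build_directory='./tmp_build',     # matches the build directory in your log
)

Anything added to nvcc_flags is passed directly to nvcc the next time the extension is rebuilt (delete tmp_build first so it actually recompiles).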
I'm wondering whether it's due to the PyTorch version.
The difference between my installation and the README is that if I use conda install pytorch torchvision cudatoolkit=11.3 -c pytorch, it installs a CPU-only build of PyTorch:
# Name Version Build Channel
pytorch 1.13.0 py3.8_cpu_0 pytorch
So when installing, I used pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113 instead.
There is no other difference between my installation and the README.
I think you could try to modify it here: https://github.com/autonomousvision/monosdf/blob/main/code/hashencoder/backend.py#L14
After adding THRUST_IGNORE_CUB_VERSION_CHECK=1 at the place you mentioned, the error changed😟 (I have deleted tmp_build). The error is:
(nf22) rayne@phil-OMEN-by-HP-45L-Gaming-Desktop-GT22-0xxx:~/code/monosdf/code$ python -m torch.distributed.launch --nproc_per_node 1 --nnodes=1 --node_rank=0 training/exp_runner.py --conf confs/dtu_mlp_3views.conf --scan_id 65
/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
RANK and WORLD_SIZE in environ: 0/1
opt.local_rank 0
shell command : training/exp_runner.py --local_rank=0 --conf confs/dtu_mlp_3views.conf --scan_id 65
Loading data ...
Finish loading data. Data-set size: 49
Detected CUDA files, patching ldflags
Emitting ninja build file ./tmp_build/build.ninja...
Building extension module _hash_encoder...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] :/usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/THC -isystem :/usr/local/cuda-11.3/include -isystem /home/rayne/anaconda3/envs/nf22/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -std=c++14 -allow-unsupported-compiler -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ THRUST_IGNORE_CUB_VERSION_CHECK=1 -c /home/rayne/code/monosdf/code/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o
FAILED: hashencoder.cuda.o
:/usr/local/cuda-11.3/bin/nvcc -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/THC -isystem :/usr/local/cuda-11.3/include -isystem /home/rayne/anaconda3/envs/nf22/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -std=c++14 -allow-unsupported-compiler -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ THRUST_IGNORE_CUB_VERSION_CHECK=1 -c /home/rayne/code/monosdf/code/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o
/bin/sh: 1: :/usr/local/cuda-11.3/bin/nvcc: not found
[2/3] c++ -MMD -MF bindings.o.d -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/include/THC -isystem :/usr/local/cuda-11.3/include -isystem /home/rayne/anaconda3/envs/nf22/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -c /home/rayne/code/monosdf/code/hashencoder/src/bindings.cpp -o bindings.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
subprocess.run(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "training/exp_runner.py", line 58, in <module>
trainrunner = MonoSDFTrainRunner(conf=opt.conf,
File "/home/rayne/code/monosdf/code/../code/training/monosdf_train.py", line 107, in __init__
self.model = utils.get_class(self.conf.get_string('train.model_class'))(conf=conf_model)
File "/home/rayne/code/monosdf/code/../code/utils/general.py", line 17, in get_class
m = __import__(module)
File "/home/rayne/code/monosdf/code/../code/model/network.py", line 140, in <module>
from hashencoder.hashgrid import _hash_encode, HashEncoder
File "/home/rayne/code/monosdf/code/../code/hashencoder/__init__.py", line 1, in <module>
from .hashgrid import HashEncoder
File "/home/rayne/code/monosdf/code/../code/hashencoder/hashgrid.py", line 12, in <module>
from .backend import _backend
File "/home/rayne/code/monosdf/code/../code/hashencoder/backend.py", line 10, in <module>
_backend = load(name='_hash_encoder',
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1144, in load
return _jit_compile(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension '_hash_encoder'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1372359) of binary: /home/rayne/anaconda3/envs/nf22/bin/python
Traceback (most recent call last):
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/rayne/anaconda3/envs/nf22/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training/exp_runner.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-11-20_17:23:18
host : phil-OMEN-by-HP-45L-Gaming-Desktop-GT22-0xxx
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1372359)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, now the error is:
/bin/sh: 1: :/usr/local/cuda-11.3/bin/nvcc: not found
I would suggest you re-install the conda environment following our README and install PyTorch with:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
Sorry to bother you again. I'm wondering whether you meant that I should do the following (instead of using conda install pytorch torchvision cudatoolkit=11.3 -c pytorch):
conda create -y -n monosdf python=3.8
conda activate monosdf
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
conda install cudatoolkit-dev=11.3 -c conda-forge
Yes
Hello, I re-installed the environment as suggested, and the error changed again:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
Could you please help me figure out how to solve this?
The whole error is:
(nf2) rayne@phil-OMEN-by-HP-45L-Gaming-Desktop-GT22-0xxx:~/code/monosdf/code$ CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --nnodes=1 --node_rank=0 training/exp_runner.py --conf confs/dtu_mlp_3views.conf --scan_id 65
/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
RANK and WORLD_SIZE in environ: 0/1
opt.local_rank 0
shell command : training/exp_runner.py --local_rank=0 --conf confs/dtu_mlp_3views.conf --scan_id 65
Loading data ...
Finish loading data. Data-set size: 49
Detected CUDA files, patching ldflags
Emitting ninja build file ./tmp_build/build.ninja...
Building extension module _hash_encoder...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/rayne/anaconda3/envs/nf2/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -std=c++14 -allow-unsupported-compiler -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ THRUST_IGNORE_CUB_VERSION_CHECK -c /home/rayne/code/monosdf/code/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o
FAILED: hashencoder.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/rayne/anaconda3/envs/nf2/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -std=c++14 -allow-unsupported-compiler -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ THRUST_IGNORE_CUB_VERSION_CHECK -c /home/rayne/code/monosdf/code/hashencoder/src/hashencoder.cu -o hashencoder.cuda.o
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
[2/3] c++ -MMD -MF bindings.o.d -DTORCH_EXTENSION_NAME=_hash_encoder -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/TH -isystem /home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/rayne/anaconda3/envs/nf2/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -c /home/rayne/code/monosdf/code/hashencoder/src/bindings.cpp -o bindings.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
subprocess.run(
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "training/exp_runner.py", line 58, in <module>
trainrunner = MonoSDFTrainRunner(conf=opt.conf,
File "/home/rayne/code/monosdf/code/../code/training/monosdf_train.py", line 107, in __init__
self.model = utils.get_class(self.conf.get_string('train.model_class'))(conf=conf_model)
File "/home/rayne/code/monosdf/code/../code/utils/general.py", line 17, in get_class
m = __import__(module)
File "/home/rayne/code/monosdf/code/../code/model/network.py", line 140, in <module>
from hashencoder.hashgrid import _hash_encode, HashEncoder
File "/home/rayne/code/monosdf/code/../code/hashencoder/__init__.py", line 1, in <module>
from .hashgrid import HashEncoder
File "/home/rayne/code/monosdf/code/../code/hashencoder/hashgrid.py", line 12, in <module>
from .backend import _backend
File "/home/rayne/code/monosdf/code/../code/hashencoder/backend.py", line 10, in <module>
_backend = load(name='_hash_encoder',
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension '_hash_encoder'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1377802) of binary: /home/rayne/anaconda3/envs/nf2/bin/python
Traceback (most recent call last):
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/rayne/anaconda3/envs/nf2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training/exp_runner.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-11-20_18:56:51
host : phil-OMEN-by-HP-45L-Gaming-Desktop-GT22-0xxx
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1377802)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The conda list is:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.3.0 pypi_0 pypi
ca-certificates 2022.9.24 ha878542_0 conda-forge
cachetools 5.2.0 pypi_0 pypi
certifi 2022.9.24 pyhd8ed1ab_0 conda-forge
charset-normalizer 2.1.1 pypi_0 pypi
contourpy 1.0.6 pypi_0 pypi
cudatoolkit-dev 11.3.1 py38h497a2fe_0 conda-forge
cycler 0.11.0 pypi_0 pypi
fonttools 4.38.0 pypi_0 pypi
google-auth 2.14.1 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
grpcio 1.50.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
imageio 2.22.4 pypi_0 pypi
importlib-metadata 5.0.0 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.2 h295c915_4
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
markdown 3.4.1 pypi_0 pypi
markupsafe 2.1.1 pypi_0 pypi
matplotlib 3.6.2 pypi_0 pypi
ncurses 6.3 h5eee18b_3
networkx 2.8.8 pypi_0 pypi
ninja 1.11.1 pypi_0 pypi
numpy 1.23.5 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
opencv-python 4.6.0.66 pypi_0 pypi
openssl 1.1.1s h7f8727e_0
packaging 21.3 pypi_0 pypi
pillow 9.3.0 pypi_0 pypi
pip 22.2.2 py38h06a4308_0
protobuf 3.20.3 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pyhocon 0.3.59 pypi_0 pypi
pyparsing 2.4.7 pypi_0 pypi
python 3.8.15 h3fd9d12_0
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.8 2_cp38 conda-forge
pywavelets 1.4.1 pypi_0 pypi
readline 8.2 h5eee18b_0
requests 2.28.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rsa 4.9 pypi_0 pypi
scikit-image 0.19.3 pypi_0 pypi
scipy 1.9.3 pypi_0 pypi
setuptools 65.5.0 py38h06a4308_0
six 1.16.0 pypi_0 pypi
sqlite 3.39.3 h5082296_0
tensorboard 2.11.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tifffile 2022.10.10 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
torch 1.12.1+cu113 pypi_0 pypi
torchvision 0.13.1+cu113 pypi_0 pypi
tqdm 4.64.1 pypi_0 pypi
trimesh 3.16.4 pypi_0 pypi
typing-extensions 4.4.0 pypi_0 pypi
urllib3 1.26.12 pypi_0 pypi
werkzeug 2.2.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.6 h5eee18b_0
zipp 3.10.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0
Thank you for your help!
I think you need to remove THRUST_IGNORE_CUB_VERSION_CHECK, or try changing it to -DTHRUST_IGNORE_CUB_VERSION_CHECK=1.
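For example (again a sketch, assuming the define was appended to the nvcc flag list in backend.py): a bare THRUST_IGNORE_CUB_VERSION_CHECK=1 token is not a recognized option, so nvcc treats it as a second input file, which is exactly the "A single input file is required" fatal error; the -D prefix turns it into a preprocessor define.

nvcc_flags = [
    '-O3', '-std=c++14', '-allow-unsupported-compiler',
    # 'THRUST_IGNORE_CUB_VERSION_CHECK=1',    # wrong: nvcc reads this as an extra input file
    '-DTHRUST_IGNORE_CUB_VERSION_CHECK=1',    # right: defines the macro for the preprocessor
]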
That works. Thanks for your kind help!
Hi, thanks for your wonderful work!
When running the command to train monosdf, the following error is reported. I'm running it on an Ubuntu 20.04 machine.
I have looked at issue #19 and commented the line, but the error still remains (as shown below).
I have installed cudatoolkit-11.3 and cudatoolkit-dev-11.3. You might refer to the conda list below.
The error log:
The conda list:
Could you give me a hint about that? Thanks for your help!