NVlabs / tiny-cuda-nn

tiny-cuda-nn wheel does not build in Docker image (it loops indefinitely without failing) #475

Open violetamenendez opened 3 weeks ago

violetamenendez commented 3 weeks ago

Hi,

I am trying to create a Docker image for nerfstudio based on this one: https://hub.docker.com/layers/dromni/nerfstudio/1.1.4/images/sha256-ff0107a7db96bb8ee29c638729328b832b268b890c50f2a2ff25988bb84d4f75?context=explore

However, the tiny-cuda-nn wheel build never finishes: it neither fails nor succeeds, and keeps running until the build job times out.

I am following the nerfstudio installation instructions (https://github.com/nerfstudio-project/nerfstudio?tab=readme-ov-file#dependencies), which coincide with the instructions in this tiny-cuda-nn repo. In fact, when I use a previous Docker image version, dromni/nerfstudio:0.1.16, with older versions of the libraries and CUDA 11.7, it all works fine. The problematic Dockerfile is:

FROM dromni/nerfstudio:1.1.4
WORKDIR /
USER root
# Setup NeRFStudio
RUN cd /workspace && git clone https://github.com/nerfstudio-project/nerfstudio.git && \
    cd /workspace/nerfstudio && \
    pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118 && \
    pip install ninja gsplat && \
    pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch && \
    pip install --upgrade pip setuptools && \
    pip install -e .

If I remove the installation of tiny-cuda-nn, everything else builds perfectly fine. Otherwise I get this log:

#5 174.6 Collecting git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
#5 174.6   Cloning https://github.com/NVlabs/tiny-cuda-nn/ to /tmp/pip-req-build-_rc_iady
#5 174.6   Running command git clone --filter=blob:none --quiet https://github.com/NVlabs/tiny-cuda-nn/ /tmp/pip-req-build-_rc_iady
#5 176.4   Resolved https://github.com/NVlabs/tiny-cuda-nn/ to commit c91138bcd4c6877c8d5e60e483c0581aafc70cce
#5 176.4   Running command git submodule update --init --recursive -q
#5 183.6   Preparing metadata (setup.py): started
#5 187.7   Preparing metadata (setup.py): finished with status 'done'
#5 187.9 Collecting ninja
#5 188.0   Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
#5 188.1      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 KB 2.4 MB/s eta 0:00:00
#5 188.2 Building wheels for collected packages: tinycudann
#5 188.2   Building wheel for tinycudann (setup.py): started
#5 278.6   Building wheel for tinycudann (setup.py): still running...
#5 592.7   Building wheel for tinycudann (setup.py): still running...
#5 777.5   Building wheel for tinycudann (setup.py): still running...
#5 1176.3   Building wheel for tinycudann (setup.py): still running...
#5 1270.0   Building wheel for tinycudann (setup.py): still running...
#5 1651.6   Building wheel for tinycudann (setup.py): still running...
#5 1917.0   Building wheel for tinycudann (setup.py): still running...
#5 2252.7   Building wheel for tinycudann (setup.py): still running...
#5 2339.5   Building wheel for tinycudann (setup.py): still running...
#5 2701.9   Building wheel for tinycudann (setup.py): still running...
#5 2940.4   Building wheel for tinycudann (setup.py): still running...
#5 3287.7   Building wheel for tinycudann (setup.py): still running...
#5 CANCELED
context canceled
ERROR: Job failed: execution took longer than 1h0m0s seconds

I passed the --verbose flag to pip and got one NumPy error early on (which does not make the job fail), followed by warnings that seem to loop while the wheel build runs indefinitely:

NumPy error:

#6 156.8 Building wheels for collected packages: tinycudann
#6 156.8   Building wheel for tinycudann (setup.py): started
#6 156.8   Running command python setup.py bdist_wheel
#6 157.8 
#6 157.8   A module that was compiled using NumPy 1.x cannot be run in
#6 157.8   NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
#6 157.8   versions of NumPy, modules must be compiled with NumPy 2.0.
#6 157.8   Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
#6 157.8 
#6 157.8   If you are a user of the module, the easiest solution will be to
#6 157.8   downgrade to 'numpy<2' or try to upgrade the affected module.
#6 157.8   We expect that some modules will need time to support NumPy 2.
#6 157.8 
#6 157.8   Traceback (most recent call last):
#6 157.8     File "<string>", line 2, in <module>
#6 157.8     File "<pip-setuptools-caller>", line 34, in <module>
#6 157.8     File "/tmp/pip-req-build-cm6ig4ie/bindings/torch/setup.py", line 9, in <module>
#6 157.8       import torch
#6 157.8     File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1382, in <module>
#6 157.8       from .functional import *  # noqa: F403
#6 157.8     File "/usr/local/lib/python3.10/dist-packages/torch/functional.py", line 7, in <module>
#6 157.8       import torch.nn.functional as F
#6 157.8     File "/usr/local/lib/python3.10/dist-packages/torch/nn/__init__.py", line 1, in <module>
#6 157.8       from .modules import *  # noqa: F403
#6 157.8     File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/__init__.py", line 35, in <module>
#6 157.8       from .transformer import TransformerEncoder, TransformerDecoder, \
#6 157.8     File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/transformer.py", line 20, in <module>
#6 157.8       device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
#6 157.8   /usr/local/lib/python3.10/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
#6 157.8     device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
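
For anyone reproducing this, the message above itself suggests downgrading to 'numpy<2'. A minimal sketch of a check that could be run inside the image before the build is shown below. The helper names are hypothetical, and whether this warning is related to the stalled build is not established by the log.

```python
from importlib import metadata
from typing import Optional


def numpy_major(version: str) -> int:
    """Return the major component of a version string such as '2.1.2'."""
    return int(version.split(".")[0])


def needs_numpy_pin(version: str) -> bool:
    """True when NumPy is 2.x or newer, the situation the warning describes."""
    return numpy_major(version) >= 2


def installed_numpy_version() -> Optional[str]:
    """Return the installed NumPy version, or None if NumPy is not installed."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None
```

If `needs_numpy_pin(installed_numpy_version())` is true, adding `pip install "numpy<2"` before the tiny-cuda-nn step is one thing to try, as the warning text recommends.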

Warnings loop:

#6 189.3   [6/10] /usr/local/cuda/bin/nvcc  -I/tmp/pip-req-build-cm6ig4ie/include -I/tmp/pip-req-build-cm6ig4ie/dependencies -I/tmp/pip-req-build-cm6ig4ie/dependencies/cutlass/include -I/tmp/pip-req-build-cm6ig4ie/dependencies/cutlass/tools/util/include -I/tmp/pip-req-build-cm6ig4ie/dependencies/fmt/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pip-req-build-cm6ig4ie/src/object.cu -o /tmp/pip-req-build-cm6ig4ie/bindings/torch/src/object.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -std=c++17 --extended-lambda --expt-relaxed-constexpr -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -Xcompiler=-Wno-float-conversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -DTCNN_PARAMS_UNALIGNED -DTCNN_MIN_GPU_ARCH=90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_90_C -D_GLIBCXX_USE_CXX11_ABI=0
#6 189.3   /tmp/pip-req-build-cm6ig4ie/dependencies/fmt/include/fmt/core.h(288): warning #1675-D: unrecognized GCC pragma
#6 189.3 
#6 189.3   /tmp/pip-req-build-cm6ig4ie/dependencies/fmt/include/fmt/core.h(288): warning #1675-D: unrecognized GCC pragma
#6 189.3 
#6 241.3   [7/10] c++ -MMD -MF /tmp/pip-req-build-cm6ig4ie/bindings/torch/build/temp.linux-x86_64-3.10/tinycudann/bindings.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/tmp/pip-req-build-cm6ig4ie/include -I/tmp/pip-req-build-cm6ig4ie/dependencies -I/tmp/pip-req-build-cm6ig4ie/dependencies/cutlass/include -I/tmp/pip-req-build-cm6ig4ie/dependencies/cutlass/tools/util/include -I/tmp/pip-req-build-cm6ig4ie/dependencies/fmt/include -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /tmp/pip-req-build-cm6ig4ie/bindings/torch/tinycudann/bindings.cpp -o /tmp/pip-req-build-cm6ig4ie/bindings/torch/build/temp.linux-x86_64-3.10/tinycudann/bindings.o -std=c++17 -DTCNN_PARAMS_UNALIGNED -DTCNN_MIN_GPU_ARCH=90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_90_C -D_GLIBCXX_USE_CXX11_ABI=0
#6 241.3   In file included from /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/Exceptions.h:14,
#6 241.3                    from /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include/torch/python.h:11,
#6 241.3                    from /usr/local/lib/python3.10/dist-packages/torch/include/torch/extension.h:9,
#6 241.3                    from /tmp/pip-req-build-cm6ig4ie/bindings/torch/tinycudann/bindings.cpp:34:
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h: In instantiation of ‘class pybind11::class_<tcnn::cpp::LogSeverity>’:
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:2170:7:   required from ‘class pybind11::enum_<tcnn::cpp::LogSeverity>’
#6 241.3   /tmp/pip-req-build-cm6ig4ie/bindings/torch/tinycudann/bindings.cpp:283:52:   required from here
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1496:7: warning: ‘pybind11::class_<tcnn::cpp::LogSeverity>’ declared with greater visibility than its base ‘pybind11::detail::generic_type’ [-Wattributes]
#6 241.3    1496 | class class_ : public detail::generic_type {
#6 241.3         |       ^~~~~~
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h: In instantiation of ‘class pybind11::class_<tcnn::cpp::Precision>’:
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:2170:7:   required from ‘class pybind11::enum_<tcnn::cpp::Precision>’
#6 241.3   /tmp/pip-req-build-cm6ig4ie/bindings/torch/tinycudann/bindings.cpp:292:48:   required from here
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1496:7: warning: ‘pybind11::class_<tcnn::cpp::Precision>’ declared with greater visibility than its base ‘pybind11::detail::generic_type’ [-Wattributes]
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h: In instantiation of ‘class pybind11::class_<tcnn::cpp::Context>’:
#6 241.3   /tmp/pip-req-build-cm6ig4ie/bindings/torch/tinycudann/bindings.cpp:309:45:   required from here
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1496:7: warning: ‘pybind11::class_<tcnn::cpp::Context>’ declared with greater visibility than its base ‘pybind11::detail::generic_type’ [-Wattributes]
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h: In instantiation of ‘class pybind11::class_<Module>’:
#6 241.3   /tmp/pip-req-build-cm6ig4ie/bindings/torch/tinycudann/bindings.cpp:316:32:   required from here
#6 241.3   /usr/local/lib/python3.10/dist-packages/torch/include/pybind11/pybind11.h:1496:7: warning: ‘pybind11::class_<Module>’ declared with greater visibility than its base ‘pybind11::detail::generic_type’ [-Wattributes]
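
To tell a slow build apart from a real loop, it may help to pull the ninja step counters (such as `[6/10]`) and the `-gencode` targets out of the verbose log. The sketch below assumes the BuildKit line prefix seen above (`#6 <seconds>`); the function name and regular expressions are illustrative and not part of pip or tiny-cuda-nn.

```python
import re

# Matches lines such as "#6 189.3   [6/10] /usr/local/cuda/bin/nvcc ..."
STEP_RE = re.compile(
    r"^#\d+\s+(?P<t>\d+(?:\.\d+)?)\s+\[(?P<done>\d+)/(?P<total>\d+)\]"
)
# Matches the compute capability in flags such as "-gencode=arch=compute_90,code=sm_90"
ARCH_RE = re.compile(r"-gencode=arch=compute_(\d+)")


def summarize(log_text: str):
    """Return (steps, archs) for a verbose build log.

    steps: list of (seconds, done, total) tuples, one per ninja step line.
    archs: sorted list of distinct compute capabilities seen in -gencode flags.
    """
    steps = []
    archs = set()
    for line in log_text.splitlines():
        match = STEP_RE.match(line)
        if match:
            steps.append(
                (float(match["t"]), int(match["done"]), int(match["total"]))
            )
        archs.update(int(a) for a in ARCH_RE.findall(line))
    return steps, sorted(archs)
```

If the step counter keeps advancing, or restarts from `[1/10]` with a different architecture in the flags, that would suggest the same sources are being compiled once per GPU architecture rather than the same step being repeated. The excerpt above only shows `compute_90`, so this remains a guess until the full log is checked.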

I have attached a longer log output for more context: tiny-cuda-nn-wheel-docker-log.txt

I cannot really make much sense of these logs, and I have run out of ideas on how to debug this, so any help is much appreciated. Thank you!

j-nordling commented 3 weeks ago

I am running into the same issue.