PixarAnimationStudios / OpenSubdiv

An Open-Source subdivision surface library.
graphics.pixar.com/opensubdiv

Building with high parallelism and CUDA support results in sporadic build failures #1313

Open amarshall opened 1 year ago

amarshall commented 1 year ago

Building with 48 threads, 19 of 50 sequential builds failed (a 38% failure rate). I am building via the nixpkgs derivation, but I don’t see any reason why this would be specific to that build environment. Building without CUDA produced no failures in 50 runs.
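
For anyone who wants to measure the failure rate the same way, here is a minimal Python sketch of a repeat-and-count harness. It is not the script used for the numbers above; the function name `failure_rate` and the `./build.sh` placeholder are assumptions made for illustration.

```python
import subprocess


def failure_rate(command, runs):
    """Run `command` `runs` times in sequence and return (failures, rate)."""
    failures = 0
    for _ in range(runs):
        # A non-zero exit status is counted as a failed build.
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures, failures / runs


# Hypothetical usage: "./build.sh" stands in for whatever performs one
# clean, parallel build (for example a wrapper around the nixpkgs build).
# failures, rate = failure_rate(["./build.sh"], 50)
```

Each run should start from a clean build tree; otherwise later runs reuse outputs from earlier ones and the count is not comparable.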

My guess is that there’s an implicit dependency somewhere. I spent a little while trying to find it but did not succeed (I’m not very proficient with CMake).

I have seen at least two different failures:

```
CMake Error at /nix/store/0dv0ylafnx7cdajyv9ahbpqrniblixq1-cmake-3.26.4/share/cmake-3.26/Modules/FindCUDA/make2cmake.cmake:48 (file):
  file failed to open for reading (No such file or directory):

    /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.NVCC-depend

CMake Error at osd_static_gpu_generated_cudaKernel.cu.o.Release.cmake:236 (message):
  Error generating
  /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/./osd_static_gpu_generated_cudaKernel.cu.o

make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o] Error 1
```

and

```
Error copying file (if different) from "/build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.depend.tmp" to "/build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.depend".
CMake Error at osd_static_gpu_generated_cudaKernel.cu.o.Release.cmake:246 (message):
  Error generating
  /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/./osd_static_gpu_generated_cudaKernel.cu.o

make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o] Error 1
```

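
In both failures the `make[2]` line names `osd_dynamic_gpu.dir/build.make` as the makefile whose rule failed, while the object being generated lives under `osd_static_gpu.dir`. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the same generated object is declared by more than one target's makefile, so a parallel build can run the generation step twice at the same time. Below is a minimal Python sketch for checking that, assuming the Makefile generator's per-target `build.make` layout and paths without spaces. The names `RULE` and `shared_outputs` are illustrative only.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches a make rule line "target: deps" that starts in column 0.
# Comments, recipe lines (leading tab), and variable assignments do not match.
RULE = re.compile(r"^([^\s#:=]+):(?!=)")


def shared_outputs(build_dir):
    """Return {rule target: [build.make files]} for every target that is
    declared in more than one build.make under build_dir."""
    owners = defaultdict(set)
    for makefile in Path(build_dir).rglob("build.make"):
        for line in makefile.read_text(errors="replace").splitlines():
            match = RULE.match(line)
            if not match:
                continue
            target = match.group(1)
            # Skip make's special targets such as .SUFFIXES.
            if target.startswith(".") and "/" not in target:
                continue
            owners[target].add(str(makefile))
    return {t: sorted(files) for t, files in owners.items() if len(files) > 1}


# Hypothetical usage against a configured build tree:
# for target, files in shared_outputs("build").items():
#     print(target, "<-", files)
```

If the `.cu.o` path from the errors shows up in the result, that would be consistent with two targets racing to generate one file. An empty result would point away from this explanation.
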
davidgyu commented 1 year ago

Filed as internal issue #OSD-426

davidgyu commented 1 year ago

Interesting. We haven't seen that before. Can you tell us more about your system configuration: OS, Compiler, GPU, Driver version, CUDA version?

amarshall commented 1 year ago

Hi! Thanks for the reply.

Log output of configure stage + build flags

Note that I have manually wrapped the cmake flags to make them easier to read.

```
@nix { "action": "setPhase", "phase": "configurePhase" }
configuring
fixing cmake files...
cmake flags:
  -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF
  -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF
  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON
  -DCMAKE_BUILD_TYPE=Release
  -DBUILD_TESTING=OFF
  -DCMAKE_INSTALL_LOCALEDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/locale
  -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/libexec
  -DCMAKE_INSTALL_LIBDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/lib
  -DCMAKE_INSTALL_DOCDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/doc/OpenSubdiv
  -DCMAKE_INSTALL_INFODIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/info
  -DCMAKE_INSTALL_MANDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/man
  -DCMAKE_INSTALL_OLDINCLUDEDIR=/nix/store/1np3p9y42nv1m06ywspgqj20r5p41xla-opensubdiv-3.5.0-dev/include
  -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/1np3p9y42nv1m06ywspgqj20r5p41xla-opensubdiv-3.5.0-dev/include
  -DCMAKE_INSTALL_SBINDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/sbin
  -DCMAKE_INSTALL_BINDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/bin
  -DCMAKE_INSTALL_NAME_DIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/lib
  -DCMAKE_POLICY_DEFAULT_CMP0025=NEW
  -DCMAKE_OSX_SYSROOT=
  -DCMAKE_FIND_FRAMEWORK=LAST
  -DCMAKE_STRIP=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/strip
  -DCMAKE_RANLIB=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/ranlib
  -DCMAKE_AR=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/ar
  -DCMAKE_C_COMPILER=gcc
  -DCMAKE_CXX_COMPILER=g++
  -DCMAKE_INSTALL_PREFIX=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0
  -DNO_TUTORIALS=1
  -DNO_REGRESSION=1
  -DNO_EXAMPLES=1
  -DNO_METAL=1
  -DGLEW_INCLUDE_DIR=/nix/store/55n26bd7l2jdxj8fkh688nrv290d3hp8-glew-2.2.0-dev/include
  -DGLEW_LIBRARY=/nix/store/55n26bd7l2jdxj8fkh688nrv290d3hp8-glew-2.2.0-dev/lib
  -DOSD_CUDA_NVCC_FLAGS=--gpu-architecture=compute_37
  -DCUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin/cc
  -DNO_OPENCL=1
  -DCUDA_TOOLKIT_ROOT_DIR=/nix/store/vxw61j9ff7d5jdq2cwy1bh4q5j82jvy5-cudatoolkit-11.8.0
  -DCUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin
  -DCMAKE_CUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin
-- The C compiler identification is GNU 12.3.0
-- The CXX compiler identification is GNU 12.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Compiling OpenSubdiv version v3_5_0
-- Using cmake version 3.26.4
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Could NOT find TBB (missing: TBB_INCLUDE_DIR TBB_LIBRARIES) (Required is at least version "4.0")
-- Found OpenGL: /nix/store/xibw0p5bj2z3a566mannk3vflb9f5fph-libGL-1.6.0/lib/libOpenGL.so
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDA: /nix/store/vxw61j9ff7d5jdq2cwy1bh4q5j82jvy5-cudatoolkit-11.8.0 (found suitable version "11.8", minimum required is "4.0")
-- Found X11: /nix/store/gz38plw089ri9k2lh7gzhh58ydhb3rv1-xorgproto-2023.2/include
-- Looking for XOpenDisplay in /nix/store/igp21718s3sa932z7baqnhlc72v0zl0z-libX11-1.8.6/lib/libX11.so;/nix/store/4s3wrg560496dx3qx8gnvvjqz4hc9222-libXext-1.3.5/lib/libXext.so
-- Looking for XOpenDisplay in /nix/store/igp21718s3sa932z7baqnhlc72v0zl0z-libX11-1.8.6/lib/libX11.so;/nix/store/4s3wrg560496dx3qx8gnvvjqz4hc9222-libXext-1.3.5/lib/libXext.so - found
-- Looking for gethostbyname
-- Looking for gethostbyname - found
-- Looking for connect
-- Looking for connect - found
-- Looking for remove
-- Looking for remove - found
-- Looking for shmat
-- Looking for shmat - found
-- Could NOT find GLFW (missing: GLFW_INCLUDE_DIR GLFW_LIBRARIES) (Required is at least version "3.0.0")
-- Could NOT find PTex (missing: PTEX_INCLUDE_DIR PTEX_LIBRARY) (Required is at least version "2.0")
-- Could NOT find ZLIB (missing: ZLIB_LIBRARY ZLIB_INCLUDE_DIR) (Required is at least version "1.2")
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) (Required is at least version "1.8.4")
-- Could NOT find Docutils (missing: RST2HTML_EXECUTABLE DOCUTILS_VERSION) (Required is at least version "0.9")
-- Found Python: /nix/store/9c03r86hcdn43dm3hsgjirifvyzfkhwh-python3-3.10.12/bin/python3.10 (found version "3.10.12") found components: Interpreter
CMake Warning at CMakeLists.txt:430 (message):
  TBB was not found : support for TBB parallel compute kernels will be
  disabled in Osd. If your compiler supports TBB directives, please refer to
  the FindTBB.cmake shared module in your cmake installation.

CMake Warning at CMakeLists.txt:619 (message):
  Ptex was not found : the OpenSubdiv Ptex example will not be available. If
  you do have Ptex installed and see this message, please add your Ptex path
  to FindPTex.cmake in /build/source/cmake or set it through the
  PTEX_LOCATION cmake command line argument or environment variable.

CMake Warning at documentation/CMakeLists.txt:52 (message):
  Doxyen was not found : support for Doxygen automated API documentation is
  disabled.

-- Configuring done (3.6s)
-- Generating done (0.0s)
CMake Warning:
  Manually-specified variables were not used by the project:

    BUILD_TESTING
    CMAKE_EXPORT_NO_PACKAGE_REGISTRY
    CMAKE_POLICY_DEFAULT_CMP0025
    GLEW_LIBRARY

-- Build files have been written to: /build/source/build
cmake: enabled parallel building
cmake: enabled parallel installing
@nix { "action": "setPhase", "phase": "buildPhase" }
building
build flags: -j48 SHELL=/nix/store/a7f7xfp9wyghf44yv6l6fv9dfw492hd3-bash-5.2-p15/bin/bash
```

(Remainder of logs omitted)

davidgyu commented 1 year ago

Thanks for the additional information!

bonsairobo commented 2 weeks ago

I just hit this failure when building nixpkgs. The build succeeded on retry. Just making it known that the workaround is not a silver bullet.