dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

compiling container nvcr.io/nvidia/l4t-ml:r32.4.4-py3 - error with torchaudio #101

Open zubairahmed-ai opened 2 years ago

zubairahmed-ai commented 2 years ago

Hi @dusty-nv, I am running ./scripts/docker_build_ml.sh all after scripts/docker_run.sh -c nvcr.io/nvidia/l4t-ml:r32.4.4-py3, but the build fails with the following error, which I think comes from torchaudio. This seems similar, but I am not sure how to fix it here. I am stuck, please help.

-- Caffe2: CUDA toolkit directory: /usr/local/cuda-10.2
-- Caffe2: Header version is: 10.2
-- Found CUDNN: /usr/lib/aarch64-linux-gnu/libcudnn.so
-- Found cuDNN: v8.0.0  (include: /usr/include, library: /usr/lib/aarch64-linux-gnu/libcudnn.so)
CMake Warning at /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:198 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:55 (find_package)

CMake Warning at /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Caffe2/public/utils.cmake:365 (message):
  In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
  to cmake instead of implicitly setting it as an env variable.  This will
  become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
  /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:483 (torch_cuda_get_nvcc_gencode_flag)
  /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /usr/local/lib/python3.6/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:55 (find_package)

-- Added CUDA NVCC flags for: -gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_62,code=sm_62;-gencode;arch=compute_72,code=sm_72
-- Found Torch: /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
get_version.sh: 37: get_version.sh: Bad substitution
get_version.sh: The version number "5.5" specified in src/.version is not in MAJOR.MINOR format.
get_version.sh: Stopping the construction of full version number from git history.
get_version.sh: 45: get_version.sh: [[: not found
-- Configuring done
-- Generating done
-- Build files have been written to: /torchaudio/build/temp.linux-aarch64-3.6
[1/19] /usr/bin/c++   -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /usr/local/lib/python3.6/dist-packages/torch/include -isystem /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-10.2/include -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC   -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-math.cc.o -c ../../third_party/kaldi/submodule/src/base/kaldi-math.cc
[2/19] /usr/bin/c++   -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /usr/local/lib/python3.6/dist-packages/torch/include -isystem /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-10.2/include -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC   -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/base/kaldi-error.cc.o -c ../../third_party/kaldi/submodule/src/base/kaldi-error.cc
[3/19] /usr/bin/c++   -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /usr/local/lib/python3.6/dist-packages/torch/include -isystem /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-10.2/include -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC   -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o -c ../../third_party/kaldi/submodule/src/feat/pitch-functions.cc
FAILED: third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o
/usr/bin/c++   -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /usr/local/lib/python3.6/dist-packages/torch/include -isystem /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-10.2/include -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC   -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/pitch-functions.cc.o -c ../../third_party/kaldi/submodule/src/feat/pitch-functions.cc
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc: In member function 'void kaldi::OnlinePitchFeatureImpl::UpdateRemainder(const kaldi::VectorBase<float>&)':
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc:814:11: warning: unused variable 'full_frame_length' [-Wunused-variable]
     int32 full_frame_length = opts_.NccfWindowSize() + nccf_last_lag_;
           ^~~~~~~~~~~~~~~~~
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc: In member function 'void kaldi::OnlineProcessPitch::UpdateNormalizationStats(kaldi::int32)':
../../third_party/kaldi/submodule/src/feat/pitch-functions.cc:1504:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (normalization_stats_.size() <= frame)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
[4/19] /usr/bin/c++   -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /usr/local/lib/python3.6/dist-packages/torch/include -isystem /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-10.2/include -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC   -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o -c ../../third_party/kaldi/submodule/src/feat/resample.cc
FAILED: third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o
/usr/bin/c++   -I../../third_party/kaldi/src -I../../third_party/kaldi/submodule/src -isystem /usr/local/lib/python3.6/dist-packages/torch/include -isystem /usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-10.2/include -Wall -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility=hidden -O3 -DNDEBUG -fPIC   -D_GLIBCXX_USE_CXX11_ABI=1 -std=gnu++14 -MD -MT third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o -MF third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o.d -o third_party/kaldi/CMakeFiles/kaldi.dir/submodule/src/feat/resample.cc.o -c ../../third_party/kaldi/submodule/src/feat/resample.cc
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
^C
^C^C

dusty-nv commented 2 years ago

c++: internal compiler error: Killed (program cc1plus)

Did your board run out of memory? Do you have extra swap memory mounted?
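A "Killed (program cc1plus)" message usually means the kernel's out-of-memory killer stopped the compiler. One way to check how much RAM and swap the board has is to read /proc/meminfo. The snippet below is a minimal sketch, not part of jetson-containers; the helper names parse_meminfo and summarize_memory are made up for illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kibibytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        # Most fields look like "MemTotal:  4059712 kB"; keep only the number.
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def summarize_memory(meminfo):
    """Return (ram_mib, swap_mib); fields missing from the dict count as 0."""
    return meminfo.get("MemTotal", 0) // 1024, meminfo.get("SwapTotal", 0) // 1024


if __name__ == "__main__":
    proc = Path("/proc/meminfo")
    if proc.exists():
        ram_mib, swap_mib = summarize_memory(parse_meminfo(proc.read_text()))
        print(f"RAM: {ram_mib} MiB, swap: {swap_mib} MiB")
        if swap_mib == 0:
            print("No swap is active; large C++ builds may be killed when RAM runs out.")
```

If the reported swap is 0 MiB, mounting a swap file before starting the build is the first thing to try.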

zubairahmed-ai commented 2 years ago

I'm not sure how to check. It's a 4 GB board, and I don't know how to enable swap memory.

zubairahmed-ai commented 2 years ago

I successfully built the TensorFlow container separately, but building the PyTorch one separately gave the same error as above. My assumption is that the "all" switch builds everything it needs, including TensorFlow and PyTorch.

zubairahmed-ai commented 2 years ago

Even after increasing swap memory to 4 GB, I think I am still running out of memory, and the build gives the same error as above.
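When swap alone is not enough, a common mitigation is to run fewer compile jobs in parallel so that peak memory stays within physical RAM. The sketch below only estimates a job count. The function name suggest_build_jobs and the 1536 MiB per-job budget are assumptions for illustration, not measured values. How the count is passed to the build (for example through an environment variable or a -j flag) depends on the build script and the CMake version, and nothing in this thread confirms it.

```python
def suggest_build_jobs(ram_mib, cpu_count, per_job_mib=1536):
    """Suggest a parallel job count so that jobs * per_job_mib fits in RAM.

    per_job_mib is a hypothetical budget for one heavy C++ compile.
    Swap is deliberately ignored: it prevents kills but is slow to build from.
    """
    by_memory = ram_mib // per_job_mib
    # Never go below one job, and never above the number of CPU cores.
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # Example: a 4 GB board (roughly 3964 MiB usable) with 4 CPU cores.
    print(suggest_build_jobs(ram_mib=3964, cpu_count=4))
```

With these assumed numbers the estimate for a 4 GB board is two jobs rather than one per core.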

zubairahmed-ai commented 2 years ago

Somehow, after increasing swap memory, the build now exits gracefully, showing the massive block of red text below:

/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  setuptools.SetuptoolsDeprecationWarning,
/usr/local/lib/python3.6/dist-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  EasyInstallDeprecationWarning,
Traceback (most recent call last):
  File "setup.py", line 91, in <module>
    zip_safe=False,
  File "/usr/local/lib/python3.6/dist-packages/setuptools/__init__.py", line 159, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.6/dist-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/usr/local/lib/python3.6/dist-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/usr/lib/python3.6/distutils/command/install_lib.py", line 109, in build
    self.run_command('build_ext')
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/torchaudio/build_tools/setup_helpers/extension.py", line 52, in run
    super().run()
  File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/usr/local/lib/python3.6/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
    _build_ext.build_ext.run(self)
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/usr/local/lib/python3.6/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
    _build_ext.build_ext.build_extensions(self)
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
    self._build_extensions_serial()
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
    self.build_extension(ext)
  File "/torchaudio/build_tools/setup_helpers/extension.py", line 99, in build_extension
    ["cmake", "--build", "."] + build_args, cwd=self.build_temp)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install']' returned non-zero exit status 1.

dusty-nv commented 2 years ago

Hmm I'm not exactly sure what the actual error is there. If you don't need torchaudio, you could comment that section out of the dockerfile.
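The traceback above only reports that the cmake --build subprocess returned exit status 1; the compiler message that caused it appears earlier in the build output. The snippet below is a minimal sketch of that behaviour using a stand-in child process. The helper run_and_report is hypothetical and is not how torchaudio's setup.py is written.

```python
import subprocess
import sys


def run_and_report(cmd, tail_lines=5):
    """Run cmd; on failure return (exit_status, last lines of its output)."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = result.stdout.decode(errors="replace")
    if result.returncode != 0:
        # The child's own last lines usually name the real failure,
        # which the Python traceback of the parent does not show.
        return result.returncode, output.strip().splitlines()[-tail_lines:]
    return 0, []


if __name__ == "__main__":
    # Stand-in for a failing build step: prints a message, then exits with status 1.
    fake_build = [sys.executable, "-c", "import sys; print('FAILED: example.o'); sys.exit(1)"]
    status, tail = run_and_report(fake_build)
    print(status, tail)
```

In practice this means scrolling up past the traceback to the first FAILED or error line in the build log.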

Is it possible for you to use one of the existing pre-built l4t-ml container images that are on NGC?

zubairahmed-ai commented 2 years ago

Hmm I'm not exactly sure what the actual error is there. If you don't need torchaudio, you could comment that section out of the dockerfile.

Is it possible for you to use one of the existing pre-built l4t-ml container images that are on NGC?

I actually did comment that out; I wish you'd pointed this out earlier. Yes, it's totally possible; let me know how to use those pre-built containers. I am following the instructions in your readme. Right now I am stuck with the following error after commenting out torchaudio:

build/temp.linux-aarch64-3.6/scratch/vers.c:4:10: fatal error: zmq.h: No such file or directory
     #include "zmq.h"
              ^~~~~~~
    compilation terminated.

    error: command 'aarch64-linux-gnu-gcc' failed with exit status 1

    Warning: Couldn't find an acceptable libzmq on the system.
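The "zmq.h: No such file or directory" error suggests that the libzmq development headers are missing from the image. The snippet below is a minimal sketch for checking whether the header is present. The list of include directories is an assumption, and a real compiler may search other locations too. On Ubuntu-based images the header is normally provided by the libzmq3-dev package, but whether that applies to this container is also an assumption.

```python
from pathlib import Path

# Assumed search locations; a real compiler may consult additional directories.
DEFAULT_INCLUDE_DIRS = [
    "/usr/include",
    "/usr/local/include",
    "/usr/include/aarch64-linux-gnu",
]


def find_header(name, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the first path where the header file exists, or None if not found."""
    for directory in include_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    path = find_header("zmq.h")
    if path is None:
        print("zmq.h not found; the libzmq development headers are probably not installed.")
    else:
        print(f"Found {path}")
```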

dusty-nv commented 2 years ago

Yes its totally possible, let me know how to use those pre-built containers, I am following instructions in your readme

At the top of the readme is a link to the containers on NGC which include instructions on how to run the pre-built images:

https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml

zubairahmed-ai commented 2 years ago

Oh dear, thanks @dusty-nv, I will try this out.