dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Trying to build container images fails #274

Open UserName-wang opened 1 year ago

UserName-wang commented 1 year ago

Hi there,

My environment is a Jetson AGX Xavier running JetPack 5.1.1 (L4T version R35.3.1).

command: ./build.sh --name=my_container ros:humble-desktop

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.16) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Namespace(base='', build_flags='', list_packages=False, logs='', multiple=False, name='my_container', package_dirs=[''], packages=['ros:humble-desktop'], push='', show_packages=False, simulate=False, skip_errors=False, skip_packages=[''], skip_tests=[''])
-- L4T_VERSION=35.3.1
-- JETPACK_VERSION=5.1.1
-- CUDA_VERSION=11.4.315
-- LSB_RELEASE=20.04 (focal)
-- Loading /home/agx/study/nvidia/jetson-containers/packages/protobuf/protobuf_cpp/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/tensorflow/config.py
-- Package small-stable-diffusion was disabled by its config
-- Loading /home/agx/study/nvidia/jetson-containers/packages/l4t/l4t-text-generation/config.yml
-- Loading /home/agx/study/nvidia/jetson-containers/packages/l4t/l4t-pytorch/l4t-pytorch.json
-- Loading /home/agx/study/nvidia/jetson-containers/packages/l4t/l4t-tensorflow/l4t-tensorflow.json
-- Loading /home/agx/study/nvidia/jetson-containers/packages/l4t/l4t-diffusion/config.yml
-- Loading /home/agx/study/nvidia/jetson-containers/packages/tritonserver/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/pytorch/torchaudio/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/pytorch/torchvision/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/pytorch/torch2trt/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/pytorch/torch_tensorrt/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/pytorch/config.py
-- Package pytorch:1.10 isn't compatible with L4T r35.3.1 (requires L4T ==32.*)
-- Package pytorch:1.10 was disabled by its config
-- Package pytorch:1.9 isn't compatible with L4T r35.3.1 (requires L4T ==32.*)
-- Package pytorch:1.9 was disabled by its config
-- Loading /home/agx/study/nvidia/jetson-containers/packages/riva-client/config.json
-- Loading /home/agx/study/nvidia/jetson-containers/packages/deepstream/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/rapids/cuml/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/rapids/cudf/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/nemo/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/zed/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/opencv/opencv_builder/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/opencv/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/cuda-python/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/awq/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/text-generation-inference/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/xformers/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/exllama/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/llama_cpp/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/optimum/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/auto_gptq/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/llm/transformers/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/ros/config.py
-- Package ros:melodic-ros-base isn't compatible with L4T r35.3.1 (requires L4T <34)
-- Package ros:melodic-ros-base was disabled by its config
-- Package ros:melodic-ros-core isn't compatible with L4T r35.3.1 (requires L4T <34)
-- Package ros:melodic-ros-core was disabled by its config
-- Package ros:melodic-desktop isn't compatible with L4T r35.3.1 (requires L4T <34)
-- Package ros:melodic-desktop was disabled by its config
-- Loading /home/agx/study/nvidia/jetson-containers/packages/onnxruntime/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/cupy/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/pycuda/config.py
-- Loading /home/agx/study/nvidia/jetson-containers/packages/onnx/config.py
-- Building containers ['build-essential', 'python', 'cmake', 'numpy', 'opencv', 'ros:humble-desktop']
-- Building container my_container:l4t-r35.3.1-build-essential

sudo docker build --network=host --tag my_container:l4t-r35.3.1-build-essential \
--file /home/agx/study/nvidia/jetson-containers/packages/build-essential/Dockerfile \
--build-arg BASE_IMAGE=nvcr.io/nvidia/l4t-jetpack:r35.3.1 \
/home/agx/study/nvidia/jetson-containers/packages/build-essential \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230824_201519/build/my_container_l4t-r35.3.1-build-essential.txt; exit ${PIPESTATUS[0]}

#0 building with "default" instance using docker driver

#1 [internal] load .dockerignore
#1 transferring context: 2B 0.0s done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 595B done
#2 DONE 0.0s

#3 [auth] nvidia/l4t-jetpack:pull,push token for nvcr.io
#3 DONE 0.0s

#4 [internal] load metadata for nvcr.io/nvidia/l4t-jetpack:r35.3.1
#4 ERROR: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://nvcr.io/proxy_auth?scope=repository%3Anvidia%2Fl4t-jetpack%3Apull%2Cpush: 401


[internal] load metadata for nvcr.io/nvidia/l4t-jetpack:r35.3.1:

Dockerfile:7

5 | #---
6 | ARG BASE_IMAGE
7 | >>> FROM ${BASE_IMAGE}
8 |
9 | ENV DEBIAN_FRONTEND=noninteractive

ERROR: failed to solve: nvcr.io/nvidia/l4t-jetpack:r35.3.1: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://nvcr.io/proxy_auth?scope=repository%3Anvidia%2Fl4t-jetpack%3Apull%2Cpush: 401
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/build.py", line 93, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.push)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/container.py", line 119, in build_container
    status = subprocess.run(cmd.replace(NEWLINE, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo docker build --network=host --tag my_container:l4t-r35.3.1-build-essential --file /home/agx/study/nvidia/jetson-containers/packages/build-essential/Dockerfile --build-arg BASE_IMAGE=nvcr.io/nvidia/l4t-jetpack:r35.3.1 /home/agx/study/nvidia/jetson-containers/packages/build-essential 2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230824_201519/build/my_container_l4t-r35.3.1-build-essential.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.

Can someone please help me? Thank you!

dusty-nv commented 1 year ago

@UserName-wang hmm, can you try doing a sudo docker pull nvcr.io/nvidia/l4t-jetpack:r35.3.1 first?
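A minimal sketch of that step, assuming the base image tag follows the `nvcr.io/nvidia/l4t-jetpack:r<L4T version>` pattern shown in the build log above. The helper names `base_image_for`, `pull_command`, and `pull_base_image` are hypothetical and not part of jetson-containers; the sketch only assembles and runs the same `docker pull` before `build.sh` is started.

```python
import shlex
import subprocess


def base_image_for(l4t_version: str) -> str:
    """Return the l4t-jetpack image tag for an L4T version such as '35.3.1'."""
    # Assumption: the tag pattern is taken from the BASE_IMAGE build-arg in the log.
    return f"nvcr.io/nvidia/l4t-jetpack:r{l4t_version}"


def pull_command(l4t_version: str) -> list:
    """Build the argument list for pulling the base image (with sudo, as in the thread)."""
    return ["sudo", "docker", "pull", base_image_for(l4t_version)]


def pull_base_image(l4t_version: str) -> None:
    """Run the pull; raises subprocess.CalledProcessError if docker exits non-zero."""
    subprocess.run(pull_command(l4t_version), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(pull_command("35.3.1")))
```

Calling `pull_base_image("35.3.1")` before `./build.sh` would run the pull directly; this thread does not establish why pulling first avoids the 401, so treat it as a workaround rather than a diagnosis.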

UserName-wang commented 1 year ago

> @UserName-wang hmm, can you try doing a sudo docker pull nvcr.io/nvidia/l4t-jetpack:r35.3.1 first?

It works now! Thanks a lot for your help and quick reply!

UserName-wang commented 1 year ago

Dear @dusty-nv,

Now I have a new error about bitsandbytes:

#9 [5/5] RUN pip3 show bitsandbytes && python3 -c 'import bitsandbytes'
#9 1.970 Name: bitsandbytes
#9 1.971 Version: 0.39.1
#9 1.972 Summary: k-bit optimizers and matrix multiplication routines.
#9 1.972 Home-page: https://github.com/TimDettmers/bitsandbytes
#9 1.973 Author: Tim Dettmers
#9 1.974 Author-email: dettmers@cs.washington.edu
#9 1.974 License: MIT
#9 1.975 Location: /usr/local/lib/python3.8/dist-packages
#9 1.976 Requires:
#9 1.976 Required-by:
#9 8.551 /usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
#9 8.551   warn("The installed version of bitsandbytes was compiled without GPU support. "
#9 8.552
#9 8.552 ===================================BUG REPORT===================================
#9 8.552 Welcome to bitsandbytes. For bug reports, please run
#9 8.552
#9 8.552 python -m bitsandbytes
#9 8.552
#9 8.552 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
#9 8.552 ================================================================================
#9 8.552 bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cpu.so
#9 8.552 False
#9 8.552 'NoneType' object has no attribute 'cadam32bit_grad_fp32'
#9 8.552 CUDA SETUP: Required library version not found: libbitsandbytes_cpu.so. Maybe you need to compile it from source?
#9 8.552 CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
#9 8.552
#9 8.552 ================================================ERROR=====================================
#9 8.552 CUDA SETUP: CUDA detection failed! Possible reasons:
#9 8.552 1. CUDA driver not installed
#9 8.552 2. CUDA not installed
#9 8.552 3. You have multiple conflicting CUDA libraries
#9 8.552 4. Required library not pre-compiled for this bitsandbytes release!
#9 8.552 CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
#9 8.552 CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
#9 8.552 ================================================================================
#9 8.552
#9 8.552 CUDA SETUP: Problem: The main issue seems to be that the main CUDA library was not detected.
#9 8.552 CUDA SETUP: Solution 1): Your paths are probably not up-to-date. You can update them via: sudo ldconfig.
#9 8.552 CUDA SETUP: Solution 2): If you do not have sudo rights, you can do the following:
#9 8.552 CUDA SETUP: Solution 2a): Find the cuda library via: find / -name libcuda.so 2>/dev/null
#9 8.552 CUDA SETUP: Solution 2b): Once the library is found add it to the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:FOUND_PATH_FROM_2a
#9 8.552 CUDA SETUP: Solution 2c): For a permanent solution add the export from 2b into your .bashrc file, located at ~/.bashrc
#9 8.552 CUDA SETUP: Setup Failed!
#9 DONE 9.6s

#10 exporting to image
#10 exporting layers
#10 exporting layers 0.9s done
#10 writing image sha256:d590ef0eef5cf993ac8bf5d21f58a8e2c54ae40df4f53309e9b6eb69c7986638 done
#10 naming to docker.io/library/my_container:r35.3.1-bitsandbytes done
#10 DONE 0.9s

-- Testing container my_container:r35.3.1-bitsandbytes (bitsandbytes/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
my_container:r35.3.1-bitsandbytes \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_174248/test/my_container_r35.3.1-bitsandbytes_test.py.txt; exit ${PIPESTATUS[0]}

testing bitsandbytes...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
False
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.2
CUDA SETUP: Detected CUDA version 114
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Required library version not found: libbitsandbytes_cuda114_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
4. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.

dusty-nv commented 1 year ago

@UserName-wang can you post the command you ran to start the build, and the entire build log?

For some reason, it is not detecting CUDA Toolkit...
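One way to narrow this down is to check which native libraries the installed wheel actually contains. The sketch below assumes the naming seen in the test log (`libbitsandbytes_cuda114_nocublaslt.so` for CUDA 11.4 on a GPU with compute capability below 7.5, with `libbitsandbytes_cpu.so` as the fallback). The helper names are hypothetical, and the library name without the `_nocublaslt` suffix for newer GPUs is an assumption.

```python
from pathlib import Path


def expected_library(cuda_version: str, compute_capability: tuple) -> str:
    """Guess the library name bitsandbytes looks for, based on the log output."""
    # Assumption: GPUs below compute capability 7.5 use the "_nocublaslt" build,
    # as the warning in the log suggests; the plain name is assumed otherwise.
    suffix = "" if compute_capability >= (7, 5) else "_nocublaslt"
    return f"libbitsandbytes_cuda{cuda_version}{suffix}.so"


def find_libraries(package_dir: str) -> list:
    """List the libbitsandbytes_*.so files present in the package directory."""
    return sorted(p.name for p in Path(package_dir).glob("libbitsandbytes_*.so"))


def diagnose(package_dir: str, cuda_version: str, compute_capability: tuple) -> str:
    """Report whether the expected CUDA library was installed with the wheel."""
    wanted = expected_library(cuda_version, compute_capability)
    present = find_libraries(package_dir)
    if wanted in present:
        return f"{wanted} found"
    return f"{wanted} missing; present: {present or 'none'}"
```

Inside the container, `diagnose("/usr/local/lib/python3.8/dist-packages/bitsandbytes", "114", (7, 2))` would show whether the `make CUDA_VERSION=114` step produced a CUDA library that ended up in the wheel, or whether only the CPU library was packaged.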

UserName-wang commented 1 year ago

Dear @dusty-nv, the command I used was: ./build.sh --name=my_container pytorch transformers nemo

#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 512B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/my_container:r35.3.1-cmake
#3 DONE 0.0s

#4 [1/3] FROM docker.io/library/my_container:r35.3.1-cmake
#4 DONE 0.0s

#5 [2/3] RUN pip3 install --no-cache-dir --verbose git+https://github.com/onnx/onnx@main
#5 CACHED

#6 [3/3] RUN pip3 show onnx && python3 -c 'import onnx; print(onnx.__version__)'
#6 CACHED

#7 exporting to image
#7 exporting layers done
#7 writing image sha256:e0150313e6a46d8ebcf12e3df6b664c7fd0a3f4b5f2b4b0c55d39e1c37e058c9 done
#7 naming to docker.io/library/my_container:r35.3.1-onnx done
#7 DONE 0.0s
-- Testing container my_container:r35.3.1-onnx (onnx/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/onnx:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
my_container:r35.3.1-onnx \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/test/my_container_r35.3.1-onnx_test.py.txt; exit ${PIPESTATUS[0]}

testing onnx...
onnx version: 1.15.0
onnx OK

-- Building container my_container:r35.3.1-pytorch

sudo docker build --network=host --tag my_container:r35.3.1-pytorch \
--file /home/agx/study/nvidia/jetson-containers/packages/pytorch/Dockerfile \
--build-arg BASE_IMAGE=my_container:r35.3.1-onnx \
--build-arg PYTORCH_WHL="torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl" \
--build-arg PYTORCH_URL="https://nvidia.box.com/shared/static/i8pukc49h3lhak4kkn67tg9j4goqm0m7.whl" \
/home/agx/study/nvidia/jetson-containers/packages/pytorch \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/build/my_container_r35.3.1-pytorch.txt; exit ${PIPESTATUS[0]}

#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 1.87kB 0.0s done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B 0.0s done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/my_container:r35.3.1-onnx
#3 DONE 0.0s

#4 [1/6] FROM docker.io/library/my_container:r35.3.1-onnx
#4 DONE 0.0s

#5 [2/6] RUN apt-get update &&     apt-get install -y --no-install-recommends         libopenblas-dev         libopenmpi-dev             openmpi-bin             openmpi-common           gfortran        libomp-dev     && rm -rf /var/lib/apt/lists/*     && apt-get clean
#5 CACHED

#6 [5/6] RUN PYTHON_ROOT=`pip3 show torch | grep Location: | cut -d' ' -f2` &&     TORCH_CMAKE_CONFIG=$PYTHON_ROOT/torch/share/cmake/Torch/TorchConfig.cmake &&     echo "patching _GLIBCXX_USE_CXX11_ABI in ${TORCH_CMAKE_CONFIG}" &&     sed -i 's/  set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=")/  set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=0")/g' ${TORCH_CMAKE_CONFIG}
#6 CACHED

#7 [4/6] RUN python3 -c 'import torch; print(f"PyTorch version: {torch.__version__}"); print(f"CUDA available:  {torch.cuda.is_available()}"); print(f"cuDNN version:   {torch.backends.cudnn.version()}"); print(torch.__config__.show());'
#7 CACHED

#8 [3/6] RUN cd /opt &&     wget --quiet --show-progress --progress=bar:force:noscroll --no-check-certificate https://nvidia.box.com/shared/static/i8pukc49h3lhak4kkn67tg9j4goqm0m7.whl -O torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl &&     pip3 install --verbose torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl
#8 CACHED

#9 [6/6] RUN pip3 install --no-cache-dir scikit-build &&     pip3 install --no-cache-dir ninja
#9 CACHED

#10 exporting to image
#10 exporting layers done
#10 writing image sha256:dd39d49070b8d8464ce9af0a2aaf4295b0ed03c396d57fcf6842ada673488db9 done
#10 naming to docker.io/library/my_container:r35.3.1-pytorch done
#10 DONE 0.0s
-- Testing container my_container:r35.3.1-pytorch (pytorch:2.0/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/pytorch:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
my_container:r35.3.1-pytorch \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/test/my_container_r35.3.1-pytorch_test.py.txt; exit ${PIPESTATUS[0]}

testing PyTorch...
PyTorch version: 2.0.0+nv23.05
CUDA available:  True
cuDNN version:   8600
PyTorch built with:
  - GCC 9.4
  - C++ Version: 201703
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - CUDA Runtime 11.4
  - NVCC architecture flags: -gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_87,code=sm_87
  - CuDNN 8.6
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=11.4, CUDNN_VERSION=8.6.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=0, USE_NNPACK=1, USE_OPENMP=ON, USE_ROCM=OFF, 

2.0.0+nv23.5
Tensor a = tensor([0., 0.], device='cuda:0')
Tensor b = tensor([-0.9102,  0.5022], device='cuda:0')
Tensor c = tensor([-0.9102,  0.5022], device='cuda:0')
testing LAPACK (OpenBLAS)...
done testing LAPACK (OpenBLAS)
testing torch.nn (cuDNN)...
done testing torch.nn (cuDNN)
testing CPU tensor vector operations...
test.py:58: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  cpu_y = F.softmax(cpu_x)
Tensor cpu_x = tensor([12.3450])
Tensor softmax = tensor([1.])
Tensor exp (float32) = tensor([[2.7183, 2.7183, 2.7183],
        [2.7183, 2.7183, 2.7183],
        [2.7183, 2.7183, 2.7183]])
Tensor exp (float64) = tensor([[2.7183, 2.7183, 2.7183],
        [2.7183, 2.7183, 2.7183],
        [2.7183, 2.7183, 2.7183]], dtype=torch.float64)
Tensor exp (diff) = 7.429356050359104e-07
PyTorch OK

-- Building container my_container:r35.3.1-torchvision

sudo docker build --network=host --tag my_container:r35.3.1-torchvision \
--file /home/agx/study/nvidia/jetson-containers/packages/pytorch/torchvision/Dockerfile \
--build-arg BASE_IMAGE=my_container:r35.3.1-pytorch \
--build-arg TORCHVISION_VERSION="v0.15.1" \
--build-arg TORCH_CUDA_ARCH_LIST="7.2;8.7" \
/home/agx/study/nvidia/jetson-containers/packages/pytorch/torchvision \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/build/my_container_r35.3.1-torchvision.txt; exit ${PIPESTATUS[0]}

#0 building with "default" instance using docker driver

#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 1.23kB done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/my_container:r35.3.1-pytorch
#3 DONE 0.0s

#4 [1/5] FROM docker.io/library/my_container:r35.3.1-pytorch
#4 DONE 0.0s

#5 [4/5] RUN git clone --branch v0.15.1 --recursive --depth=1 https://github.com/pytorch/vision torchvision &&     cd torchvision &&     git checkout v0.15.1 &&     python3 setup.py bdist_wheel &&     cp dist/torchvision*.whl /opt &&     pip3 install --no-cache-dir --verbose /opt/torchvision*.whl &&     cd ../ &&     rm -rf torchvision
#5 CACHED

#6 [2/5] RUN printenv && echo "torchvision version = v0.15.1" && echo "TORCH_CUDA_ARCH_LIST = 7.2;8.7"
#6 CACHED

#7 [3/5] RUN apt-get update &&     apt-get install -y --no-install-recommends             libjpeg-dev         zlib1g-dev     && rm -rf /var/lib/apt/lists/*     && apt-get clean
#7 CACHED

#8 [5/5] RUN python3 -c 'import torchvision; print(torchvision.__version__);'
#8 CACHED

#9 exporting to image
#9 exporting layers done
#9 writing image sha256:058294389642a20461b808775840f30cfd19d1a01afab061a464390fb6014732 0.0s done
#9 naming to docker.io/library/my_container:r35.3.1-torchvision done
#9 DONE 0.0s
-- Testing container my_container:r35.3.1-torchvision (torchvision/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/pytorch/torchvision:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
my_container:r35.3.1-torchvision \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/test/my_container_r35.3.1-torchvision_test.py.txt; exit ${PIPESTATUS[0]}

testing torchvision...
torchvision version: 0.15.1a0+42759b1

testing torchvision extensions...
torchvision classification models: alexnet | convnext_base | convnext_large | convnext_small | convnext_tiny | densenet121 | densenet161 | densenet169 | densenet201 | efficientnet_b0 | efficientnet_b1 | efficientnet_b2 | efficientnet_b3 | efficientnet_b4 | efficientnet_b5 | efficientnet_b6 | efficientnet_b7 | efficientnet_v2_l | efficientnet_v2_m | efficientnet_v2_s | get_model | get_model_builder | get_model_weights | get_weight | googlenet | inception_v3 | list_models | maxvit_t | mnasnet0_5 | mnasnet0_75 | mnasnet1_0 | mnasnet1_3 | mobilenet_v2 | mobilenet_v3_large | mobilenet_v3_small | regnet_x_16gf | regnet_x_1_6gf | regnet_x_32gf | regnet_x_3_2gf | regnet_x_400mf | regnet_x_800mf | regnet_x_8gf | regnet_y_128gf | regnet_y_16gf | regnet_y_1_6gf | regnet_y_32gf | regnet_y_3_2gf | regnet_y_400mf | regnet_y_800mf | regnet_y_8gf | resnet101 | resnet152 | resnet18 | resnet34 | resnet50 | resnext101_32x8d | resnext101_64x4d | resnext50_32x4d | shufflenet_v2_x0_5 | shufflenet_v2_x1_0 | shufflenet_v2_x1_5 | shufflenet_v2_x2_0 | squeezenet1_0 | squeezenet1_1 | swin_b | swin_s | swin_t | swin_v2_b | swin_v2_s | swin_v2_t | vgg11 | vgg11_bn | vgg13 | vgg13_bn | vgg16 | vgg16_bn | vgg19 | vgg19_bn | vit_b_16 | vit_b_32 | vit_h_14 | vit_l_16 | vit_l_32 | wide_resnet101_2 | wide_resnet50_2

Namespace(batch_size=8, data_tar='ILSVRC2012_img_val_subset_5k.tar.gz', data_url='https://nvidia.box.com/shared/static/y1ygiahv8h75yiyh0pt50jqdqt7pohgx.gz', models=['resnet18'], print_freq=25, resolution=224, test_threshold=-10.0, use_cuda=True, workers=2)
using CUDA
dataset classes: 1000
dataset images:  5000
batch size:      8

---------------------------------------------
-- resnet18
---------------------------------------------
loading model 'resnet18'
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
loaded model 'resnet18'

resnet18 [  0/625]  Time  3.409 ( 3.409)  Acc@1  75.00 ( 75.00)  Acc@5 100.00 (100.00)
resnet18 [ 25/625]  Time  0.013 ( 0.175)  Acc@1  62.50 ( 79.81)  Acc@5  87.50 ( 94.71)
resnet18 [ 50/625]  Time  0.088 ( 0.115)  Acc@1 100.00 ( 73.53)  Acc@5 100.00 ( 91.91)
resnet18 [ 75/625]  Time  0.044 ( 0.093)  Acc@1  62.50 ( 76.48)  Acc@5  87.50 ( 92.27)
resnet18 [100/625]  Time  0.014 ( 0.082)  Acc@1  87.50 ( 78.22)  Acc@5 100.00 ( 93.07)
resnet18 [125/625]  Time  0.079 ( 0.075)  Acc@1  87.50 ( 77.38)  Acc@5 100.00 ( 93.35)
resnet18 [150/625]  Time  0.020 ( 0.071)  Acc@1  12.50 ( 76.49)  Acc@5 100.00 ( 93.38)
resnet18 [175/625]  Time  0.084 ( 0.068)  Acc@1  62.50 ( 76.42)  Acc@5 100.00 ( 93.75)
resnet18 [200/625]  Time  0.015 ( 0.066)  Acc@1 100.00 ( 76.55)  Acc@5 100.00 ( 93.72)
resnet18 [225/625]  Time  0.095 ( 0.065)  Acc@1  87.50 ( 76.77)  Acc@5 100.00 ( 93.58)
resnet18 [250/625]  Time  0.081 ( 0.064)  Acc@1  37.50 ( 76.64)  Acc@5  62.50 ( 93.63)
resnet18 [275/625]  Time  0.023 ( 0.062)  Acc@1  87.50 ( 75.72)  Acc@5 100.00 ( 93.25)
resnet18 [300/625]  Time  0.082 ( 0.062)  Acc@1  50.00 ( 74.38)  Acc@5  87.50 ( 92.07)
resnet18 [325/625]  Time  0.012 ( 0.060)  Acc@1  75.00 ( 73.27)  Acc@5 100.00 ( 91.53)
resnet18 [350/625]  Time  0.070 ( 0.060)  Acc@1  62.50 ( 72.69)  Acc@5  87.50 ( 91.10)
resnet18 [375/625]  Time  0.021 ( 0.059)  Acc@1  25.00 ( 72.64)  Acc@5  62.50 ( 90.89)
resnet18 [400/625]  Time  0.076 ( 0.059)  Acc@1  75.00 ( 71.95)  Acc@5  87.50 ( 90.27)
resnet18 [425/625]  Time  0.013 ( 0.058)  Acc@1  62.50 ( 71.33)  Acc@5  87.50 ( 89.91)
resnet18 [450/625]  Time  0.111 ( 0.057)  Acc@1  75.00 ( 71.26)  Acc@5  87.50 ( 89.99)
resnet18 [475/625]  Time  0.044 ( 0.057)  Acc@1  37.50 ( 70.75)  Acc@5  75.00 ( 89.63)
resnet18 [500/625]  Time  0.057 ( 0.056)  Acc@1  75.00 ( 70.36)  Acc@5  87.50 ( 89.25)
resnet18 [525/625]  Time  0.018 ( 0.056)  Acc@1  62.50 ( 69.96)  Acc@5  87.50 ( 89.02)
resnet18 [550/625]  Time  0.065 ( 0.056)  Acc@1 100.00 ( 69.69)  Acc@5 100.00 ( 88.77)
resnet18 [575/625]  Time  0.016 ( 0.056)  Acc@1  75.00 ( 69.34)  Acc@5  87.50 ( 88.54)
resnet18 [600/625]  Time  0.016 ( 0.055)  Acc@1  37.50 ( 69.70)  Acc@5  87.50 ( 88.71)

resnet18
   * Acc@1 69.760  Expected 69.760   Delta -0.000
   * Acc@5 88.760  Expected 89.080   Delta -0.320
   * Images/sec  145.023
   * PASS

---------------------------------------------
-- Summary
---------------------------------------------

resnet18
   * Acc@1 69.760  Expected 69.760   Delta -0.000
   * Acc@5 88.760  Expected 89.080   Delta -0.320
   * Images/sec  145.023
   * PASS

Model tests passing:  1 / 1
torchvision OK

-- Building container my_container:r35.3.1-huggingface_hub

sudo docker build --network=host --tag my_container:r35.3.1-huggingface_hub \
--file /home/agx/study/nvidia/jetson-containers/packages/llm/huggingface_hub/Dockerfile \
--build-arg BASE_IMAGE=my_container:r35.3.1-torchvision \
/home/agx/study/nvidia/jetson-containers/packages/llm/huggingface_hub \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/build/my_container_r35.3.1-huggingface_hub.txt; exit ${PIPESTATUS[0]}

#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 877B 0.0s done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B 0.0s done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/my_container:r35.3.1-torchvision
#3 DONE 0.0s

#4 [1/6] FROM docker.io/library/my_container:r35.3.1-torchvision
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 89B done
#5 DONE 0.0s

#6 [3/6] RUN pip3 install --no-cache-dir --verbose dataclasses
#6 CACHED

#7 [5/6] COPY huggingface-downloader.py /usr/local/bin/_huggingface-downloader.py
#7 CACHED

#8 [2/6] RUN pip3 install --no-cache-dir --verbose huggingface_hub[cli]
#8 CACHED

#9 [4/6] COPY huggingface-downloader /usr/local/bin/
#9 CACHED

#10 [6/6] RUN huggingface-cli --help &&     huggingface-downloader --help &&     pip3 show huggingface_hub &&     python3 -c 'import huggingface_hub; print(huggingface_hub.__version__)'
#10 CACHED

#11 exporting to image
#11 exporting layers done
#11 writing image sha256:b09017a64d7f6b45dc781a242c45ea01ed6c51e668173a8abe65ecd1872a47a8 done
#11 naming to docker.io/library/my_container:r35.3.1-huggingface_hub done
#11 DONE 0.0s
-- Testing container my_container:r35.3.1-huggingface_hub (huggingface_hub/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/llm/huggingface_hub:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
my_container:r35.3.1-huggingface_hub \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/test/my_container_r35.3.1-huggingface_hub_test.py.txt; exit ${PIPESTATUS[0]}

testing huggingface_hub...
huggingface_hub version: 0.16.4
huggingface_hub OK

-- Building container my_container:r35.3.1-rust

sudo docker build --network=host --tag my_container:r35.3.1-rust \
--file /home/agx/study/nvidia/jetson-containers/packages/rust/Dockerfile \
--build-arg BASE_IMAGE=my_container:r35.3.1-huggingface_hub \
/home/agx/study/nvidia/jetson-containers/packages/rust \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/build/my_container_r35.3.1-rust.txt; exit ${PIPESTATUS[0]}

#0 building with "default" instance using docker driver

#1 [internal] load .dockerignore
#1 transferring context: 0.0s
#1 transferring context: 2B 0.0s done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 317B 0.0s done
#2 DONE 0.1s

#3 [internal] load metadata for docker.io/library/my_container:r35.3.1-huggingface_hub
#3 DONE 0.0s

#4 [1/3] FROM docker.io/library/my_container:r35.3.1-huggingface_hub
#4 DONE 0.0s

#5 [2/3] RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
#5 CACHED

#6 [3/3] RUN rustc --version &&     pip3 install --no-cache-dir --verbose setuptools-rust
#6 CACHED

#7 exporting to image
#7 exporting layers done
#7 writing image sha256:5d8a1fe7bca2ac266e769dc10550470eb16cfecc240f5a13fbc86e1caf2922ad 0.0s done
#7 naming to docker.io/library/my_container:r35.3.1-rust 0.0s done
#7 DONE 0.0s
-- Building container my_container:r35.3.1-bitsandbytes

sudo docker build --network=host --tag my_container:r35.3.1-bitsandbytes \
--file /home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes/Dockerfile \
--build-arg BASE_IMAGE=my_container:r35.3.1-rust \
/home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/build/my_container_r35.3.1-bitsandbytes.txt; exit ${PIPESTATUS[0]}

#0 building with "default" instance using docker driver

#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 1.21kB done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/my_container:r35.3.1-rust
#3 DONE 0.0s

#4 [1/5] FROM docker.io/library/my_container:r35.3.1-rust
#4 DONE 0.0s

#5 https://api.github.com/repos/dusty-nv/bitsandbytes/git/refs/heads/main
#5 DONE 0.5s

#6 [2/5] ADD https://api.github.com/repos/dusty-nv/bitsandbytes/git/refs/heads/main /tmp/bitsandbytes_version.json
#6 CACHED

#7 [3/5] RUN pip3 uninstall -y bitsandbytes &&     cd /opt &&     git clone --depth=1 https://github.com/dusty-nv/bitsandbytes bitsandbytes &&     cd bitsandbytes &&     make CUDA_VERSION=114 -j$(nproc) cuda11x &&     python3 setup.py --verbose build_ext --inplace -j$(nproc) bdist_wheel &&     cp dist/bitsandbytes*.whl /opt &&     pip3 install --no-cache-dir --verbose /opt/bitsandbytes*.whl  &&     cd ../ &&     rm -rf bitsandbytes
#7 CACHED

#8 [4/5] RUN pip3 install --no-cache-dir --verbose scipy
#8 CACHED

#9 [5/5] RUN pip3 show bitsandbytes && python3 -c 'import bitsandbytes'
#9 CACHED

#10 exporting to image
#10 exporting layers done
#10 writing image sha256:d590ef0eef5cf993ac8bf5d21f58a8e2c54ae40df4f53309e9b6eb69c7986638 done
#10 naming to docker.io/library/my_container:r35.3.1-bitsandbytes done
#10 DONE 0.0s
-- Testing container my_container:r35.3.1-bitsandbytes (bitsandbytes/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
my_container:r35.3.1-bitsandbytes \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/test/my_container_r35.3.1-bitsandbytes_test.py.txt; exit ${PIPESTATUS[0]}

testing bitsandbytes...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
False
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.2
CUDA SETUP: Detected CUDA version 114
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Required library version not found: libbitsandbytes_cuda114_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
4. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x_nomatmul
python setup.py install
CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    import bitsandbytes
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
    raise RuntimeError('''
RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/build.py", line 93, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.push)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/container.py", line 125, in build_container
    test_container(container_name, pkg, simulate)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/container.py", line 295, in test_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo docker run -t --rm --runtime=nvidia --network=host --volume /home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes:/test --volume /home/agx/study/nvidia/jetson-containers/data:/data --workdir /test my_container:r35.3.1-bitsandbytes /bin/bash -c 'python3 test.py' 2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230825_221528/test/my_container_r35.3.1-bitsandbytes_test.py.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.
dusty-nv commented 1 year ago

@UserName-wang unfortunately that log doesn't capture the actual build: bitsandbytes was already built, so it only shows the cached output. My guess is that the build didn't find CUDA for some reason. You could run ./build.sh bitsandbytes to build and test just bitsandbytes (paste the log from that here), or run build.sh with --build-flags='--no-cache' to force a rebuild.
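To check whether a saved build log contains the real compile output or only cached layers, something like the following could help. It is a minimal sketch that assumes BuildKit-style output like the log above ("#N [i/n] ..." step headers followed by "#N CACHED"); the function name cached_steps is made up for illustration and is not part of jetson-containers.

```python
import re

# Step header as printed by BuildKit, e.g. "#7 [3/5] RUN pip3 uninstall ..."
HEADER = re.compile(r"^#(\d+) \[(\d+)/(\d+)\] (.*)$")
# Marker printed when a step was served from the layer cache, e.g. "#7 CACHED"
CACHED = re.compile(r"^#(\d+) CACHED\s*$")


def cached_steps(log_text):
    """Return the Dockerfile steps that the log reports as CACHED (hypothetical helper)."""
    headers = {}  # BuildKit step id -> "[i/n] instruction"
    cached = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            step_id, index, total, instruction = match.groups()
            headers[step_id] = f"[{index}/{total}] {instruction}"
            continue
        match = CACHED.match(line)
        if match and match.group(1) in headers:
            cached.append(headers[match.group(1)])
    return cached


if __name__ == "__main__":
    sample = """\
#7 [3/5] RUN pip3 uninstall -y bitsandbytes
#7 CACHED

#9 [5/5] RUN pip3 show bitsandbytes
#9 DONE 7.7s
"""
    for step in cached_steps(sample):
        print("cached:", step)
```

If the step that compiles bitsandbytes shows up in this list, the log will not tell you why the CUDA library is missing, and a --no-cache rebuild is needed to see the compiler output.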

Also, your docker output looks different than mine - are you using buildkit or something? This is what my sudo docker version shows:

sudo docker version
Client:
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.1
 Git commit:        20.10.21-0ubuntu1~20.04.2
 Built:             Thu Apr 27 05:56:44 2023
 OS/Arch:           linux/arm64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.1
  Git commit:       20.10.21-0ubuntu1~20.04.2
  Built:            Thu Apr 27 05:37:01 2023
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.12-0ubuntu1~20.04.3
  GitCommit:
 nvidia:
  Version:          1.1.4-0ubuntu1~20.04.3
  GitCommit:        629a689
 docker-init:
  Version:          0.19.0
  GitCommit:
UserName-wang commented 1 year ago

@dusty-nv, below is my docker version information. I guess it was automatically upgraded from the version you have. I tried both ./build.sh bitsandbytes and ./build.sh --build-flags='--no-cache' and got almost the same error, but I can't paste the whole log here because it is too long.

Client: Docker Engine - Community
 Version:           24.0.5
 API version:       1.43
 Go version:        go1.20.6
 Git commit:        ced0996
 Built:             Fri Jul 21 20:35:47 2023
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.6
  Git commit:       a61e2b4
  Built:            Fri Jul 21 20:35:47 2023
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.6.22
  GitCommit:        8165feabfdfe38c65b599c4993d227328c231fca
 nvidia:
  Version:          1.1.8
  GitCommit:        v1.1.8-0-g82f18fe
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Error after running the command: ./build.sh bitsandbytes --build-flags='--no-cache'

#9 [5/5] RUN pip3 show bitsandbytes && python3 -c 'import bitsandbytes'
#9 1.403 Name: bitsandbytes
#9 1.404 Version: 0.39.1
#9 1.404 Summary: k-bit optimizers and matrix multiplication routines.
#9 1.405 Home-page: https://github.com/TimDettmers/bitsandbytes
#9 1.406 Author: Tim Dettmers
#9 1.407 Author-email: dettmers@cs.washington.edu
#9 1.407 License: MIT
#9 1.408 Location: /usr/local/lib/python3.8/dist-packages
#9 1.409 Requires:
#9 1.409 Required-by:
#9 6.840
#9 6.840 ===================================BUG REPORT===================================
#9 6.840 Welcome to bitsandbytes. For bug reports, please run
#9 6.840
#9 6.840 python -m bitsandbytes
#9 6.840
#9 6.840  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
#9 6.840 ================================================================================
#9 6.840 bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cpu.so
#9 6.840 False
#9 6.840 'NoneType' object has no attribute 'cadam32bit_grad_fp32'
#9 6.840 CUDA SETUP: Required library version not found: libbitsandbytes_cpu.so. Maybe you need to compile it from source?
#9 6.840 CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
#9 6.840
#9 6.840 ================================================ERROR=====================================
#9 6.840 CUDA SETUP: CUDA detection failed! Possible reasons:
#9 6.840 1. CUDA driver not installed
#9 6.840 2. CUDA not installed
#9 6.840 3. You have multiple conflicting CUDA libraries
#9 6.840 4. Required library not pre-compiled for this bitsandbytes release!
#9 6.840 CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
#9 6.840 CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
#9 6.840 ================================================================================
#9 6.840
#9 6.840 CUDA SETUP: Problem: The main issue seems to be that the main CUDA library was not detected.
#9 6.840 CUDA SETUP: Solution 1): Your paths are probably not up-to-date. You can update them via: sudo ldconfig.
#9 6.840 CUDA SETUP: Solution 2): If you do not have sudo rights, you can do the following:
#9 6.840 CUDA SETUP: Solution 2a): Find the cuda library via: find / -name libcuda.so 2>/dev/null
#9 6.840 CUDA SETUP: Solution 2b): Once the library is found add it to the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:FOUND_PATH_FROM_2a
#9 6.840 CUDA SETUP: Solution 2c): For a permanent solution add the export from 2b into your .bashrc file, located at ~/.bashrc
#9 6.840 CUDA SETUP: Setup Failed!
#9 6.840 /usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
#9 6.840 warn("The installed version of bitsandbytes was compiled without GPU support. "
#9 DONE 7.7s

#10 exporting to image
#10 exporting layers
#10 exporting layers 0.8s done
#10 writing image sha256:8517102594939d2323f18851e32057c6624a5a9b4bbab4deacd1d8391f29d0a1 done
#10 naming to docker.io/library/bitsandbytes:r35.3.1-bitsandbytes done
#10 DONE 0.8s

-- Testing container bitsandbytes:r35.3.1-bitsandbytes (bitsandbytes/test.py)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes:/test \
--volume /home/agx/study/nvidia/jetson-containers/data:/data \
--workdir /test \
bitsandbytes:r35.3.1-bitsandbytes \
/bin/bash -c 'python3 test.py' \
2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230826_141319/test/bitsandbytes_r35.3.1-bitsandbytes_test.py.txt; exit ${PIPESTATUS[0]}

testing bitsandbytes...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
False
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.2
CUDA SETUP: Detected CUDA version 114
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Required library version not found: libbitsandbytes_cuda114_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
4. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x_nomatmul
python setup.py install
CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    import bitsandbytes
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
    raise RuntimeError('''
RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/build.py", line 93, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.push)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/container.py", line 125, in build_container
    test_container(container_name, pkg, simulate)
  File "/home/agx/study/nvidia/jetson-containers/jetson_containers/container.py", line 295, in test_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo docker run -t --rm --runtime=nvidia --network=host --volume /home/agx/study/nvidia/jetson-containers/packages/llm/bitsandbytes:/test --volume /home/agx/study/nvidia/jetson-containers/data:/data --workdir /test bitsandbytes:r35.3.1-bitsandbytes /bin/bash -c 'python3 test.py' 2>&1 | tee /home/agx/study/nvidia/jetson-containers/logs/20230826_141319/test/bitsandbytes_r35.3.1-bitsandbytes_test.py.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.

dusty-nv commented 1 year ago

@UserName-wang run ./build.sh bitsandbytes --build-flags='--no-cache' | tee logs/bitsandbytes.txt and then attach the log file here. I can't tell why it doesn't build with CUDA without the full log.

Have you set your default docker-runtime to nvidia like here? https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#docker-default-runtime
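A quick way to verify that setting is to inspect /etc/docker/daemon.json. The snippet below is only a sketch: it assumes the usual layout with a "runtimes" table and a "default-runtime" key, and default_runtime_is_nvidia is a hypothetical helper name, not something provided by jetson-containers or Docker.

```python
import json


def default_runtime_is_nvidia(daemon_json_text):
    """Return True if the daemon.json text registers and defaults to the nvidia runtime.

    Hypothetical helper: checks only the two keys discussed in this thread.
    """
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return "nvidia" in runtimes and config.get("default-runtime") == "nvidia"


if __name__ == "__main__":
    # Path assumed to be the standard Docker daemon config location.
    path = "/etc/docker/daemon.json"
    try:
        with open(path) as f:
            ok = default_runtime_is_nvidia(f.read())
        print(f"{path}: default runtime is nvidia: {ok}")
    except FileNotFoundError:
        print(f"{path} not found")
```

Without the default runtime set, docker build steps run without access to the GPU libraries even when docker run --runtime=nvidia works, which is why this setting matters for packages that compile against CUDA.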

UserName-wang commented 1 year ago

Dear @dusty-nv, thank you for your patience! Here is my /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "insecure-registries": ["nas:5555"]
}

And here is the log file from running ./build.sh bitsandbytes --build-flags='--no-cache' | tee logs/bitsandbytes.txt:

bitsandbytes.txt