ceccocats / tkDNN

Deep neural network library and toolkit to do high-performance inference on NVIDIA Jetson platforms
GNU General Public License v2.0

Running tkDNN in a container inside Jetson AGX Xavier #221

Closed: iripatx closed this issue 3 years ago

iripatx commented 3 years ago

I want to say sorry in advance if this is more of a theoretical misunderstanding on my part, since I'm only starting to work with Jetson devices.

We want to run some tests using tkDNN in a Docker container on the Jetson. I tried using the pre-built Docker image, but I get the following error:

standard_init_linux.go:211: exec user process caused "exec format error"

After some research, I found that the error is likely caused by the image's architecture (amd64) not matching the Jetson's CPU architecture (arm64). I'm wondering if I'm missing something in the configuration, or if the Docker images are simply meant for testing the library on x86 devices.
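
For anyone hitting the same error, a quick way to confirm the mismatch is to compare the image's architecture with the host's (a sketch; the image name is a placeholder):

docker image inspect --format '{{.Architecture}}' <tkdnn-image>   # prints amd64 for an x86 build
uname -m                                                          # prints aarch64 on a Jetson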

Thank you for your time.

iripatx commented 3 years ago

So after more searching I found out this is a duplicate of https://github.com/ceccocats/tkDNN/issues/127.

I'm closing the issue myself. Sorry for the confusion.

iripatx commented 3 years ago

I'll document my solution just in case anyone searches for the same problem.

I ended up using the NVIDIA L4T ML Docker image. I extended it a bit to add some tools (make, cmake, yaml...) and ran it using JetPack's NVIDIA container runtime. You can also install ROS if you need it.

I'm aware that this image contains many tools that are not needed; I picked it for some quick tests, though it would be better to take the L4T base image and extend it.
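
A minimal sketch of that approach (the image tag and package list are assumptions; match the tag to your JetPack version):

FROM nvcr.io/nvidia/l4t-ml:r32.4.4-py3

# extra build tools on top of the L4T ML image
RUN apt-get update \
    && apt-get -y install --no-install-recommends \
       make cmake libyaml-cpp-dev \
    && rm -rf /var/lib/apt/lists/*

Then run it with JetPack's container runtime so the CUDA libraries are mounted in:

docker run --runtime nvidia -it <image>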

hlacikd commented 3 years ago

I'm aware that this image contains many tools that are not needed; I picked it for some quick tests, though it would be better to take the L4T base image and extend it.

I use l4t-base; the Dockerfile looks like this:


ARG BUILD_IMAGE=nvcr.io/nvidia/l4t-base:r32.4.4
ARG BASE_IMAGE=${BUILD_IMAGE}

FROM ${BUILD_IMAGE} as builder

RUN apt-get update \
    && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends \
    build-essential cmake git ninja-build \
    libgtk-3-dev python3-dev python3-numpy \
    ca-certificates file \
    libeigen3-dev libyaml-cpp-dev libssl-dev \
    #
    # Clean up
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*

# CMAKE
WORKDIR /usr/local/src
ARG CTAG=v3.18.4
RUN git clone --depth 1 --branch ${CTAG} https://github.com/Kitware/CMake.git \
    && mkdir cmake_build

WORKDIR /usr/local/src/cmake_build
RUN cmake \
    -G Ninja \
    /usr/local/src/CMake
RUN ninja -j$(nproc) \
    && ninja install -j$(nproc)

# OPENCV
# https://docs.opencv.org/master/d2/de6/tutorial_py_setup_in_ubuntu.html
WORKDIR /usr/local/src
ARG CVTAG=4.5.0
RUN git clone --depth 1 --branch ${CVTAG} https://github.com/opencv/opencv.git \
    && git clone --depth 1 --branch ${CVTAG} https://github.com/opencv/opencv_contrib.git \
    && mkdir opencv_build

WORKDIR /usr/local/src/opencv_build
RUN cmake \
    -G Ninja \
    -D WITH_CUDA=ON \
    -D CUDA_ARCH_BIN='5.3 7.2' \
    -D CUDA_FAST_MATH=ON \
    -D OPENCV_DNN_CUDA=ON \
    -D OPENCV_EXTRA_MODULES_PATH=/usr/local/src/opencv_contrib/modules \
    /usr/local/src/opencv
RUN ninja -j$(nproc) \
    && ninja install -j$(nproc) \
    && ninja package -j$(nproc)

# TKDNN
WORKDIR /usr/local/src
ARG TTAG=master
RUN git clone --depth 1 --branch ${TTAG} https://github.com/ceccocats/tkDNN.git \
    && mkdir tkdnn_build

WORKDIR /usr/local/src/tkdnn_build
RUN cmake \
    -G Ninja \
    -D CMAKE_INSTALL_PREFIX=/usr/local/tkdnn \
    /usr/local/src/tkDNN
RUN ninja -j$(nproc) \
    && ninja install -j$(nproc)

# FINAL IMAGE
FROM ${BASE_IMAGE}

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
    && apt-get -y install --no-install-recommends \
    libyaml-cpp0.5v5 python3-numpy \
    #
    # Clean up
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*

# install opencv
COPY --from=builder /usr/local/src/opencv_build/OpenCV-*-aarch64.sh /tmp/
RUN /tmp/OpenCV-*-aarch64.sh --skip-license --prefix=/usr/local \
    && rm /tmp/OpenCV-*-aarch64.sh

# install tkdnn
# COPY --from=builder /usr/local/tkdnn /usr/local/tkdnn
# RUN echo "/usr/local/tkdnn/lib" > /etc/ld.so.conf.d/tkdnn.conf \
#     && ldconfig
# ENV PATH=$PATH:/usr/local/tkdnn/bin
COPY --from=builder /usr/local/tkdnn/bin /usr/local/bin
COPY --from=builder /usr/local/tkdnn/lib /usr/local/lib
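
A possible build-and-run invocation for this Dockerfile (the tag name is a placeholder; note that the OpenCV and tkDNN stages need CUDA, cuDNN and TensorRT visible at build time, which on Jetson means the NVIDIA runtime must be the default, as discussed further below):

docker build -t tkdnn:r32.4.4 .
docker run --rm -it --runtime nvidia tkdnn:r32.4.4
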
iripatx commented 3 years ago

Thank you very much for sharing! :)

masip85 commented 2 years ago

How is this container going to compile if it doesn't have cuDNN? What am I not seeing here?

hlacikd commented 2 years ago

How is this container going to compile if it doesn't have cuDNN? What am I not seeing here?

Well, L4T containers are meant to be used on their Jetson products; the OS contains a modified nvidia-container-runtime which mounts the CUDA and cuDNN libraries from the OS into the container at run time. As much as I hate it, their reason is to keep container images smaller.
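
A quick way to see this behavior (a sketch assuming a stock JetPack 4.4 host; exact library paths vary by release):

# without the NVIDIA runtime, the CUDA/cuDNN paths are missing inside the container
docker run --rm nvcr.io/nvidia/l4t-base:r32.4.4 \
    ls /usr/local/cuda /usr/lib/aarch64-linux-gnu/libcudnn.so.8

# with it, the same paths appear, bind-mounted from the host
docker run --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.4.4 \
    ls /usr/local/cuda /usr/lib/aarch64-linux-gnu/libcudnn.so.8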

masip85 commented 2 years ago

But as far as I know, that container only mounts CUDA, not cuDNN or TensorRT. In fact, I've tested it, and it doesn't detect cuDNN when compiling OpenCV. Maybe I am doing something wrong, like the build command. Does the build command need to specify the NVIDIA runtime like the run command does?

hlacikd commented 2 years ago

But as far as I know, that container only mounts CUDA, not cuDNN or TensorRT. In fact, I've tested it, and it doesn't detect cuDNN when compiling OpenCV. Maybe I am doing something wrong, like the build command. Does the build command need to specify the NVIDIA runtime like the run command does?

Then let me extend your knowledge ;)

JetPack has the following packages:

- nvidia-container-csv-cuda
- nvidia-container-csv-cudnn
- nvidia-container-csv-tensorrt

They depend on nvidia-cuda, nvidia-cudnn8, and nvidia-tensorrt.

When installed, the following files are deployed:

lzzii@jtsna-2109beta1:~$ ls /etc/nvidia-container-runtime/host-files-for-container.d/
cuda.csv  cudnn.csv  l4t.csv  tensorrt.csv  visionworks.csv

which instructs nvidia-container-runtime to mount the CUDA/cuDNN/TensorRT libraries inside the l4t-base image.
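
The CSV files are plain lists of host files for the runtime to bind-mount; the entries look roughly like this (a sketch of cudnn.csv; exact paths and versions vary by JetPack release):

lib, /usr/lib/aarch64-linux-gnu/libcudnn.so.8.0.0
sym, /usr/lib/aarch64-linux-gnu/libcudnn.so.8
sym, /usr/lib/aarch64-linux-gnu/libcudnn.so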

You're welcome ;)

masip85 commented 2 years ago

Thank you very much. OK, now I'm starting to see how this works. My JetPack 4.4 installed everything as expected:

libnvidia-container-tools/stable,now 0.9.0~beta.1 arm64 [installed]
libnvidia-container0/stable,now 0.9.0~beta.1 arm64 [installed]
nvidia-container/stable 4.4.1-b50 arm64
nvidia-container-csv-cuda/stable 10.2.89-1 arm64 [upgradable from: 10.2.89-1]
nvidia-container-csv-cudnn/stable,now 8.0.0.180-1+cuda10.2 arm64 [installed]
nvidia-container-csv-tensorrt/stable,now 7.1.3.0-1+cuda10.2 arm64 [installed]
nvidia-container-csv-visionworks/stable,now 1.6.0.501 arm64 [installed]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed]

and:

ls /etc/nvidia-container-runtime/host-files-for-container.d/
cuda.csv  cudnn.csv  l4t.csv  tensorrt.csv  visionworks.csv

But your Dockerfile only works when I copy-paste its commands inside a container started with --runtime nvidia; it doesn't work if I use it in a docker build. So I guess the build isn't pointed at the NVIDIA runtime. Am I right?
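
For reference, the commonly documented workaround for this (an assumption, as it is not confirmed in this thread) is that docker build always uses Docker's default runtime, so the CSV mounts are only visible during a build if nvidia is made the default in /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

Restart the daemon afterwards (sudo systemctl restart docker) and rebuild.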

ckurtz22 commented 2 years ago

Thanks for the Dockerfile! I'm still having an issue where it crashes at runtime when initializing a Yolo3Detector instance. The specific place it crashes is https://github.com/ceccocats/tkDNN/blob/master/src/NetworkRT.cpp#L37, when calling builderRT->platformHasFastFp16(). Any tips for fixing this?

Edit: I did more debugging; when I run the tkDNN tests, they fail with the message CUDNN failure: CUDNN_STATUS_NOT_INITIALIZED