Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

Wrap NVTOP in docker (Impossible to initialize nvidia nvml) #42

Open chichivica opened 5 years ago

chichivica commented 5 years ago

Hi guys, thanks for the awesome tool. Could you give an example of how to wrap nvtop in Docker?

Unfortunately this one:

FROM nvidia/cuda

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    rm -rf /work/*

RUN ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1

RUN cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop

CMD ["/usr/local/bin/nvtop"]

Results in:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Impossible to initialize nvidia nvml : 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

When I try to run with:

docker run --runtime=nvidia nvtop

Any ideas?

cafeal commented 5 years ago

@chichivica I did it in my repository and uploaded the image to my Docker Hub; you can run it with the following command:

docker run --runtime nvidia --rm -ti 69guitar1015/nvtop

RuRo commented 5 years ago

@chichivica, you forgot to remove the stub .so symlinks after building in the Dockerfile. I was able to build the current nvtop version with the following Dockerfile:

FROM nvidia/cuda

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

ENTRYPOINT ["/usr/local/bin/nvtop"]

lamhoangtung commented 4 years ago

Thanks @RuRo. It worked!

lminer commented 4 years ago

I'm trying to do this in conjunction with the tensorflow dockerfile and it isn't working.

The problem seems to be that libnvidia-ml is in a different location:

/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.430.50
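
For reference, something along these lines should turn up every copy of the library in the image (illustrative):

find / -name 'libnvidia-ml*' 2>/dev/null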

I tried modifying the Dockerfile as follows, but no luck.

FROM tensorflow/tensorflow:2.2.0rc3-gpu

RUN apt-get update && apt-get install -y --no-install-recommends \
    bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
    libsox-fmt-all sox libsox-dev \
    tmux zsh vim wget git \
    nano google-perftools \
    cmake libncurses5-dev libncursesw5-dev

RUN ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. -DNVML_RETRIEVE_HEADER_ONLINE=True && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

Without -DNVML_RETRIEVE_HEADER_ONLINE=True, I get:

CMake Error at /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find NVML (missing: NVML_INCLUDE_DIRS)
Call Stack (most recent call first):
  /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  cmake/modules/FindNVML.cmake:52 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:31 (find_package)

If I add the option -DNVML_RETRIEVE_HEADER_ONLINE=True, I get:

make[2]: *** No rule to make target '/usr/local/lib/libnvidia-ml.so', needed by 'src/nvtop'.  Stop.
make[1]: *** [src/CMakeFiles/nvtop.dir/all] Error 2

Any ideas?

RuRo commented 4 years ago

@lminer You don't need the real libnvidia-ml.so file; you need the stubs. AFAIK, attempting to use the actual NVIDIA shared objects during docker build will always fail, because the shared objects shouldn't actually be inside the container. Instead, they are mounted from the host by the NVIDIA runtime (you can tell by the driver version 430.50 in the .so filename). docker build doesn't use the NVIDIA runtime by default, so the actual .so files won't be there during the build.
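
You can see this for yourself (untested illustration; the image tag is just an example): the driver library only appears when the container is started under the NVIDIA runtime.

# Without the NVIDIA runtime: grep finds no driver libraries
docker run --rm nvidia/cuda:10.1-base sh -c 'ls /usr/lib/x86_64-linux-gnu | grep nvidia-ml'

# With the NVIDIA runtime: the host's libnvidia-ml.so.* are mounted in
docker run --rm --runtime nvidia nvidia/cuda:10.1-base sh -c 'ls /usr/lib/x86_64-linux-gnu | grep nvidia-ml'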

It seems that the tensorflow folks decided to use the nvidia/cuda:*-base-* images, which contain only the bare minimum required to use GPUs, and to provide every build dependency themselves. The base and runtime images don't ship any stubs, so you are out of luck.

You'll either have to build tensorflow on your own with nvidia/cuda:*-devel-* as a base image, or provide your own stub .so files. Well, maybe I am missing some third option.
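
Actually, a multi-stage build might be that third option: compile nvtop against the stubs in a devel image, then copy only the binary into the final image. An untested sketch (the tags, stub paths and runtime package names are assumptions):

FROM nvidia/cuda:10.1-devel-ubuntu18.04 AS build

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    rm -rf /var/lib/apt/lists/*

# Build against the NVML stub that ships with the devel image
RUN ln -s /usr/local/cuda/lib64/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda/lib64/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && make && make install

FROM nvidia/cuda:10.1-base-ubuntu18.04

# nvtop still needs ncurses at run time; the real libnvidia-ml.so.1
# is mounted by the NVIDIA runtime when the container starts
RUN apt-get update && \
    apt-get install -y libncurses5 libncursesw5 && \
    rm -rf /var/lib/apt/lists/*

COPY --from=build /usr/local/bin/nvtop /usr/local/bin/nvtop

ENTRYPOINT ["/usr/local/bin/nvtop"]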

lminer commented 4 years ago

@RuRo Thanks for such a comprehensive explainer. I'll give that a shot!

VictorAtPL commented 4 years ago

@RuRo

Thank you for providing your Dockerfile. I changed the base image from nvidia/cuda to nvidia/cuda:10.1-devel-ubuntu16.04 and built the image successfully, but when I run it, I get the following error:

/usr/local/bin/nvtop: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

Edit: Oops. Forgot about --runtime nvidia.
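
For anyone else hitting this, the container has to be started under the NVIDIA runtime, e.g. (the image name here is just a placeholder):

docker run --runtime nvidia --rm -it my-nvtop-image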

lminer commented 4 years ago

Trying this with CUDA 11.0, I'm running into issues again: now the stub files aren't present. Is there something I should be installing that I haven't?

Basically, /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so doesn't exist, and I get:

CMake Error at /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find NVML (missing: NVML_INCLUDE_DIRS)
Call Stack (most recent call first):
  /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  cmake/modules/FindNVML.cmake:52 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:31 (find_package)

Here's the Dockerfile:

ARG UBUNTU_VERSION=18.04

ARG ARCH=
ARG CUDA=11.0
FROM nvidia/cuda${ARCH:+-$ARCH}:${CUDA}-base-ubuntu${UBUNTU_VERSION} as base
# ARCH and CUDA are specified again because the FROM directive resets ARGs
# (but their default value is retained if set previously)

ARG ARCH
ARG CUDA
ARG CUDNN=8.0.4.30-1
ARG CUDNN_MAJOR_VERSION=8
ARG LIB_DIR_PREFIX=x86_64
ARG LIBNVINFER=7.1.3-1
ARG LIBNVINFER_MAJOR_VERSION=7

# Needed for string substitution
SHELL ["/bin/bash", "-c"]

RUN apt-get update && apt-get install -y --no-install-recommends \
    apt-utils \
    build-essential \
    cuda-command-line-tools-${CUDA/./-} \
    libcublas-${CUDA/./-} \
    cuda-nvrtc-${CUDA/./-} \
    libcufft-${CUDA/./-} \
    libcurand-${CUDA/./-} \
    libcusolver-${CUDA/./-} \
    libcusparse-${CUDA/./-} \
    curl \
    libcudnn8=${CUDNN}+cuda${CUDA} \
    libfreetype6-dev \
    libhdf5-serial-dev \
    libzmq3-dev \
    pkg-config \
    software-properties-common \
    unzip

# Install TensorRT if not building for PowerPC
RUN [[ "${ARCH}" = "ppc64le" ]] || { apt-get update && \
    apt-get install -y --no-install-recommends libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
    libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*; }

# For CUDA profiling, TensorFlow requires CUPTI.
ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Link the libcuda stub to the location where tensorflow is searching for it and reconfigure
# dynamic linker run-time bindings
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/z-cuda-stubs.conf \
    && ldconfig

RUN apt-get update && apt-get install -y --no-install-recommends \
    bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
    libsox-fmt-all sox libsox-dev htop python3 \
    tmux zsh vim wget git git-lfs \
    nano google-perftools unzip \
    cmake libncurses5-dev libncursesw5-dev python3-dev

# See http://bugs.python.org/issue19846
ENV LANG C.UTF-8

SHELL ["/usr/bin/zsh", "-c"]

# install nvtop
RUN ln -s /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so && \
    ln -s /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. && \
    make && \
    make install && \
    cd / && \
    rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so && \
    rm /usr/local/lib/libnvidia-ml.so.1

RuRo commented 4 years ago

@lminer As I already mentioned, nvidia/cuda:*-base-* images don't have stubs. You'll have to build with nvidia/cuda:*-devel-* or manually add stubs to the base image.
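
If you want to stay on the base image, here is an untested sketch of the "manually add stubs" route: borrow just the NVML stub from a devel image with COPY --from, and fetch the NVML header online during the build (the tag and stub path are assumptions based on the CUDA 11.0 layout above):

FROM nvidia/cuda:11.0-base-ubuntu18.04

RUN apt-get update && \
    apt-get install -y cmake libncurses5-dev libncursesw5-dev git && \
    rm -rf /var/lib/apt/lists/*

# Borrow only the NVML stub from the devel image
COPY --from=nvidia/cuda:11.0-devel-ubuntu18.04 \
    /usr/local/cuda-11.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so \
    /usr/local/lib/libnvidia-ml.so

RUN ln -s /usr/local/lib/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1 && \
    cd /tmp && \
    git clone https://github.com/Syllo/nvtop.git && \
    mkdir -p nvtop/build && cd nvtop/build && \
    cmake .. -DNVML_RETRIEVE_HEADER_ONLINE=True && \
    make && make install && \
    cd / && rm -r /tmp/nvtop && \
    rm /usr/local/lib/libnvidia-ml.so /usr/local/lib/libnvidia-ml.so.1

ENTRYPOINT ["/usr/local/bin/nvtop"]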

lminer commented 4 years ago

Wow you're right. Sorry about that. Thanks for being so patient with me.

qwertychouskie commented 11 months ago

Now that this repository contains a pre-made Dockerfile, this should probably be closed.