Running container using nvidia-docker2

kekeblom commented 4 years ago

I seem to have the same issue as here https://github.com/RobotLocomotion/LabelFusion/issues/74

That is, when I run run_alignment_tool in the mounted data directory (using the provided sample data), I get: libGL error: No matching fbConfigs or visuals found. The Director GUI pops up, but it is unable to open an OpenGL context.

I get the exact same error message when I run the glxgears test program from the mesa-utils package.
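
To make that check repeatable, a throwaway image along the following lines could be used. This is only a sketch: it assumes mesa-utils is installable on top of the labelfusion image, and the choice of glxinfo over glxgears is mine, not something from the LabelFusion docs.

```dockerfile
# Hypothetical throwaway image for reproducing the libGL error.
FROM robotlocomotion/labelfusion:latest

# mesa-utils provides both glxgears and glxinfo.
RUN apt-get update && apt-get install -y --no-install-recommends mesa-utils && \
    rm -rf /var/lib/apt/lists/*

# Print a brief summary of the GLX vendor and renderer, then exit.
ENTRYPOINT ["glxinfo", "-B"]
```

It needs to be started the same way as the LabelFusion container (same GPU runtime, DISPLAY passed through, and the host's X11 socket mounted), otherwise the result says nothing about the real setup.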

My understanding of the issue is that the OpenGL libraries that are inside the container do not match those which are running on the host computer or are unable to load.

I'm running nvidia-docker2 and my Docker version is 19.03.8, build afacb8b7f0. I'm running Nvidia driver version 440.64.00 on the host machine.

It seems Nvidia does not officially support GLX on Nvidia Docker. However, they do have cudagl images available here https://hub.docker.com/r/nvidia/cudagl. I'm not exactly sure which part of that image's Dockerfile is key, but on that image I am able to run glxgears, i.e. OpenGL runs fine.

I could rebuild the container using that image as the base. I tried that, but the image no longer builds. There is an error message about vtk not being the right version. I can get around this by changing/adding -DUSE_SYSTEM_VTK:BOOL=OFF and -DUSE_PRECOMPILED_VTK=ON in the compile_all.sh script. However, the install then fails for some other reason which I didn't fully investigate. Other issues might come up as well, since the CUDA version would get bumped up to 9 and some system packages might get updated.

Probably there is just some minor glitch on my system, which is why I'm opening this issue. A comment on the original issue I referenced says "you should pull the nvidia-docker2 image, not the nvidia-docker1." and this seems to have resolved that person's issue. However, I don't know exactly what that means. I'm running nvidia-docker2 and I'm pulling the latest image from Docker Hub using nvidia-docker pull robotlocomotion/labelfusion.

Does anyone know what might be wrong here?

kekeblom commented 4 years ago

You'll all be happy to hear that I was able to solve this issue. What seems to be happening is that the OpenGL libraries inside the container were not compatible with what is running on my system. I tried both Nvidia driver versions 390 and 440, but no luck. I'm not sure what the actual issue is; it may also have something to do with how the X server is configured on the host.

What resolved the issue was installing libglvnd, a vendor-neutral dispatch layer that routes OpenGL calls to the right vendor's driver libraries. It supports GLX, which LabelFusion uses. I derived a new image based on the labelfusion image and installed those libraries the same way they are installed in the official Nvidia cudagl images. All credit to them. See the end of the message for the exact Dockerfile I used.

It seems the current setup is quite reliant on how the host machine is set up. I would create a formal pull request updating the image, but unfortunately, I was unable to build the original image. Quite a few libraries seem to have updated and some of the dependencies no longer build.

It could be worth looking into updating e.g. Director to use a newer official version to reduce the risk of this software being left behind permanently. Would anyone more familiar with these projects be able to estimate how big an undertaking that would be? What would be the main issues?

Here is the Dockerfile I used to build my image.

FROM robotlocomotion/labelfusion:latest

RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates apt-transport-https gnupg-curl && \
    NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    NVIDIA_GPGKEY_FPR=ae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80 && \
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub && \
    apt-key adv --export --no-emit-version -a $NVIDIA_GPGKEY_FPR | tail -n +5 > cudasign.pub && \
    echo "$NVIDIA_GPGKEY_SUM  cudasign.pub" | sha256sum -c --strict - && rm cudasign.pub && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --auto-remove -y gnupg-curl && \
    rm -rf /var/lib/apt/lists/*

### OpenGL

RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        ca-certificates \
        make \
        automake \
        autoconf \
        libtool \
        pkg-config \
        python \
        libxext-dev \
        libx11-dev \
        x11proto-gl-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /opt/libglvnd
RUN git clone --branch="v1.1.0" https://github.com/NVIDIA/libglvnd.git . && \
    ./autogen.sh && \
    ./configure --prefix=/usr/local --libdir=/usr/local/lib/x86_64-linux-gnu && \
    make -j"$(nproc)" install-strip && \
    find /usr/local/lib/x86_64-linux-gnu -type f -name 'lib*.la' -delete

RUN dpkg --add-architecture i386 && \
    apt-get update && apt-get install -y --no-install-recommends \
        gcc-multilib \
        libxext-dev:i386 \
        libx11-dev:i386 && \
    rm -rf /var/lib/apt/lists/*

# 32-bit libraries
RUN make distclean && \
    ./autogen.sh && \
    ./configure --prefix=/usr/local --libdir=/usr/local/lib/i386-linux-gnu --host=i386-linux-gnu "CFLAGS=-m32" "CXXFLAGS=-m32" "LDFLAGS=-m32" && \
    make -j"$(nproc)" install-strip && \
    find /usr/local/lib/i386-linux-gnu -type f -name 'lib*.la' -delete

COPY 10_nvidia.json /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json

RUN echo '/usr/local/lib/x86_64-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && \
    echo '/usr/local/lib/i386-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && \
    ldconfig

ENV LD_LIBRARY_PATH /usr/local/lib/x86_64-linux-gnu:/usr/local/lib/i386-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

WORKDIR /root

ENTRYPOINT bash -c "source /root/labelfusion/docker/docker_startup.sh && /bin/bash"
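
One thing the Dockerfile above does not set explicitly is the pair of environment variables that the NVIDIA container runtime reads to decide which GPUs and which driver libraries to expose inside the container. Whether the labelfusion base image already exports them is an assumption I have not checked; if it does not, a sketch like this could be added to the same Dockerfile:

```dockerfile
# Assumption: the base image may not already set these. They are read by
# the NVIDIA container runtime when the container starts.
ENV NVIDIA_VISIBLE_DEVICES=all
# "graphics" asks the runtime to mount the driver's OpenGL libraries;
# "compute" and "utility" cover CUDA and tools such as nvidia-smi.
ENV NVIDIA_DRIVER_CAPABILITIES=graphics,utility,compute
```

The same values can also be passed at run time with `-e` instead of being baked into the image, which keeps the image itself unchanged.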

iamlucaswolf commented 3 years ago

Thanks, @kekeblom, this resolved the issue for me! 👍🏻

Note that this expects 10_nvidia.json to be present in the Docker build context. To avoid that requirement, I replaced the COPY instruction with the one below, which should be a little more robust:

COPY --from=nvidia/opengl:1.0-glvnd-runtime-ubuntu16.04 \
  /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json \
  /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json
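
If pulling a second image just for one file is undesirable, the file could also be written inline during the build. This is a sketch only: the JSON below is what I believe NVIDIA's glvnd runtime images ship as 10_nvidia.json, so treat the contents as an assumption and compare them against the real file before relying on this.

```dockerfile
# Assumption: mirrors the 10_nvidia.json from NVIDIA's glvnd runtime
# images; verify against the actual file in nvidia/opengl.
RUN mkdir -p /usr/local/share/glvnd/egl_vendor.d && \
    printf '%s\n' \
        '{' \
        '    "file_format_version" : "1.0.0",' \
        '    "ICD" : {' \
        '        "library_path" : "libEGL_nvidia.so.0"' \
        '    }' \
        '}' \
        > /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json
```

The COPY --from variant above is still the safer choice, since it always picks up whatever NVIDIA actually ships.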

RobotLocomotion / LabelFusion

Running container using nvidia-docker2 #84