EGL errors in docker - Githubissues

srama2512 commented 4 years ago

I followed the instructions to build the Docker image locally:

docker build . --file Pointnav_DDPPO_baseline.Dockerfile -t pointnav_submission_debug

It built successfully, but local testing via ./test_locally_pointnav_rgbd.sh resulted in the following error:

Neither `ifconfig` (`ifconfig -a`) nor `ip` (`ip address show`) commands are available, listing network interfaces is likely to fail
2020-05-05 07:03:09,730 Overwriting CNN input size of depth: (256, 256)
2020-05-05 07:03:09,731 Overwriting CNN input size of rgb: (256, 256)
2020-05-05 07:03:12,762 Model checkpoint wasn't loaded, evaluating a random model.
2020-05-05 07:03:12,777 Initializing dataset PointNav-v1
2020-05-05 07:03:12,779 initializing sim Sim-v0
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0505 07:03:12.791268    16 WindowlessContext.cpp:114] Check failed: eglDevId < numDevices [EGL] Could not find an EGL device for CUDA device 0
*** Check failure stack trace: ***
submission.sh: line 3:    16 Aborted                 (core dumped) python agent.py --evaluation $AGENT_EVALUATION_TYPE $@

I created an interactive session inside the docker container via:

docker run -v /tmp/habitat-challenge-data:/habitat-challenge-data --runtime=nvidia -it pointnav_submission_debug /bin/bash

nvidia-smi worked:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro GP100        Off  | 00000000:81:00.0 Off |                    0 |
| 26%   37C    P0    31W / 235W |      0MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro GP100        Off  | 00000000:82:00.0 Off |                    0 |
| 26%   37C    P0    29W / 235W |      0MiB / 16278MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Running a simple pytorch code on the GPU also worked:

python -c "import torch, torch.nn as nn; device = torch.device('cuda:0'); model = nn.Linear(4, 2); model.to(device);  x = torch.randn(1, 4).to(device); y = model(x); print(y)"

tensor([[0.0405, 0.1198]], device='cuda:0', grad_fn=<AddmmBackward>)
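
A working nvidia-smi and a working PyTorch call only show that the CUDA driver is visible in the container; the failing check is about EGL, which is looked up separately. A minimal sketch of an EGL-side check that could be run inside the container follows. The helper name `egl_report` is hypothetical, and the two directories are the usual libglvnd vendor-config locations, which may differ on your distribution.

```python
import ctypes.util
import glob
import os

# Usual libglvnd locations for EGL vendor (ICD) JSON files; assumed, not guaranteed.
GLVND_VENDOR_DIRS = ("/usr/share/glvnd/egl_vendor.d", "/etc/glvnd/egl_vendor.d")


def egl_report(vendor_dirs=GLVND_VENDOR_DIRS):
    """Collect basic facts about the EGL setup visible to this process."""
    vendor_files = []
    for directory in vendor_dirs:
        # A missing directory simply yields no matches.
        vendor_files.extend(sorted(glob.glob(os.path.join(directory, "*.json"))))
    return {
        # Name of the EGL library the system linker can find, or None.
        "libEGL": ctypes.util.find_library("EGL"),
        "vendor_files": vendor_files,
        # True if any vendor file name mentions nvidia (e.g. 10_nvidia.json).
        "nvidia_vendor_present": any(
            "nvidia" in os.path.basename(path).lower() for path in vendor_files
        ),
    }


if __name__ == "__main__":
    for key, value in egl_report().items():
        print(f"{key}: {value}")
```

If `libEGL` is None or no NVIDIA vendor file is listed, the container probably cannot see the host's NVIDIA EGL libraries, which would be consistent with the error above.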

erikwijmans commented 4 years ago

If your system has a non-standard EGL install, i.e. if you need to do something like export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/nvidia-opengl:${LD_LIBRARY_PATH}, you will likely need to mount /usr/lib/x86_64-linux-gnu/nvidia-opengl into the container (add -v /usr/lib/x86_64-linux-gnu/nvidia-opengl:/usr/lib/x86_64-linux-gnu/nvidia-opengl) and set LD_LIBRARY_PATH inside the docker container as well.

srama2512 commented 4 years ago

Thanks! It works now.

rpartsey commented 4 years ago

Hi @erikwijmans, I ran into the same issue that @srama2512 described above.

However, the path you suggested mounting (-v /usr/lib/x86_64-linux-gnu/nvidia-opengl) does not exist on my machine. (It looks like nvidia-opengl is not installed.)

Could you please explain what this library is used for and how I can install it?

I tried googling nvidia-opengl but couldn't find anything.

UPD: I tried the suggestions listed here, but nothing worked on my machine. I also created a new GPU instance in the cloud and, following this comment, went to https://hub.docker.com/r/nvidia/cudagl and ran all the installation commands from the Dockerfiles listed under "CUDA 10.1 update 2 + OpenGL (glvnd 1.2)" (10.1/base/Dockerfile and glvnd/devel/Dockerfile), but I still get the error described above.

I would be very grateful if somebody could help me resolve this issue or share the list of steps they ran to set up their machine.

vincent341 commented 3 years ago

Hi @rpartsey, I ran into the same problem as you did. May I ask whether you have solved it? Thanks very much.

rpartsey commented 3 years ago

Hi @vincent341, yes, I faced the same problem. The root cause was an incomplete CUDA installation.

Some packages require the CUDA development tools, which (if I'm not mistaken) must be properly installed either on your computer (the host) or inside the docker container.

But the base nvidia docker images don't include them. See the "Overview of Images" section at https://hub.docker.com/r/nvidia/cuda/.

Inspired by this Stack Overflow response, I used the devel docker image as an example and added the following RUN command, which solved the issue for me:

FROM fairembodied/habitat-challenge:testing_2020_habitat_base_docker

ARG TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX 7.5+PTX"

RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-nvml-dev-$CUDA_PKG_VERSION \
    cuda-command-line-tools-$CUDA_PKG_VERSION \
    cuda-nvprof-$CUDA_PKG_VERSION \
    cuda-npp-dev-$CUDA_PKG_VERSION \
    cuda-libraries-dev-$CUDA_PKG_VERSION \
    cuda-minimal-build-$CUDA_PKG_VERSION \
    libcublas-dev=10.2.1.243-1 \
    libnccl-dev=$NCCL_VERSION-1+cuda10.1 \
    && apt-mark hold libnccl-dev \
    && rm -rf /var/lib/apt/lists/*

# ...
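
One way to confirm that the development packages actually landed in the image is to check for their command-line tools from inside the container. The sketch below assumes the packages above put `nvcc` and `nvprof` on PATH (typically under the CUDA bin directory); the function name `missing_cuda_tools` and the tool list are illustrative, not part of the original fix.

```python
import shutil


def missing_cuda_tools(tools=("nvcc", "nvprof"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_cuda_tools()
    if missing:
        print("Missing CUDA development tools:", ", ".join(missing))
    else:
        print("All checked CUDA development tools are on PATH.")
```

An empty result only shows the tools are installed; it does not by itself prove that EGL device lookup will succeed.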

vincent341 commented 3 years ago

Hi @rpartsey ,

Thanks very much for your instructions. I'll give them a try; I'm still struggling with this issue.

facebookresearch / habitat-challenge

EGL errors in docker #40