PUTvision / LunarSim


Simulation failed to run within container #4

Closed TL-4319 closed 2 months ago

TL-4319 commented 2 months ago

Hi,

Thank you for the hard work.

I recently tried this project and have been running into a problem.

This is my setup:

I have set up Docker and successfully ran the ROS 2 container following this tutorial.

I have also installed nvidia-docker2 and nvidia-container-toolkit using

sudo apt-get install -y nvidia-container-toolkit
sudo apt install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker

To double-check that the NVIDIA driver is working, I ran nvidia-smi:

(base) lagerprocessor@lagerprocessor:~/Projects/LunarSim$ nvidia-smi
Wed Jun 19 17:44:55 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:B3:00.0  On |                  N/A |
|  0%   49C    P8    47W / 320W |   1072MiB / 10008MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1946      G   /usr/bin/gnome-shell              127MiB |
|    0   N/A  N/A      2947      G   /usr/lib/xorg/Xorg                 28MiB |
|    0   N/A  N/A      3154      G   /usr/bin/gnome-shell              204MiB |
|    0   N/A  N/A      3547      G   /usr/lib/firefox/firefox          271MiB |
|    0   N/A  N/A      7416      G   /usr/lib/xorg/Xorg                218MiB |
|    0   N/A  N/A      7623      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      8116      G   /usr/lib/firefox/firefox          149MiB |
+-----------------------------------------------------------------------------+

The lunarsim:latest Docker image also builds successfully, with the following output:

[+] Building 0.5s (17/17) FINISHED                                                                                                                                                                                 
 => [internal] load build definition from Dockerfile                                                                                                                                                          0.0s
 => => transferring dockerfile: 2.09kB                                                                                                                                                                        0.0s
 => [internal] load .dockerignore                                                                                                                                                                             0.0s
 => => transferring context: 2B                                                                                                                                                                               0.0s
 => [internal] load metadata for docker.io/library/ubuntu:22.04                                                                                                                                               0.4s
 => [ 1/13] FROM docker.io/library/ubuntu:22.04@sha256:19478ce7fc2ffbce89df29fea5725a8d12e57de52eb9ea570890dc5852aac1ac                                                                                       0.0s
 => CACHED [ 2/13] RUN ln -snf /usr/share/zoneinfo/Europe/Warsaw /etc/localtime && echo Europe/Warsaw > /etc/timezone                                                                                         0.0s
 => CACHED [ 3/13] RUN apt clean                                                                                                                                                                              0.0s
 => CACHED [ 4/13] RUN apt update && apt install -y --no-install-recommends         ca-certificates         devilspie         gnupg2         mesa-utils         sudo         unzip         wget         xfce  0.0s
 => CACHED [ 5/13] RUN apt install nvidia-driver-525 -y                                                                                                                                                       0.0s
 => CACHED [ 6/13] RUN apt-get install -y libglu1 xvfb libxcursor1 libvulkan1 mesa-vulkan-drivers vulkan-tools -y                                                                                             0.0s
 => CACHED [ 7/13] RUN curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key -o /usr/share/keyrings/ros-archive-keyring.gpg                                                                0.0s
 => CACHED [ 8/13] RUN echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/ros-archive-keyring.gpg] http://packages.ros.org/ros2/ubuntu $(. /etc/os-release && echo $UBUNTU_CODENAME)  0.0s
 => CACHED [ 9/13] RUN apt update -y && apt install -y ros-humble-desktop                                                                                                                                     0.0s
 => CACHED [10/13] RUN touch /root/.bashrc  && echo "echo '|======================================================|'" >> /root/.bashrc  && echo "echo '| DO NOT SOURCE ROS IN TERMINAL WHERE YOU RUN LUNARSI  0.0s
 => CACHED [11/13] RUN mkdir -p /root/lunarsim                                                                                                                                                                0.0s
 => CACHED [12/13] RUN wget "https://github.com/PUTvision/LunarSim/releases/latest/download/LunarSim.tar.xz"  && cd /root/lunarsim  && tar xf /LunarSim.tar.xz  && rm -f /LunarSim.tar.xz  && chmod +x /root  0.0s
 => CACHED [13/13] WORKDIR /root/lunarsim                                                                                                                                                                     0.0s
 => exporting to image                                                                                                                                                                                        0.0s
 => => exporting layers                                                                                                                                                                                       0.0s
 => => writing image sha256:fe5236bd0327f4c94120aad4abca99bae9592c134d09806371ef8354852faa48                                                                                                                  0.0s
 => => naming to docker.io/library/lunarsim:latest  

Now to run the simulation, I run

bash ./run_once.sh

Then, within the container, I start the sim and get the following output:

root@lagerprocessor:~/lunarsim# ./LunarSim.x86_64 
[UnityMemory] Configuration Parameters - Can be set up in boot.config
    "memorysetup-bucket-allocator-granularity=16"
    "memorysetup-bucket-allocator-bucket-count=8"
    "memorysetup-bucket-allocator-block-size=4194304"
    "memorysetup-bucket-allocator-block-count=1"
    "memorysetup-main-allocator-block-size=16777216"
    "memorysetup-thread-allocator-block-size=16777216"
    "memorysetup-gfx-main-allocator-block-size=16777216"
    "memorysetup-gfx-thread-allocator-block-size=16777216"
    "memorysetup-cache-allocator-block-size=4194304"
    "memorysetup-typetree-allocator-block-size=2097152"
    "memorysetup-profiler-bucket-allocator-granularity=16"
    "memorysetup-profiler-bucket-allocator-bucket-count=8"
    "memorysetup-profiler-bucket-allocator-block-size=4194304"
    "memorysetup-profiler-bucket-allocator-block-count=1"
    "memorysetup-profiler-allocator-block-size=16777216"
    "memorysetup-profiler-editor-allocator-block-size=1048576"
    "memorysetup-temp-allocator-size-main=4194304"
    "memorysetup-job-temp-allocator-block-size=2097152"
    "memorysetup-job-temp-allocator-block-size-background=1048576"
    "memorysetup-job-temp-allocator-reduction-small-platforms=262144"
    "memorysetup-temp-allocator-size-background-worker=32768"
    "memorysetup-temp-allocator-size-job-worker=262144"
    "memorysetup-temp-allocator-size-preload-manager=262144"
    "memorysetup-temp-allocator-size-nav-mesh-worker=65536"
    "memorysetup-temp-allocator-size-audio-worker=65536"
    "memorysetup-temp-allocator-size-cloud-worker=32768"
    "memorysetup-temp-allocator-size-gfx=262144"
root@lagerprocessor:~/lunarsim#

It seems the GUI failed to open. I also once received a "Segmentation Fault: Core Dumped" error, but I have not been able to reproduce it.

I would appreciate any pointers on how to resolve this, or on how to enable more logging to help with debugging.
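
One way to capture more output for debugging is to wrap the launch in a small helper that saves everything the process prints and reports how it exited. The sketch below is only an illustration: `run_logged` and the log paths are hypothetical names, not part of LunarSim, and the commented usage line assumes the build honours the Unity player's standard `-logFile` argument.

```shell
#!/bin/bash
# run_logged LOGFILE CMD [ARGS...]
# Runs CMD with stdout and stderr redirected to LOGFILE, then reports
# on stderr how the command exited. Returns the command's exit status.
run_logged() {
    local logfile=$1
    shift
    local status=0
    "$@" >"$logfile" 2>&1 || status=$?
    if [ "$status" -gt 128 ]; then
        # Shells report "killed by signal N" as exit status 128 + N.
        echo "killed by signal $((status - 128)); see $logfile" >&2
    elif [ "$status" -ne 0 ]; then
        echo "exited with status $status; see $logfile" >&2
    fi
    return "$status"
}

# Hypothetical usage inside the container (paths are assumptions):
#   run_logged /tmp/lunarsim_console.log ./LunarSim.x86_64 -logFile /tmp/lunarsim_player.log
```

An exit status above 128 means the process was killed by a signal; 139 corresponds to signal 11 (SIGSEGV), which would match the segmentation fault mentioned above.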

Thanks

TL-4319 commented 2 months ago

Another thing I noticed while experimenting: within the lunarsim container, after running

source /opt/ros/humble/setup.bash

rviz2 

RViz2 failed to open and showed the following error:

root@lagerprocessor:~/lunarsim# rviz2
QStandardPaths: wrong permissions on runtime directory /tmp, 0777 instead of 0700
Unable to create glx context

This is followed by many "Failed to create an OpenGL context" errors, and then:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Unable to create the rendering window after 100 tries
Aborted (core dumped)

I then modified the run_once.sh file as follows:

#!/bin/bash
xhost +local:root
XSOCK=/tmp/.X11-unix

docker run -it --rm \
 --gpus all \
 -e DISPLAY=$DISPLAY \
 -v $XSOCK:$XSOCK \
 -v $HOME/.Xauthority:/root/.Xauthority \
 --privileged \
 --net=host \
 osrf/ros:humble-desktop bash

After running the script above and sourcing ROS 2 in the container, I was able to run RViz2 as usual.

Hopefully this little experiment provides some insight into the cause.

TL-4319 commented 2 months ago

Some more experimentation: I updated the host PC's NVIDIA driver to version 535.146.02 and changed line 38 of the Dockerfile in the repo from "RUN apt install nvidia-driver-525 -y" to "RUN apt install nvidia-driver-535 -y". The image still builds successfully.

When I run the built image and run "nvidia-smi" inside the container, I see the following:

root@lagerprocessor:~/lunarsim# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183

I also tried it with the original 525 driver but still got the same error.
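
The NVML error above generally means the user-space NVIDIA libraries inside the container come from a different driver release than the kernel module loaded on the host. A rough way to compare the two is sketched below. The helper names are hypothetical, the sketch assumes the host exposes `/proc/driver/nvidia/version`, and it compares only the first two version components because the NVML message reports just those.

```shell
#!/bin/bash
# Print the first dotted version number found on stdin (e.g. 535.146.02).
extract_driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed when two version strings share the same first two components,
# e.g. 535.183.01 and 535.183. This is a coarse check, not a full comparison.
same_driver_release() {
    local a b
    a=$(printf '%s\n' "$1" | cut -d. -f1,2)
    b=$(printf '%s\n' "$2" | cut -d. -f1,2)
    [ "$a" = "$b" ]
}

# On the host: show the version of the loaded kernel module, if available.
if [ -r /proc/driver/nvidia/version ]; then
    host_version=$(extract_driver_version < /proc/driver/nvidia/version)
    echo "host kernel module: ${host_version}"
fi

# Example with the numbers from this thread (values typed in by hand):
if same_driver_release "535.146.02" "535.183"; then
    echo "releases match"
else
    echo "releases differ"
fi
```

With the versions reported in this thread, the last check takes the "differ" branch, which is consistent with the mismatch that nvidia-smi reported.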

TL-4319 commented 2 months ago

I finally got the sim running. I needed to install the matching NVML version that the container expects (535.183 in this case). I did so by following the instructions from here to unload nvidia-drm, which is required to run the NVIDIA installer on my particular system. I also reverted the NVIDIA driver installed by the Dockerfile to 525, since it does not seem to matter.

It is still odd that the ROS 2 image from OSRF has no issue with the NVIDIA driver while this image does. Regardless, this may just be an NVIDIA driver issue, so I'll close this. Thank you for the great work.