aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacing training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.
MIT No Attribution
325 stars 176 forks

Improved Installation / Prepare Script #160

Closed larsll closed 3 months ago

larsll commented 3 months ago

Significantly simplified prepare script with the following improvements:

MarkRoss-Eviden commented 3 months ago

Initial testing was not successful, so please don't merge yet, @larsll:

image

I need to do some investigation as to what's causing it. Ideas?

larsll commented 3 months ago

What are you trying to do? (Platform, action etc.)

MarkRoss-Eviden commented 3 months ago

Need to test several cases:

1 - GPU sagemaker / CPU robomaker
2 - GPU sagemaker / GPU robomaker
3 - GPU sagemaker / CPU robomaker with OpenGL
4 - GPU sagemaker / GPU robomaker with OpenGL
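For anyone repeating these tests, the four combinations can be generated rather than typed out. This is only a sketch: `list_test_cases` is a hypothetical helper, and it prints labels rather than touching any DRfC configuration.

```shell
# Hypothetical helper: print the test matrix for this PR.
# Sagemaker always uses the GPU; robomaker and OpenGL vary.
list_test_cases() {
  n=1
  for opengl in "" " with OpenGL"; do
    for robomaker in CPU GPU; do
      echo "$n - GPU sagemaker / $robomaker robomaker$opengl"
      n=$((n + 1))
    done
  done
}

list_test_cases
```

The outer loop toggles OpenGL so that the numbering matches the list above.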

MarkRoss-Eviden commented 3 months ago

Test case 1 works with the following settings (defaults in the DOTS repo):

image

image

MarkRoss-Eviden commented 3 months ago

Test case 2 works: -

image

image

MarkRoss-Eviden commented 3 months ago

Test case 3 did not work: - image

Trying to manually run from the created instance: - image

I assume some of the config that currently works no longer works with your slimmed-down changes, @larsll. The relevant bits of code that work with the current main branch:

The AMI is created here: https://github.com/aws-deepracer-community/deepracer-on-the-spot/blob/main/scripts/image-builder.yaml. That file mainly deals with prepare and install. Perhaps I now need to add back some prerequisites you've removed?

Then, when the instance runs, this is the code that executes when you're trying to use OpenGL (https://github.com/aws-deepracer-community/deepracer-on-the-spot/blob/main/spot-instance.yaml):

# Setup required config if using OpenGL training
if [[ $DR_HOST_X == True ]];then
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed 's/\.//')
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/3bf863cc.pub
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64/7fa2af80.pub
  echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
  echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda_learn.list
  sudo apt update && sudo apt install -y nvidia-driver-470-server cuda-minimal-build-11-4 --no-install-recommends -o Dpkg::Options::="--force-overwrite"
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime xserver-xorg-dev xutils-dev
  ./utils/setup-xorg.sh
  ./utils/start-xorg.sh
  sleep 15
  export DISPLAY=$DR_DISPLAY
  sudo nohup xinit /usr/bin/jwm -- /usr/lib/xorg/Xorg $DISPLAY -config $DR_DIR/tmp/xorg.conf > $DR_DIR/tmp/xorg.log 2>&1 & sleep 1
  nohup xrandr -s 1400x900
  nohup x11vnc -bg -forever -no6 -nopw -rfbport 5901 -rfbportv6 -1 -display WAIT$DISPLAY & sleep 1
  xauth generate $DISPLAY
  export XAUTHORITY=~/.Xauthority
  sudo DISPLAY=:$DISPLAY XAUTHORITY=$(ps aux | grep "X.*\-auth" | grep -v grep | sed -n 's/.*-auth \([^ ]\+\).*/\1/p') xhost +
fi

larsll commented 3 months ago

Hmm, that piece of code is a bit of a mystery. In my test with GPU + OpenGL you only need to run setup-xorg.sh and start-xorg.sh; the rest seems to be pieces copied together from prepare.sh and those scripts... (And there have been changes to how xorg starts; have a look at the updated scripts...)
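A minimal sketch of what that reduced OpenGL block could look like, assuming the two helper scripts are all that is needed. `DR_HOST_X` and the script names come from this thread; the `start_opengl_display` wrapper and its directory argument are hypothetical, and the real scripts may expect to be run from inside the DRfC directory.

```shell
# Sketch only: start X for OpenGL training using just the two DRfC helper scripts.
# start_opengl_display and its argument are hypothetical names for illustration.
start_opengl_display() {
  drfc_dir="${1:-$HOME/deepracer-for-cloud}"
  if [ "${DR_HOST_X:-False}" != "True" ]; then
    echo "DR_HOST_X is not True; skipping X setup"
    return 0
  fi
  # Stop at the first failing step so a missing GPU or driver shows up immediately.
  bash "$drfc_dir/utils/setup-xorg.sh" || return 1
  bash "$drfc_dir/utils/start-xorg.sh" || return 1
}
```

Returning early on failure means a broken setup-xorg.sh is not followed by an attempt to start X with an incomplete configuration.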

MarkRoss-Eviden commented 3 months ago

> Hmm, that piece of code is a bit of a mystery. On my test with GPU + OpenGL then you only need to do setup-xorg.sh and start-xorg.sh; the rest seems to be pieces copied together from prepare.sh and those scripts... (And there has been changes as to how xorg starts; have a look at those updated scripts...)

It's code that I added to get OpenGL to work; before I added it, OpenGL did not work. That could be because of the approach Tyler took in the original DOTS set-up: a bunch of things run up front to 'bake' an AMI (where prepare and init are run) to speed up deployment, which means that when you deploy an instance a few things still have to run specifically on that new instance. It's part of this wider code that runs on the new instance created from the AMI:

#!/bin/bash

            source /etc/profile.d/my_custom_files.sh
            aws sns publish --topic-arn $MY_SNS_TOPIC --message "Training has initiated on a new instance.  The new url to monitor progress is http://$PUBLIC_IP:8100/menu.html" --region $DEEPRACER_REGION
            cd ~/deepracer-for-cloud
            git pull
            sed -i "s/DR_UPLOAD_S3_BUCKET=not-defined/DR_UPLOAD_S3_BUCKET=$DEEPRACER_S3_URI/" ~/deepracer-for-cloud/system.env
            sed -i "s/DR_LOCAL_S3_BUCKET=bucket/DR_LOCAL_S3_BUCKET=$DEEPRACER_S3_URI/" ~/deepracer-for-cloud/system.env
            sed -i "s/DR_UPLOAD_S3_PREFIX=upload/DR_UPLOAD_S3_PREFIX=$DR_LOCAL_S3_MODEL_PREFIX-upload/" ~/deepracer-for-cloud/run.env
            sed -i "s|DR_LOCAL_S3_CUSTOM_FILES_PREFIX=custom_files|DR_LOCAL_S3_CUSTOM_FILES_PREFIX=$CUSTOM_FILE_LOCATION|" ~/deepracer-for-cloud/run.env
            source bin/activate.sh
            dr-download-custom-files
            cp custom_files/*.env .
            dr-reload
            # Setup required config if using OpenGL training
            if [[ $DR_HOST_X == True ]];then
              distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed 's/\.//')
              sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/3bf863cc.pub
              sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64/7fa2af80.pub
              echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
              echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda_learn.list
              sudo apt update && sudo apt install -y nvidia-driver-470-server cuda-minimal-build-11-4 --no-install-recommends -o Dpkg::Options::="--force-overwrite"
              distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
              curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
              curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
              sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime xserver-xorg-dev xutils-dev
              ./utils/setup-xorg.sh
              ./utils/start-xorg.sh
              sleep 15
              export DISPLAY=$DR_DISPLAY
              sudo nohup xinit /usr/bin/jwm -- /usr/lib/xorg/Xorg $DISPLAY -config $DR_DIR/tmp/xorg.conf > $DR_DIR/tmp/xorg.log 2>&1 & sleep 1
              nohup xrandr -s 1400x900
              nohup x11vnc -bg -forever -no6 -nopw -rfbport 5901 -rfbportv6 -1 -display WAIT$DISPLAY & sleep 1
              xauth generate $DISPLAY
              export XAUTHORITY=~/.Xauthority
              sudo DISPLAY=:$DISPLAY XAUTHORITY=$(ps aux | grep "X.*\-auth" | grep -v grep | sed -n 's/.*-auth \([^ ]\+\).*/\1/p') xhost +
            fi
            # There is a bug where at some times the training fails to start, so we start, stop and start it again to reduce the occurrences of this issue. 
            nohup /bin/bash -lc 'cd ~/deepracer-for-cloud/; dr-start-training -qw; sleep 120; dr-stop-training; sleep 60; echo y | docker container prune; dr-reload; dr-start-training -qwv' &
            mkdir -p /tmp/logs/
            # We want to be able to monitor our EC2 training without needing to connect to console, so we upload all needed info to Public_IP:8100/menu.html using this script
            nohup /bin/bash -lc 'source /home/ubuntu/bin/web_monitoring.sh >/dev/null 2>&1' &
            sleep 180 > /dev/null
            while [ True ]; do
                # if the EC2 started termination process upon interruption notification, this file should exist, hence we leave termination process to manage final uploads without conflict
                if [[ -f /home/ubuntu/bin/termination.started ]];then
                  break
                fi
                # Update variable references before every iteration in case of any change on the config files
                source ~/deepracer-for-cloud/bin/activate.sh

                for name in `docker ps -a --format "{{.Names}}"`; do
                    docker logs ${name} > /tmp/logs/${name}.log 2>&1
                done
                # Only upload best Checkpoint if best Checkpoint has changed
                bestcheckpoint=$(echo n | dr-upload-model -b 2>&1 | grep "checkpoint:")
                aws s3 cp /tmp/logs/ s3://$DEEPRACER_S3_URI/$DR_LOCAL_S3_MODEL_PREFIX/logs/ --recursive
                rm -rf /tmp/logs/*.* > /dev/null 2>&1
                if [[ "$bestcheckpoint" != "$lastbestcheckpoint" && -n "$bestcheckpoint" ]];then
                  # update file timestamp just to avoid conflict with termination process
                  touch /home/ubuntu/bin/uploading_best_model.timestamp 2>&1
                  dr-upload-model -bfw > /dev/null 2>&1
                  lastbestcheckpoint=$bestcheckpoint
                fi
                sleep 120
            done

I'll have to do some further testing. Also, how long does the install take now that you've stripped it back, and does it still need reboots? If it only adds a short amount of time at startup, perhaps we could do away with the AMI approach and just run from a fresh Ubuntu image (the AMI approach was designed to reduce the time from creating an instance to training starting).

MarkRoss-Eviden commented 3 months ago

The error when using OpenGL comes from running the ./utils/setup-xorg.sh script. Output below:

Reading package lists... Done Building dependency tree Reading state information... Done screen is already the newest version (4.8.0-1ubuntu0.1). screen set to manually installed. The following additional packages will be installed: libmotif-common libtcl8.6 libtk8.6 libvncclient1 libvncserver1 libxcb-shape0 libxcomposite1 libxcursor1 libxdamage1 libxft2 libxi6 libxinerama1 libxm4 libxrandr2 libxss1 libxtst6 libxv1 libxxf86dga1 tcl tcl8.6 tk tk8.6 xbitmaps Suggested packages: menu-l10n gksu | kde-runtime | ktsuss tcl-tclreadline nickle cairo-5c xorg-docs-core xfonts-cyrillic Recommended packages: xserver-xorg | xserver The following NEW packages will be installed: libmotif-common libtcl8.6 libtk8.6 libvncclient1 libvncserver1 libxcb-shape0 libxcomposite1 libxcursor1 libxdamage1 libxft2 libxi6 libxinerama1 libxm4 libxrandr2 libxss1 libxtst6 libxv1 libxxf86dga1 menu mesa-utils mwm pkg-config tcl tcl8.6 tk tk8.6 x11-utils x11-xserver-utils x11vnc xbitmaps xinit xserver-xorg-legacy xterm 0 upgraded, 33 newly installed, 0 to remove and 0 not upgraded. Need to get 5833 kB of archives. After this operation, 19.7 MB of additional disk space will be used. 
Get:1 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 libmotif-common all 2.3.8-2build1 [10.8 kB] Get:2 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxft2 amd64 2.3.3-0ubuntu1 [39.2 kB] Get:3 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 libxm4 amd64 2.3.8-2build1 [993 kB] Get:4 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 menu amd64 2.1.47ubuntu4 [354 kB] Get:5 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 mwm amd64 2.3.8-2build1 [171 kB] Get:6 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libtcl8.6 amd64 8.6.10+dfsg-1 [902 kB] Get:7 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxss1 amd64 1:1.2.3-1 [8140 B] Get:8 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libtk8.6 amd64 8.6.10-1 [714 kB] Get:9 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/main amd64 libvncclient1 amd64 0.9.12+dfsg-9ubuntu0.3 [65.6 kB] Get:10 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/main amd64 libvncserver1 amd64 0.9.12+dfsg-9ubuntu0.3 [119 kB] Get:11 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxcb-shape0 amd64 1.14-2 [5928 B] Get:12 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxcomposite1 amd64 1:0.4.5-1 [6976 B] Get:13 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxcursor1 amd64 1:1.2.0-2 [20.1 kB] Get:14 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxdamage1 amd64 1:1.1.5-2 [6996 B] Get:15 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxi6 amd64 2:1.7.10-0ubuntu1 [29.9 kB] Get:16 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxinerama1 amd64 2:1.1.4-2 [6904 B] Get:17 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxrandr2 amd64 2:1.5.2-0ubuntu1 [18.5 kB] Get:18 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxtst6 
amd64 2:1.2.3-1 [12.8 kB] Get:19 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxv1 amd64 2:1.0.11-1 [10.7 kB] Get:20 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 libxxf86dga1 amd64 2:1.1.5-0ubuntu1 [12.0 kB] Get:21 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 pkg-config amd64 0.29.1-0ubuntu4 [45.5 kB] Get:22 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 tcl8.6 amd64 8.6.10+dfsg-1 [14.8 kB] Get:23 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 tcl amd64 8.6.9+1 [5112 B] Get:24 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 tk8.6 amd64 8.6.10-1 [12.5 kB] Get:25 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 tk amd64 8.6.9+1 [3240 B] Get:26 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 x11-utils amd64 7.7+5 [199 kB] Get:27 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 x11-xserver-utils amd64 7.7+8 [162 kB] Get:28 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 x11vnc amd64 0.9.16-3 [1006 kB] Get:29 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 xbitmaps all 1.1.1-2 [28.1 kB] Get:30 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/main amd64 xinit amd64 1.4.1-0ubuntu2 [17.9 kB] Get:31 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/universe amd64 xterm amd64 353-1ubuntu1.20.04.2 [765 kB] Get:32 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 mesa-utils amd64 8.4.0-1build1 [34.2 kB] Get:33 http://eu-north-1.ec2.archive.ubuntu.com/ubuntu focal-updates/main amd64 xserver-xorg-legacy amd64 2:1.20.13-1ubuntu1~20.04.15 [33.5 kB] Fetched 5833 kB in 1s (6745 kB/s) Extracting templates from packages: 100% Preconfiguring packages ... Selecting previously unselected package libmotif-common. (Reading database ... 82514 files and directories currently installed.) 
Preparing to unpack .../00-libmotif-common_2.3.8-2build1_all.deb ... Unpacking libmotif-common (2.3.8-2build1) ... Selecting previously unselected package libxft2:amd64. Preparing to unpack .../01-libxft2_2.3.3-0ubuntu1_amd64.deb ... Unpacking libxft2:amd64 (2.3.3-0ubuntu1) ... Selecting previously unselected package libxm4:amd64. Preparing to unpack .../02-libxm4_2.3.8-2build1_amd64.deb ... Unpacking libxm4:amd64 (2.3.8-2build1) ... Selecting previously unselected package menu. Preparing to unpack .../03-menu_2.1.47ubuntu4_amd64.deb ... Unpacking menu (2.1.47ubuntu4) ... Selecting previously unselected package mwm. Preparing to unpack .../04-mwm_2.3.8-2build1_amd64.deb ... Unpacking mwm (2.3.8-2build1) ... Selecting previously unselected package libtcl8.6:amd64. Preparing to unpack .../05-libtcl8.6_8.6.10+dfsg-1_amd64.deb ... Unpacking libtcl8.6:amd64 (8.6.10+dfsg-1) ... Selecting previously unselected package libxss1:amd64. Preparing to unpack .../06-libxss1_1%3a1.2.3-1_amd64.deb ... Unpacking libxss1:amd64 (1:1.2.3-1) ... Selecting previously unselected package libtk8.6:amd64. Preparing to unpack .../07-libtk8.6_8.6.10-1_amd64.deb ... Unpacking libtk8.6:amd64 (8.6.10-1) ... Selecting previously unselected package libvncclient1:amd64. Preparing to unpack .../08-libvncclient1_0.9.12+dfsg-9ubuntu0.3_amd64.deb ... Unpacking libvncclient1:amd64 (0.9.12+dfsg-9ubuntu0.3) ... Selecting previously unselected package libvncserver1:amd64. Preparing to unpack .../09-libvncserver1_0.9.12+dfsg-9ubuntu0.3_amd64.deb ... Unpacking libvncserver1:amd64 (0.9.12+dfsg-9ubuntu0.3) ... Selecting previously unselected package libxcb-shape0:amd64. Preparing to unpack .../10-libxcb-shape0_1.14-2_amd64.deb ... Unpacking libxcb-shape0:amd64 (1.14-2) ... Selecting previously unselected package libxcomposite1:amd64. Preparing to unpack .../11-libxcomposite1_1%3a0.4.5-1_amd64.deb ... Unpacking libxcomposite1:amd64 (1:0.4.5-1) ... Selecting previously unselected package libxcursor1:amd64. 
Preparing to unpack .../12-libxcursor1_1%3a1.2.0-2_amd64.deb ... Unpacking libxcursor1:amd64 (1:1.2.0-2) ... Selecting previously unselected package libxdamage1:amd64. Preparing to unpack .../13-libxdamage1_1%3a1.1.5-2_amd64.deb ... Unpacking libxdamage1:amd64 (1:1.1.5-2) ... Selecting previously unselected package libxi6:amd64. Preparing to unpack .../14-libxi6_2%3a1.7.10-0ubuntu1_amd64.deb ... Unpacking libxi6:amd64 (2:1.7.10-0ubuntu1) ... Selecting previously unselected package libxinerama1:amd64. Preparing to unpack .../15-libxinerama1_2%3a1.1.4-2_amd64.deb ... Unpacking libxinerama1:amd64 (2:1.1.4-2) ... Selecting previously unselected package libxrandr2:amd64. Preparing to unpack .../16-libxrandr2_2%3a1.5.2-0ubuntu1_amd64.deb ... Unpacking libxrandr2:amd64 (2:1.5.2-0ubuntu1) ... Selecting previously unselected package libxtst6:amd64. Preparing to unpack .../17-libxtst6_2%3a1.2.3-1_amd64.deb ... Unpacking libxtst6:amd64 (2:1.2.3-1) ... Selecting previously unselected package libxv1:amd64. Preparing to unpack .../18-libxv1_2%3a1.0.11-1_amd64.deb ... Unpacking libxv1:amd64 (2:1.0.11-1) ... Selecting previously unselected package libxxf86dga1:amd64. Preparing to unpack .../19-libxxf86dga1_2%3a1.1.5-0ubuntu1_amd64.deb ... Unpacking libxxf86dga1:amd64 (2:1.1.5-0ubuntu1) ... Selecting previously unselected package pkg-config. Preparing to unpack .../20-pkg-config_0.29.1-0ubuntu4_amd64.deb ... Unpacking pkg-config (0.29.1-0ubuntu4) ... Selecting previously unselected package tcl8.6. Preparing to unpack .../21-tcl8.6_8.6.10+dfsg-1_amd64.deb ... Unpacking tcl8.6 (8.6.10+dfsg-1) ... Selecting previously unselected package tcl. Preparing to unpack .../22-tcl_8.6.9+1_amd64.deb ... Unpacking tcl (8.6.9+1) ... Selecting previously unselected package tk8.6. Preparing to unpack .../23-tk8.6_8.6.10-1_amd64.deb ... Unpacking tk8.6 (8.6.10-1) ... Selecting previously unselected package tk. Preparing to unpack .../24-tk_8.6.9+1_amd64.deb ... Unpacking tk (8.6.9+1) ... 
Selecting previously unselected package x11-utils. Preparing to unpack .../25-x11-utils_7.7+5_amd64.deb ... Unpacking x11-utils (7.7+5) ... Selecting previously unselected package x11-xserver-utils. Preparing to unpack .../26-x11-xserver-utils_7.7+8_amd64.deb ... Unpacking x11-xserver-utils (7.7+8) ... Selecting previously unselected package x11vnc. Preparing to unpack .../27-x11vnc_0.9.16-3_amd64.deb ... Unpacking x11vnc (0.9.16-3) ... Selecting previously unselected package xbitmaps. Preparing to unpack .../28-xbitmaps_1.1.1-2_all.deb ... Unpacking xbitmaps (1.1.1-2) ... Selecting previously unselected package xinit. Preparing to unpack .../29-xinit_1.4.1-0ubuntu2_amd64.deb ... Unpacking xinit (1.4.1-0ubuntu2) ... Selecting previously unselected package xterm. Preparing to unpack .../30-xterm_353-1ubuntu1.20.04.2_amd64.deb ... Unpacking xterm (353-1ubuntu1.20.04.2) ... Selecting previously unselected package mesa-utils. Preparing to unpack .../31-mesa-utils_8.4.0-1build1_amd64.deb ... Unpacking mesa-utils (8.4.0-1build1) ... Selecting previously unselected package xserver-xorg-legacy. Preparing to unpack .../32-xserver-xorg-legacy_2%3a1.20.13-1ubuntu1~20.04.15_amd64.deb ... Unpacking xserver-xorg-legacy (2:1.20.13-1ubuntu1~20.04.15) ... Setting up xinit (1.4.1-0ubuntu2) ... Setting up libxft2:amd64 (2.3.3-0ubuntu1) ... Setting up libxdamage1:amd64 (1:1.1.5-2) ... Setting up libxi6:amd64 (2:1.7.10-0ubuntu1) ... Setting up libxtst6:amd64 (2:1.2.3-1) ... Setting up libxcursor1:amd64 (1:1.2.0-2) ... Setting up libxcb-shape0:amd64 (1.14-2) ... Setting up libxxf86dga1:amd64 (2:1.1.5-0ubuntu1) ... Setting up libmotif-common (2.3.8-2build1) ... Setting up libvncserver1:amd64 (0.9.12+dfsg-9ubuntu0.3) ... Setting up libvncclient1:amd64 (0.9.12+dfsg-9ubuntu0.3) ... Setting up mesa-utils (8.4.0-1build1) ... Setting up libxinerama1:amd64 (2:1.1.4-2) ... Setting up libxv1:amd64 (2:1.0.11-1) ... Setting up libxrandr2:amd64 (2:1.5.2-0ubuntu1) ... 
Setting up libtcl8.6:amd64 (8.6.10+dfsg-1) ... Setting up pkg-config (0.29.1-0ubuntu4) ... Setting up libxss1:amd64 (1:1.2.3-1) ... Setting up menu (2.1.47ubuntu4) ... Setting up libxcomposite1:amd64 (1:0.4.5-1) ... Setting up xserver-xorg-legacy (2:1.20.13-1ubuntu1~20.04.15) ... Setting up xbitmaps (1.1.1-2) ... Setting up libxm4:amd64 (2.3.8-2build1) ... Setting up tcl8.6 (8.6.10+dfsg-1) ... Setting up libtk8.6:amd64 (8.6.10-1) ... Setting up x11-xserver-utils (7.7+8) ... Setting up mwm (2.3.8-2build1) ... update-alternatives: using /usr/bin/mwm to provide /usr/bin/x-window-manager (x-window-manager) in auto mode Setting up tcl (8.6.9+1) ... Setting up x11-utils (7.7+5) ... Setting up xterm (353-1ubuntu1.20.04.2) ... update-alternatives: using /usr/bin/xterm to provide /usr/bin/x-terminal-emulator (x-terminal-emulator) in auto mode update-alternatives: using /usr/bin/lxterm to provide /usr/bin/x-terminal-emulator (x-terminal-emulator) in auto mode Setting up tk8.6 (8.6.10-1) ... Setting up tk (8.6.9+1) ... Setting up x11vnc (0.9.16-3) ... Processing triggers for mime-support (3.64ubuntu1) ... Processing triggers for hicolor-icon-theme (0.17-2) ... Processing triggers for libc-bin (2.31-0ubuntu9.14) ... Processing triggers for man-db (2.9.1-1) ... Processing triggers for install-info (6.7.0.dfsg.2-5) ... Processing triggers for menu (2.1.47ubuntu4) ...

ERROR: Unable to query GPU information

nvidia-xconfig: option "--busid" requires an argument.

Invalid commandline, please run nvidia-xconfig --help for usage information.

The BUS_ID variable in the script is not being set. Running the command that sets BUS_ID results in: image

EC2 instance does have a GPU :-), it's a g4dn.2xlarge I've tested on.
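The `--busid` error above is what nvidia-xconfig reports when it is handed an empty value. A guard along these lines would fail earlier with a clearer message. This is a sketch: `require_bus_id` is a hypothetical helper and is not part of setup-xorg.sh, and how BUS_ID is obtained is left to the existing script.

```shell
# Hypothetical helper, not part of setup-xorg.sh:
# refuse to continue when no GPU bus ID could be determined.
require_bus_id() {
  bus_id="$1"
  if [ -z "$bus_id" ]; then
    echo "No GPU bus ID found; check that the NVIDIA driver is installed" >&2
    return 1
  fi
  printf '%s\n' "$bus_id"
}

# Possible use inside a setup script (assumes BUS_ID was set earlier):
#   BUS_ID=$(require_bus_id "$BUS_ID") || exit 1
```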

Looking through the PR, I noticed this line and thought it might be related to the GPU info not being found: sudo apt install -y nvidia-driver-525-server --no-install-recommends -o Dpkg::Options::="--force-overwrite"

After running that line I can now get the GPU info back: image

So it appears the problem is that the updated DRfC code isn't detecting the GPU on the EC2 instance, and so isn't running the code that installs the NVIDIA drivers, @larsll?
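One way to check for a working driver before running the xorg setup is to probe `nvidia-smi`, which exits non-zero when it cannot talk to the driver. The sketch below is an assumption about how such a check could be written, not what the prepare script does; `gpu_driver_ready` is a hypothetical name, and the optional argument exists only so the probe command can be swapped out.

```shell
# Hypothetical check: succeeds only if the probe command exists and runs cleanly.
# Defaults to nvidia-smi, which fails when no working NVIDIA driver is loaded.
gpu_driver_ready() {
  probe="${1:-nvidia-smi}"
  command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1
}

# Possible use, with the driver package mentioned in this thread:
#   if ! gpu_driver_ready; then
#     sudo apt install -y nvidia-driver-525-server --no-install-recommends
#   fi
```

A freshly installed driver may still need a reboot before the probe succeeds.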

MarkRoss-Eviden commented 3 months ago

Existing DOTS setup output: -

NVidia SMI output when using GPU sagemaker and CPU robomaker (taken from nginx output as everything works) - image

NVidia SMI output when using GPU sagemaker and CPU robomaker with OpenGL config (taken from terminal as nginx etc doesn't come up): - image

I'll try a fresh build next

MarkRoss-Eviden commented 3 months ago

GPU for robomaker and sagemaker with OpenGL now works after removing the old NVIDIA driver install and replacing it with an up-to-date one: image

GPU for sagemaker, CPU for robomaker with OpenGL also now works: image

Think we're good to go @larsll