Monash-Connected-Autonomous-Vehicle / ENV200-Shadow-Repo


Drivers Fix #7

Closed hubdwoo closed 7 months ago

hubdwoo commented 9 months ago

The aim of this task is to fix the graphics drivers inside the ENV

hubdwoo commented 9 months ago

10-10-23

Tried to install BalenaOS. I didn't reach an installation step, but some BalenaOS data has ended up on the hard drive: the machine loads the BalenaOS logo on boot and then freezes.

It seems that BalenaOS doesn't have an installation step. It installs right away without letting me select a target drive, which is annoying because the old Ubuntu 22.04 install is now overwritten.

As a fix for now, I've installed another Ubuntu 22.04 (OEM version) on the other 2TB drive.

Image

AnthonyZhOon commented 9 months ago

Image

To run nvcc, we needed to use the devel image tag instead of base, because the dev tools are only included in devel. The NVIDIA container was created with:
sudo docker run -d --runtime=nvidia --gpus all -it nvidia/cuda:12.2.0-devel-ubuntu22.04
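A minimal sketch of how the base/devel difference could be checked, assuming Docker and the NVIDIA Container Toolkit are already set up on the host. The helper names cuda_image and check_nvcc are hypothetical and not part of any tool; the image naming follows the tag used above.

```shell
# Build an nvidia/cuda image reference from its parts.
# $1 = CUDA version, $2 = flavour (base, runtime or devel), $3 = distro suffix.
# Of the three flavours, only devel ships nvcc.
cuda_image() {
    printf 'nvidia/cuda:%s-%s-%s\n' "$1" "$2" "$3"
}

# Run `nvcc --version` in a throwaway container built from that image.
# Needs Docker plus the NVIDIA runtime, so it is only defined here, not called.
check_nvcc() {
    sudo docker run --rm --runtime=nvidia --gpus all \
        "$(cuda_image "$1" "$2" "$3")" nvcc --version
}

# Example:
#   check_nvcc 12.2.0 devel ubuntu22.04   # should print the nvcc version
#   check_nvcc 12.2.0 base ubuntu22.04    # expected to fail: nvcc not found
```

Using --rm instead of -d keeps the check from leaving a stopped container behind.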

hubdwoo commented 8 months ago

23-10-23

Things that I have tried:

What to do next:

Ways to fix the-hive-1:

hubdwoo commented 8 months ago

29-10-23

What has been done

What next

hubdwoo commented 8 months ago

30-10-23

Image

Logging in over SSH keeps failing. I need this local connection to the BalenaOS machine because I have to push Dockerfiles to it in order to run containerised applications.

It might be worth configuring the OS image first, because balena-cli has options for adding SSH keys and other settings.

hubdwoo commented 8 months ago

7-11-23

image

balena-cli has a 'balena ssh' option. It first sends the connection to Balena's servers (the ones I have connected to seem to be located in Europe) and then forwards it to my machine, so connecting this way has a large overhead.

An alternative is to use plain SSH on port 22222, which has no authentication in place. Note that plain SSH is not permitted in some cases, such as when using a production image instead of a development one. As I ran into in my previous post, a production image does not enable plain SSH. Balena Thread

There are also some differences between production and development images for the 'balena ssh' command.
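The direct connection can be sketched as below, assuming a development image and a device reachable on the local network. The helper names local_ssh_cmd and port_open are hypothetical, and port_open assumes an nc build that supports the -z and -w flags.

```shell
# Print the direct SSH command for a balenaOS development image.
# Port 22222 and the root user are the ones used in this thread;
# $1 is the device's IP address on the local network.
local_ssh_cmd() {
    printf 'ssh -p 22222 root@%s\n' "$1"
}

# Return 0 if something accepts TCP connections on the device's port 22222.
# Only defined here, not called, since it needs network access.
port_open() {
    nc -z -w 3 "$1" 22222
}

# Example:
#   port_open 192.168.10.157 && eval "$(local_ssh_cmd 192.168.10.157)"
```

If port_open fails on a device that is otherwise reachable, one thing worth checking is whether it is running a production image.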

image

Right now I am trying to create a Dockerfile to push to the Balena machine.

hubdwoo commented 8 months ago

9-11-23

I tried creating a custom Dockerfile that derives its image from Ubuntu 22.04 and installs the NVIDIA drivers. It didn't work. After reading more in this article, I found these examples, which are Dockerfiles made by Balena that can run CUDA with the NVIDIA drivers. But when I tried one, it showed an error when I ran nvidia-smi.

hubdwoo commented 8 months ago

12-11-23

I tried keeping the same balenaOS version but changing the NVIDIA drivers to the newest ones to see whether that works. It still does not work.

These errors pop up:
gpu rmmod: ERROR: Module nouveau is in use
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: Invalid module format
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: Invalid module format
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: Invalid module format
gpu NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
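"Invalid module format" from insmod often means the .ko file was built against a different kernel than the one that is running, although other causes are possible. A rough way to check this, assuming modinfo is available where the modules live, is to compare the module's vermagic string with uname -r. The helper names module_kernel and same_kernel are hypothetical.

```shell
# Print the kernel release a module was built for.
# `modinfo -F vermagic` prints the release followed by build flags;
# the first space-separated field is the kernel release.
module_kernel() {
    modinfo -F vermagic "$1" | cut -d' ' -f1
}

# Compare the running kernel release ($1) with the module's ($2).
same_kernel() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: running $1, module built for $2"
    fi
}

# Example, run where the modules are located:
#   same_kernel "$(uname -r)" "$(module_kernel /nvidia/driver/nvidia.ko)"
```

A mismatch would suggest that the kernel source used at build time does not correspond to the balenaOS version on the device.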

I tried changing the balenaOS version to the one we currently have, together with our Balena machine name, but it still does not work. When pushing the container with the latest balenaOS version, it seems that the address does not exist:
[gpu] Step 13/21 : RUN curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"
[gpu] ---> Running in b392845bd4e4
[gpu] curl: (22) The requested URL returned error: 404
[gpu]
[gpu]
[gpu] gzip: stdin: unexpected end of file
[gpu]
[gpu] tar: Child returned status 1
[gpu] tar: Error is not recoverable: exiting now
[gpu]
[gpu] Removing intermediate container b392845bd4e4
[gpu] The command '/bin/bash -o pipefail -c curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"' returned a non-zero code: 2
[Info] Uploading images
[Success] Successfully uploaded images
[Error] Some services failed to build:
[Error] Service: gpu
[Error] Error: The command '/bin/bash -o pipefail -c curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"' returned a non-zero code: 2
[Info] Built on 9214c56
[Error] Not deploying release. Remote build failed
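Since the build fails on a 404, it may save time to check that the kernel source archive exists before pushing. The sketch below rebuilds the URL in the same pattern as the Dockerfile step and sends a HEAD request. The helper names kernel_source_url and url_exists are hypothetical, and it assumes the file server answers HEAD requests the same way it answers GET.

```shell
# Build the kernel source URL in the pattern used by the Dockerfile step.
# $1 = Balena machine name, $2 = balenaOS version.
kernel_source_url() {
    printf 'https://files.balena-cloud.com/images/%s/%s/kernel_source.tar.gz\n' "$1" "$2"
}

# Return 0 if the URL answers a HEAD request successfully.
# -f makes curl exit non-zero on HTTP errors such as 404.
# Only defined here, not called, since it needs network access.
url_exists() {
    curl -fsSI -o /dev/null "$1"
}

# Example (machine name and version are placeholders, not values from this thread):
#   url_exists "$(kernel_source_url my-machine-name 1.2.3)" \
#       || echo "no kernel source published for this combination"
```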

dylan-gonzalez commented 8 months ago

BalenaOS has been installed by @hubdwoo on the Hive 1

To access it remotely: Go to Balena Cloud login user: mcavvrav@gmail.com pass: Balena1&!

To access locally: ssh root@192.168.10.157 -p 22222

hubdwoo commented 8 months ago

@Jiawei-Liao @dylan-gonzalez Use this Notion page to get the private key for SSHing into the Hive through Balena's VPN. Apparently something is wrong with the website, which does not let me add a new key.

hubdwoo commented 7 months ago

What to do next

hubdwoo commented 7 months ago

7-12-23

Using the Open Source NVIDIA drivers works by

What next:

image

hubdwoo commented 7 months ago

12-12-23

Image

Nvidia thread on XInit error message

The -m flag specifies the kernel module directory. At the time of writing there are only two options: the open-source modules and the normal proprietary ones. Specifying -m=kernel-open uses the open-source kernel modules.

The NVIDIA.run file is a shell script with compressed data attached: the kernel modules themselves are embedded in it in binary form (or maybe I don't have the right decoder). Once it is executed, it unpacks the binary data and runs the installer.
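The installer invocation described above can be sketched as follows. NVIDIA.run stands in for the real installer file name, and the helper names installer_cmd and extract_run are hypothetical. The --extract-only option unpacks the archive without installing, which makes it possible to look at the module directories directly.

```shell
# Print the installer command for a given .run file and module directory.
# $1 = path to the .run file, $2 = module directory (kernel-open selects
# the open-source kernel modules).
installer_cmd() {
    printf 'sudo sh %s -m=%s\n' "$1" "$2"
}

# Unpack the .run file into a directory next to it without installing.
# Only defined here, not called, since it needs the installer file.
extract_run() {
    sh "$1" --extract-only
}

# Example:
#   extract_run ./NVIDIA.run                      # inspect the unpacked contents
#   eval "$(installer_cmd ./NVIDIA.run kernel-open)"
```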

hubdwoo commented 7 months ago

12-2-23

I am going to close this issue since it is fixed. On 14-2-23 I will install CUDA and run a CUDA C program to test it. If there are any issues with that, I will reopen the issue.

hubdwoo commented 7 months ago

14-2-23

CUDA is installed, and I have tried running a CUDA program: it works.

Image