hubdwoo commented 9 months ago

The aim of this task is to fix the graphics drivers inside the ENV

[x] Run BalenaOS inside the ENV
[x] Run a Docker environment with NVIDIA drivers enabled to run the sensors

hubdwoo commented 9 months ago

10-10-23

Tried to install BalenaOS. I never reached an installation step, but some BalenaOS data ended up on the hard drive: the machine now loads the BalenaOS logo on boot and then freezes.

It seems that BalenaOS has no installation step. It writes itself to a drive right away without asking which one, which is annoying because the old Ubuntu 22.04 install is now overwritten.

To work around this for now, I've installed another Ubuntu 22.04 (the OEM version) on the other 2TB drive.

Might want to try BalenaOS with MBR instead of GPT, since I am currently using GPT. It shouldn't matter much for OS installation, but I think what BalenaOS is referring to is that legacy MBR goes with BIOS boot while GPT goes with UEFI boot. In theory BalenaOS on GPT should work, because if it supports UEFI it should support BIOS, but it's worth a try I guess.
Another option is to use a different flash drive. I think the flash drive in the workshop isn't functioning correctly.

AnthonyZhOon commented 9 months ago

To run nvcc, we needed the devel image instead of base, because base does not include the dev tools. This command was used to create the NVIDIA container:
sudo docker run -d --runtime=nvidia --gpus all -it nvidia/cuda:12.2.0-devel-ubuntu22.04
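A quick way to confirm the difference between the two flavors is to ask a throwaway container for nvcc. The sketch below only prints the commands by default; setting APPLY=1 runs them, which assumes Docker and the NVIDIA container runtime are installed. The helper names `run` and `cuda_image` are made up for illustration.

```shell
#!/usr/bin/env bash
# Dry-run helper: print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Build an nvidia/cuda tag from version, flavor (base|runtime|devel) and distro.
cuda_image() {
  printf 'nvidia/cuda:%s-%s-%s\n' "$1" "$2" "$3"
}

# The devel flavor ships the CUDA toolkit, so nvcc should be found;
# with the base flavor the same command is expected to fail.
run docker run --rm --runtime=nvidia --gpus all "$(cuda_image 12.2.0 devel ubuntu22.04)" nvcc --version
run docker run --rm --runtime=nvidia --gpus all "$(cuda_image 12.2.0 base ubuntu22.04)" nvcc --version
```

Using `--rm` here (rather than `-d -it` as in the original command) keeps the check from leaving containers behind.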

hubdwoo commented 8 months ago

23-10-23

Things that I have tried:

Followed this article
Reset the CMOS battery for the-hive-1
Since the CMOS battery was reset, the SATA operation mode changed from AHCI to RAID, so the machine can no longer boot properly

What to do next:

Try BalenaOS on a personal computer, because there is more control over the BIOS
Might ask Streetdrone how they installed BalenaOS

Ways to fix the-hive-1:

Contact eSolutions to get the BIOS password in order to change the SATA operation mode
Tried stuff from this article
I believe this is a much better solution compared to the previous article

hubdwoo commented 8 months ago

29-10-23

What has been done

Installed a new BalenaOS onto the computer. It turns out that we had already installed it successfully on the ENV200 before: it only shows a static logo, which is the expected outcome. To connect, we need to access it through the BalenaCloud platform, with the machine connected to the internet over Ethernet.
As in the previous logs, it does not have an installation wizard.

What next

Figure out how to get the graphics drivers working inside the Docker container
Improve the networking side, because it seems very slow at the moment
Research more about BalenaOS
Check whether it works when the user isn't on the same local network as the fleet/server

hubdwoo commented 8 months ago

30-10-23

Logging in using SSH keeps failing. This local connection to the BalenaOS machine is needed because I have to push Dockerfiles to the machine to run containerised applications.

Might want to configure the OS image, because balena-cli has options to add SSH keys and so on.

hubdwoo commented 8 months ago

7-11-23

Balena-cli has an option for Balena SSH. It works by first sending the connection to Balena's servers (from the ones I have connected to, it seems that the servers are located in Europe), which then forward the connection to my machine. This means that connecting to the machine has a large overhead.

An alternative is to use plain SSH on port 22222 to connect to it. It does not have any authentication in place. Note that plain SSH is not permitted in some cases, such as when using a production image instead of a development one. As in my previous post, a production image does not enable the option of using plain SSH. Balena Thread

There are also some differences between prod and dev images for the 'balena ssh' command.

Right now I am trying to create a Dockerfile to push to the Balena machine
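Before blaming keys or image type, it can help to check whether port 22222 is reachable at all. This is a minimal sketch using bash's built-in /dev/tcp redirection; the function names are hypothetical, and the default address is simply the one quoted later in this thread.

```shell
#!/usr/bin/env bash
# Build the plain-SSH command for a development image (port 22222, root user).
local_ssh_cmd() {
  printf 'ssh -p 22222 root@%s\n' "$1"
}

# Return 0 if something is listening on host:port (bash /dev/tcp, 3 s timeout).
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

device_ip="${1:-192.168.10.157}"   # override with the device's current address
if port_open "$device_ip" 22222; then
  echo "reachable, connect with: $(local_ssh_cmd "$device_ip")"
else
  echo "port 22222 closed or unreachable on $device_ip"
fi
```

A closed port on a machine that is otherwise reachable would be consistent with a production image, though a firewall or a wrong address would look the same.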

hubdwoo commented 8 months ago

9-11-23

I tried creating a custom Dockerfile that derives its image from Ubuntu 22.04 and installs the NVIDIA drivers. It didn't work. Upon reading more from this article, I found these examples.

They are Dockerfiles that Balena has made that can run CUDA with the NVIDIA drivers. But when I tried one, it showed an error when I ran nvidia-smi.

hubdwoo commented 8 months ago

12-11-23

I have tried keeping the same BalenaOS version but changing the NVIDIA drivers to the newest release to see whether that works. It still does not work.

These errors pop up:

gpu rmmod: ERROR: Module nouveau is in use
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia.ko: Invalid module format
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-modeset.ko: Invalid module format
gpu insmod: ERROR: could not insert module /nvidia/driver/nvidia-uvm.ko: Invalid module format
gpu NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
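One common cause of "Invalid module format" is a module built against a different kernel than the one currently running. A sketch for checking that, assuming `modinfo` is available on the device; the function names are hypothetical and the module path is the one from the log above.

```shell
#!/usr/bin/env bash
# Succeed when the first field of a vermagic string equals the kernel release.
vermagic_matches() {
  local vermagic="$1" kernel="$2"
  [ "${vermagic%% *}" = "$kernel" ]
}

# Compare a .ko file's vermagic with the running kernel and report the result.
check_module() {
  local ko="$1" vermagic
  vermagic="$(modinfo -F vermagic "$ko")"
  if vermagic_matches "$vermagic" "$(uname -r)"; then
    echo "$ko matches running kernel $(uname -r)"
  else
    echo "$ko was built for: $vermagic (running: $(uname -r))"
  fi
}

# Only meaningful on the device itself, where the module file exists.
if [ -e /nvidia/driver/nvidia.ko ]; then
  check_module /nvidia/driver/nvidia.ko
fi
```

If the versions differ, the kernel source used in the container build probably does not match the OS version actually on the device.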

I tried changing the BalenaOS version to the one we currently have, together with our Balena machine name, but it still does not work. When pushing the container with the latest version of BalenaOS, it seems that the kernel source address does not exist:

[gpu] Step 13/21 : RUN curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"
[gpu] ---> Running in b392845bd4e4
[gpu] curl: (22) The requested URL returned error: 404
[gpu]
[gpu]
[gpu] gzip: stdin: unexpected end of file
[gpu]
[gpu] tar: Child returned status 1
[gpu] tar: Error is not recoverable: exiting now
[gpu]
[gpu] Removing intermediate container b392845bd4e4
[gpu] The command '/bin/bash -o pipefail -c curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"' returned a non-zero code: 2
[Info] Uploading images
[Success] Successfully uploaded images
[Error] Some services failed to build:
[Error] Service: gpu
[Error] Error: The command '/bin/bash -o pipefail -c curl -fsSL "https://files.balena-cloud.com/images/${BALENA_MACHINE_NAME}/${VERSION}/kernel_source.tar.gz" | tar xz --strip-components=2 && make -C build modules_prepare -j"$(nproc)"' returned a non-zero code: 2
[Info] Built on 9214c56
[Error] Not deploying release. Remote build failed
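Since the build fails on a 404, it may save time to check the kernel source URL before pushing. The sketch below rebuilds the URL pattern from the log; the function names are hypothetical, the machine name and version in the last line are placeholder values, and it assumes the server answers HEAD requests.

```shell
#!/usr/bin/env bash
# Build the kernel source URL using the same pattern as the Dockerfile step.
kernel_source_url() {
  local machine="$1" version="$2"
  printf 'https://files.balena-cloud.com/images/%s/%s/kernel_source.tar.gz\n' \
    "$machine" "$version"
}

# HEAD request only; -f makes curl exit non-zero on HTTP errors such as 404.
kernel_source_exists() {
  curl -fsSI "$(kernel_source_url "$1" "$2")" >/dev/null
}

# Placeholder values: substitute the real BALENA_MACHINE_NAME and VERSION.
kernel_source_url "my-machine-name" "0.0.0"
```

On a machine with network access, `kernel_source_exists "$BALENA_MACHINE_NAME" "$VERSION" && echo ok` would show whether a given combination is published before a full remote build is attempted.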

dylan-gonzalez commented 8 months ago

BalenaOS has been installed by @hubdwoo on the Hive 1

To access it remotely: Go to Balena Cloud login user: mcavvrav@gmail.com pass: Balena1&!

To access locally: ssh root@192.168.10.157 -p 22222

hubdwoo commented 8 months ago

@Jiawei-Liao @dylan-gonzalez Use this Notion page to get the private key to SSH into the hive through Balena's VPN. Apparently something is wrong with the website, which does not allow me to type in a new key.

hubdwoo commented 7 months ago

What to do next

9
Use Nvidia open source drivers
Maybe pass in nomodeset parameter to the kernel

hubdwoo commented 7 months ago

7-12-23

Using the open-source NVIDIA drivers works by:

Building the modules manually
Unloading the Nouveau driver
Loading the NVIDIA drivers
Only the CLI works
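The steps above can be sketched roughly as follows. This is an outline under assumptions, not the exact procedure used here: it assumes a checkout of NVIDIA's open-gpu-kernel-modules at a hypothetical path, and it only prints the commands unless APPLY=1 is set (running them needs root).

```shell
#!/usr/bin/env bash
# Dry-run helper: print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

load_open_nvidia() {
  local src="$1"   # hypothetical path to the open-gpu-kernel-modules checkout
  run make -C "$src" modules -j"$(nproc)"          # build the modules
  run make -C "$src" modules_install -j"$(nproc)"  # install them for this kernel
  run depmod -a                                    # refresh module dependencies
  run rmmod nouveau                                # fails while nouveau is in use
  run modprobe nvidia                              # load the NVIDIA module
  run nvidia-smi                                   # confirm the driver responds
}

load_open_nvidia /usr/src/open-gpu-kernel-modules
```

If `rmmod nouveau` reports that the module is in use, whatever holds it (typically a running display server or the framebuffer console) has to be stopped first, or nouveau has to be blacklisted so it never loads at boot.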

What next:

Try starting an X server with the desktop environment, etc.
Check the hardware: the nvidia-smi output contains an "N/A", which may mean the driver cannot fully talk to the GPU, so there might be compatibility issues with the motherboard

hubdwoo commented 7 months ago

12-12-23

Nvidia thread on XInit error message

Download the .run file from the NVIDIA website
Run the .run file with this option: sudo ./NVIDIA-Linux-<arch>-<driver version>.run -m=kernel-open
Run sudo update-initramfs -u
Then reboot

The -m flag specifies the directory of kernel module sources to build from. At the time of writing there are only two: the open-source one and the normal proprietary one. Specifying -m=kernel-open uses the open-source kernel modules.

The NVIDIA .run file is a shell script with compressed data appended to it: the kernel modules themselves are embedded in it in binary form (or maybe I don't have the right decoder). Once it's executed, it unpacks the binary and runs the installer.
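To illustrate the idea of a shell script with a binary payload glued to its end, here is a toy self-extracting archive. It is not NVIDIA's actual format, only a sketch of the general technique; the marker string and file names are invented.

```shell
#!/usr/bin/env bash
set -euo pipefail
work="$(mktemp -d)"
mkdir "$work/payload"
echo "pretend kernel module" > "$work/payload/demo.ko"

# Shell header: find the line after the marker and untar everything below it.
cat > "$work/demo.run" <<'EOF'
#!/bin/sh
start=$(awk '/^__ARCHIVE_BELOW__$/ { print NR + 1; exit }' "$0")
mkdir -p "$1"
tail -n +"$start" "$0" | tar -xzf - -C "$1"
exit 0
__ARCHIVE_BELOW__
EOF

# Append the compressed payload after the marker line.
tar -czf - -C "$work/payload" . >> "$work/demo.run"

# Running the script unpacks its own tail into the given directory.
sh "$work/demo.run" "$work/out"
cat "$work/out/demo.ko"
```

Opening `demo.run` in a text editor shows readable shell followed by unreadable bytes, which matches what the NVIDIA file looks like. The real installer also accepts `--extract-only` to unpack its contents without installing anything.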

hubdwoo commented 7 months ago

12-2-23

I am going to close this issue since it is fixed. On 14-2-23 I will install CUDA and run a CUDA C program to test it. If there are any issues with that, I will reopen the issue.

hubdwoo commented 7 months ago

14-2-23

CUDA installed & I have tried running a CUDA program and it works.
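The thread does not say which program was run. As a stand-in, a smoke test of this kind could look like the sketch below, which writes a small vector-add program and compiles it only if nvcc is on the PATH. The file name and kernel are illustrative.

```shell
#!/usr/bin/env bash
src="$(mktemp -d)/vector_add.cu"

cat > "$src" <<'EOF'
#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    // Unified memory keeps the example short: no explicit host/device copies.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("c[0] = %.1f (expected 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
EOF

if command -v nvcc >/dev/null 2>&1; then
  nvcc -o "${src%.cu}" "$src" && "${src%.cu}"
else
  echo "nvcc not found; wrote $src but skipped compiling"
fi
```

A driver that loads but cannot run kernels would show up here as a CUDA error rather than a result, which makes this a slightly stronger check than nvidia-smi alone.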

Monash-Connected-Autonomous-Vehicle / ENV200-Shadow-Repo

Drivers Fix #7
