Qengineering / Jetson-Nano-Ubuntu-20-image

Jetson Nano with Ubuntu 20.04 image
https://qengineering.eu/install-ubuntu-20.04-on-jetson-nano.html
BSD 3-Clause "New" or "Revised" License

CUDA Installation failed in bare-bones Ubuntu 20.04. See log at /var/log/cuda-installer.log for details. #57

Open VbsmRobotic opened 10 months ago

VbsmRobotic commented 10 months ago

Hello everyone,

I'm having trouble installing CUDA on the bare-bones Ubuntu 20.04 image. Has anyone come across similar issues? Here's the process I've been following:

1- Download the CUDA installer using the following command:
    $ wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda_11.6.2_510.47.03_linux_sbsa.run
2- Run the installer with elevated privileges:
    $ sudo sh cuda_11.6.2_510.47.03_linux_sbsa.run

Unfortunately, the installation failed, and I'm advised to check the log at /var/log/cuda-installer.log for more details. Any insights or solutions would be greatly appreciated.

[Screenshots: CUDA-Driver, error]

    $ cat /var/log/cuda-installer.log
    INFO: Driver not installed.
    INFO: Checking compiler version...
    INFO: gcc location: /usr/bin/gcc
    INFO: gcc version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)
    INFO: Initializing menu
    INFO: Setup complete
    INFO: Components to install:
    INFO: Executing NVIDIA-Linux-aarch64-510.47.03.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
    INFO: Finished with code: 36096
    [ERROR]: Install of driver component failed.
    [ERROR]: Install of 510.47.03 failed, quitting

Qengineering commented 10 months ago

Sorry, you cannot install CUDA 11 on a Jetson Nano, due to low-level incompatibility. The 'regular' CUDA version is 10, and it is already installed, so there is no need to use the CUDA installer. This assumes we are talking about the 'old' Jetson Nano, not the Orin.
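
Before reaching for an installer, it can help to confirm that a toolkit is already on the image. The sketch below only looks for nvcc on disk. It assumes the usual JetPack location /usr/local/cuda, and the helper name find_nvcc is made up for illustration. As far as I know, the linux_sbsa runfile used in the original post targets Arm server platforms rather than Jetson boards, which would be consistent with the driver step failing.

```shell
# find_nvcc is a hypothetical helper: print the path to nvcc if it exists
# under the given CUDA directory (default: /usr/local/cuda, an assumption).
find_nvcc() {
  cuda_dir="${1:-/usr/local/cuda}"
  if [ -x "$cuda_dir/bin/nvcc" ]; then
    echo "$cuda_dir/bin/nvcc"
  else
    return 1
  fi
}

# Report the toolkit version if nvcc is present, otherwise say so.
if nvcc_path="$(find_nvcc)"; then
  "$nvcc_path" --version
else
  echo "nvcc not found under /usr/local/cuda"
fi
```

If nvcc is found this way but `nvcc --version` still fails in a normal shell, the toolkit is installed and only the PATH is missing.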

VbsmRobotic commented 10 months ago

Thank you for your prompt response. I appreciate the clarification about CUDA compatibility on the Jetson Nano. However, when I check the CUDA version using nvcc --version, it seems that I can't find the installed CUDA version. Could you kindly provide guidance on how to resolve this issue? Thank you.

[Screenshot: nvcc_V]

Qengineering commented 10 months ago

nvcc should be located in the folder /usr/local/cuda/bin/. Please add that location to your PATH.
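
A minimal sketch of the PATH change, for readers who want the exact lines. It assumes the toolkit lives in /usr/local/cuda and that the runtime libraries sit in its lib64 subdirectory. Both are the usual defaults, not something confirmed in this thread, so adjust CUDA_HOME if your layout differs.

```shell
# Assumed default CUDA location on the image; override CUDA_HOME if needed.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

# Prepend the CUDA bin directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$CUDA_HOME/bin:"*) ;;
  *) PATH="$CUDA_HOME/bin:$PATH" ;;
esac
export PATH CUDA_HOME

# Same idea for the CUDA runtime libraries (lib64 is an assumption).
case ":${LD_LIBRARY_PATH:-}:" in
  *":$CUDA_HOME/lib64:"*) ;;
  *) LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac
export LD_LIBRARY_PATH
```

To make the change permanent, append these lines to ~/.bashrc, open a new terminal, and run `nvcc --version` again. The `case` guards keep the directories from being added twice when the file is sourced repeatedly.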

VbsmRobotic commented 10 months ago

Thank you for your generous help. I've successfully incorporated the changes into the bashrc file and verified the CUDA version is now visible.

VbsmRobotic commented 10 months ago

Hello, I am reaching out for guidance based on the information provided in the following link: https://forums.developer.nvidia.com/t/pytorch-for-jetson/72048.

To install PyTorch on my Jetson Nano, I've created a virtual environment using Python 3.6, as specified in the requirements for JetPack 4. However, during the installation process, I encountered the following error:

    (py_env) jetson@nano:~$ python3
    Python 3.6.15 (default, Nov 15 2023, 11:27:50)
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 195, in <module>
        _load_global_deps()
      File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 148, in _load_global_deps
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
      File "/usr/local/lib/python3.6/ctypes/__init__.py", line 348, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: libmpi_cxx.so.20: cannot open shared object file: No such file or directory

I would greatly appreciate your advice on resolving this issue. Your assistance is invaluable to me at this stage.

Thank you in advance

Qengineering commented 10 months ago

Tip: ask ChatGPT. It can give valuable answers. In your case:

The error you're encountering indicates that the libmpi_cxx.so.20 shared library cannot be found. This library belongs to an implementation of the Message Passing Interface (MPI). It seems like there might be an issue with your MPI installation or the environment variables related to it.

Here are a few steps you can take to address this issue:

  1. Check MPI Installation: Make sure that MPI is correctly installed on your system. You may need to reinstall MPI or ensure that the required libraries are available. On a Debian-based system, you can use the following command to install MPI:

    sudo apt-get install libopenmpi-dev

    If you are using a different package manager or operating system, adjust the command accordingly.

  2. Set Environment Variable: If MPI is correctly installed, you may need to set the LD_LIBRARY_PATH environment variable to include the directory where libmpi_cxx.so.20 is located. You can do this by adding the following line to your shell profile file (e.g., ~/.bashrc or ~/.bash_profile):

    export LD_LIBRARY_PATH=/path/to/mpi/lib:$LD_LIBRARY_PATH

    Replace /path/to/mpi/lib with the actual path to the directory containing the MPI libraries.

  3. Reinstall PyTorch: If you installed PyTorch inside a virtual environment, consider deactivating the environment, reactivating it, and reinstalling PyTorch. This can sometimes resolve compatibility issues:

    deactivate
    source py_env/bin/activate
    pip install torch

    Make sure to replace py_env with the actual name of your virtual environment.

  4. Update PyTorch: Ensure that you are using the latest version of PyTorch. You can upgrade PyTorch using the following command:

    pip install --upgrade torch

    This will install the latest version of PyTorch and its dependencies.

After performing these steps, try running your Python script again. If the issue persists, there may be other system-specific factors at play, and additional troubleshooting may be needed.
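
Before working through the steps above, it may be worth confirming exactly which shared library the loader cannot resolve. The sketch below wraps `ldd` for that purpose. The helper name missing_libs and the TORCH_LIB variable are hypothetical, and the default path to libtorch_global_deps.so is an assumption about where the virtual environment lives, so point it at your own site-packages directory.

```shell
# missing_libs is a hypothetical helper: list the shared-library
# dependencies of a file that the dynamic linker cannot resolve.
missing_libs() {
  # ldd prints lines like "libfoo.so.1 => not found" for unresolved entries.
  ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}

# Assumed location of the library that torch loads first; adjust the path
# to your own site-packages directory or set TORCH_LIB beforehand.
torch_lib="${TORCH_LIB:-$HOME/py_env/lib/python3.6/site-packages/torch/lib/libtorch_global_deps.so}"
if [ -e "$torch_lib" ]; then
  missing_libs "$torch_lib"
else
  echo "set TORCH_LIB to the path of libtorch_global_deps.so"
fi
```

Each name printed is a library version the wheel was linked against but the system does not provide under that exact name.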

VbsmRobotic commented 10 months ago

Thank you for your message. I have successfully set the environment variable. To locate libmpi, I used the following command:

    $ find / -name libmpi_cxx* 2>/dev/null
    /usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi_cxx.so.40.20.1
    /usr/lib/aarch64-linux-gnu/openmpi/lib/libmpi_cxx.so
    /usr/lib/aarch64-linux-gnu/libmpi_cxx.so.40.20.1
    /usr/lib/aarch64-linux-gnu/libmpi_cxx.so
    /usr/lib/aarch64-linux-gnu/libmpi_cxx.so.40

Additionally, I've added the following line to the bashrc file to address the PyTorch installation issue:

    export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH

Following the instructions from this link, I executed the following commands for PyTorch installation:

    $ wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
    $ sudo apt-get install python3-pip libopenblas-base libopenmpi-dev libomp-dev
    $ pip3 install 'Cython<3'
    $ pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl

The installation was successful with the following packages installed:

    Successfully installed Cython-0.29.36
    Successfully installed dataclasses-0.8 numpy-1.19.5 torch-1.8.0 typing-extensions-4.1.1

However, I encountered an issue when attempting to install torchvision. Following the instructions here, I executed the following commands:

    $ sudo apt-get install libjpeg-dev zlib1g-dev libpython3-dev libopenblas-dev libavcodec-dev libavformat-dev libswscale-dev
    $ git clone --branch v0.9.0 https://github.com/pytorch/vision torchvision
    $ cd torchvision
    $ export BUILD_VERSION=0.9.0
    $ python3 setup.py install --user

Unfortunately, I encountered the same error:

    (py_env) jetson@nano:~/vahid_ws/Jetson-Nano-OCR-Detection/PyTorchJetson_JetPack4/torchvision$ python3 setup.py install --user
    Traceback (most recent call last):
      File "setup.py", line 12, in <module>
        import torch
      File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 195, in <module>
        _load_global_deps()
      File "/home/jetson/vahid_ws/Jetson-Nano-OCR-Detection/build/config_virtualenv/py_env/lib/python3.6/site-packages/torch/__init__.py", line 148, in _load_global_deps
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
      File "/usr/local/lib/python3.6/ctypes/__init__.py", line 348, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: libmpi_cxx.so.20: cannot open shared object file: No such file or directory

I appreciate your assistance in resolving this issue. Any advice you can provide would be invaluable at this stage. Thank you in advance.
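
One plausible reading of the repeated error: the dynamic loader looks for the exact name libmpi_cxx.so.20, while the files found on this system are all from the .so.40 series. Setting LD_LIBRARY_PATH only changes where the loader searches, not which name it searches for, so the lookup would still fail. The sketch below just compares the name from the OSError with the file names from the `find` output in this thread. Nothing in it queries the real system.

```shell
# Soname requested by the torch wheel, taken from the OSError in this thread.
needed="libmpi_cxx.so.20"

# File names reported by `find` in this thread.
installed="libmpi_cxx.so.40.20.1 libmpi_cxx.so libmpi_cxx.so.40"

# Look for an exact name match, which is what the loader requires.
match=no
for name in $installed; do
  if [ "$name" = "$needed" ]; then
    match=yes
  fi
done

echo "needed=$needed present=$match"
# prints: needed=libmpi_cxx.so.20 present=no
```

If that reading is right, the wheel was probably built against an older Open MPI release than the one Ubuntu 20.04 ships, and a wheel built against the installed version is likely what is needed. On a live system, `ldconfig -p | grep libmpi_cxx` shows which names the loader actually knows about.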

KalanaRatnayake commented 4 months ago

> Thank you for your generous help. I've successfully incorporated the changes into the bashrc file and verified the CUDA version is now visible.

Can you share which commands you used? I'm facing the same issue.