NVIDIA-ISAAC-ROS / isaac_ros_common

Common utilities, packages, scripts, Dockerfiles, and testing infrastructure for Isaac ROS packages.
https://developer.nvidia.com/isaac-ros-gems

cuda / gpu not available (agx orin) in docker container #29

Closed bblumberg closed 2 years ago

bblumberg commented 2 years ago

Before I get to what I ran into, let me say that this is going to be an awesome framework! Congrats and thank you. I was able to install isaac_ros_common and the additional Docker container on my Linux workstation; the process was quite straightforward and I had no issues.

However, when I repeated the process on my AGX Orin (latest software), the base isaac_ros_common container does not seem to have access to the GPU/CUDA.

For example:

```
admin@agx-orin:/workspaces/isaac_ros-dev$ python3
Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.12.0
>>> torch.cuda.is_available()
False
```

I have another Docker container, previously built from nvcr.io/nvidia/l4t-pytorch:r34.1.0-pth1.12-py3, in which CUDA and the GPU are available. I didn't notice any significant difference in the arguments passed by run_dev.sh, which suggests that everything needed to use the GPU in a container is set up correctly on the host:

```
Python 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.12.0a0+2c916ef.nv22.3
>>> torch.cuda.is_available()
True
```
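The two version strings above already hint at the cause: the upstream wheel reports a plain `1.12.0`, while NVIDIA's Jetson build carries an `.nv` local version tag. A minimal sketch of that distinction (the helper name is hypothetical, not part of any project here):

```python
def looks_like_jetson_wheel(version: str) -> bool:
    """Heuristic: NVIDIA's Jetson PyTorch wheels carry an '.nv' local
    version tag (e.g. '1.12.0a0+2c916ef.nv22.3'); upstream wheels do not."""
    local_tag = version.partition("+")[2]  # local version segment after '+'
    return ".nv" in local_tag

# The two builds from the sessions above:
print(looks_like_jetson_wheel("1.12.0"))                   # upstream wheel -> False
print(looks_like_jetson_wheel("1.12.0a0+2c916ef.nv22.3"))  # NVIDIA Jetson wheel -> True
```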

So I am at a loss to understand what the issue is. Here are things that I have checked:

  1. All software on the Orin is up to date.
  2. I followed the setup instructions, so the NVIDIA Container Toolkit and related components are installed at the correct versions.

If there is additional information I can provide, please let me know. Thanks for your help.

bb

bblumberg commented 2 years ago

Here is what I discovered:

  1. From some quick testing, the version of PyTorch available from the PyTorch website, which is used when building the aarch64 container, does not appear to be compatible with the L4T CUDA container or the L4T Base container (even when pulling in the host CUDA libraries).
  2. However, when I used the build available from NVIDIA (https://docs.nvidia.com/deeplearning/frameworks/install-pytorch-jetson-platform/index.html), everything worked (well, apart from a known torchvision issue). So perhaps this is the version that should be used for Jetson builds?
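The architecture-dependent wheel selection suggested above could be sketched roughly as follows. This is a hypothetical snippet, not the project's actual build logic; `TORCH_WHEEL` is a placeholder for the release-specific wheel URL from the install guide linked above, and the install commands are echoed rather than executed for illustration:

```shell
#!/bin/sh
# Hypothetical sketch: on aarch64/L4T, prefer the NVIDIA-built Jetson wheel
# over the upstream PyPI wheel. The real wheel URL is release-specific; see
# the "Installing PyTorch for Jetson Platform" guide linked above.
TORCH_WHEEL="${TORCH_WHEEL:-WHEEL_URL_FROM_JETSON_GUIDE}"

if [ "$(uname -m)" = "aarch64" ]; then
    # NVIDIA Jetson wheel: built against L4T CUDA (reports e.g. 1.12.0a0+...nv22.3)
    echo "pip3 install --no-cache-dir ${TORCH_WHEEL}"
else
    # x86_64: the upstream wheel works with the NVIDIA DL Frameworks base image
    echo "pip3 install torch==1.12.0"
fi
```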

bb

hemalshahNV commented 2 years ago

Thanks @bblumberg for your detailed conclusions! We use the NVIDIA Deep Learning Frameworks base images on x86 (with PyTorch already installed), but the public PyTorch debians for L4T. We'll take a look at this and straighten it out as soon as we can.

hemalshahNV commented 2 years ago

PyTorch with CUDA should now be available in the new Isaac ROS Dev base images released with Isaac ROS DP1.1. torchvision wasn't updated in time, however.