dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Question: CUDA version support #258

Open rahulswa08 opened 1 year ago

rahulswa08 commented 1 year ago

Hi @dusty-nv, I'm currently using the ros:noetic-pytorch-l4t-r34.1.1 base image on a Jetson AGX Orin 32GB with CUDA 11.4 installed. However, I need CUDA 11.8 in my Docker container. Do I need to upgrade CUDA on the Jetson itself, or can I upgrade to CUDA 11.8 inside this image?

dusty-nv commented 1 year ago

Hi @rahulswa08, on JetPack 5, CUDA/cuDNN/TensorRT/etc. are installed inside the container (unlike JetPack 4, where they get mounted into the container from the host device by the NVIDIA runtime). So you would just perform the upgrade inside the container. I've not tried changing the CUDA version before though.
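
To see which toolkit version a given environment actually provides, one option is to parse the output of nvcc --version, once on the host and once inside the container. The following is a minimal sketch; the function names are made up for illustration, and it assumes nvcc is on PATH (on Jetson it usually lives under /usr/local/cuda/bin, which may need to be added first).

```shell
#!/bin/sh
# Extract the "release X.Y" number from `nvcc --version` output on stdin.
# (parse_cuda_release is an illustrative helper name, not an NVIDIA tool.)
parse_cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Report the toolkit version visible in the current environment.
# Run it on the host and again inside the container to compare the two.
report_cuda_release() {
    if command -v nvcc >/dev/null 2>&1; then
        nvcc --version | parse_cuda_release
    else
        echo "nvcc not found on PATH" >&2
        return 1
    fi
}
```

Calling report_cuda_release in both places shows whether the container carries its own toolkit, as described for JetPack 5.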

rahulswa08 commented 1 year ago

Thanks @dusty-nv. Since the Docker image has its own CUDA, I tried upgrading CUDA inside the container using the instructions provided here.

Please ensure your device is configured per the [CUDA Tegra Setup Documentation](https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#upgradable-package-for-jetson).
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/arm64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
sudo dpkg -i cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
sudo cp /var/cuda-tegra-repo-ubuntu2004-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

But when I perform the update, I hit the following error at the last step, sudo apt-get -y install cuda:

The following packages have unmet dependencies:
cuda : Depends: cuda-11.8 (>= 11.8) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I'm not sure why I'm seeing this. I can perform the upgrade on the Jetson itself by following the same steps, but not inside the Docker container.

Am I doing anything wrong here, or is this a limitation?

Could you help me solve this issue?

Thanks!!

dusty-nv commented 1 year ago

@rahulswa08 can you try installing the cuda-11.8 package instead of cuda? Or maybe try the --only-upgrade flag to apt-get? I haven't upgraded CUDA before in the containers.

rahulswa08 commented 1 year ago

I have tried installing cuda-11.8, but it leads to another unmet dependency, and that one leads to another. I'm unable to resolve them by installing each one in turn. I haven't tried the --only-upgrade option.

dusty-nv commented 1 year ago

If --only-upgrade doesn't work and you are unable to resolve the dependencies, you could try uninstalling the previous CUDA from the container first. Or it may be cleaner for you just to start with l4t-base, then install your desired CUDA Toolkit/etc. on top of that, then PyTorch and so on.

hillct commented 1 year ago

I've encountered the same issues, starting from each of: nvcr.io/nvidia/l4t-cuda:11.4.19-devel, nvcr.io/nvidia/l4t-cuda:11.4.19-runtime, nvcr.io/nvidia/l4t-base:35.4.1, nvcr.io/nvidia/l4t-base:35.3.1 and nvcr.io/nvidia/l4t-base:35.2.1, when following the documented procedure found here: https://developer.nvidia.com/cuda-11-8-0-download-archive?target_os=Linux&target_arch=aarch64-jetson&Compilation=Native&Distribution=Ubuntu&target_version=20.04&target_type=deb_local

Having tested both the network and local repo methodologies, the network repo seems to be targeted toward the multi-platform CUDA images (for example https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags), as everything is cross-dependent on cuda-12.2 packages (essentially a documentation issue for the above webpage). But when pinned to CUDA 11.8, the behavior is the same as with the local repo methodology, wherein you get circular dependencies among the various CUDA packages at 11.8. So far I've not tested the various force or ignore-dependencies approaches, as they would inevitably lead to unstable images. Certainly the preferred approach would be to resolve the underlying circular dependency issue.

hillct commented 1 year ago

As it turns out, the dependency tree ends at the unresolvable dependency on nvidia-l4t-core, which is a board support package meant for the host hardware, not containers. The dependency itself seems to be a holdover from the JetPack 4.5.x days, when CUDA was meant to run outside the containers. The issue might be resolvable by correcting and rebuilding cuda-compat-11-8.

For reference, the (consolidated) tree looks like this:

# apt-get install cuda-11.8
 cuda-11-8 : Depends: cuda-runtime-11-8 (>= 11.8.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
# apt-get install cuda-runtime-11-8
cuda-runtime-11-8 : Depends: cuda-compat-11-8 (>= 11.8.31339915) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
# apt-get install cuda-compat-11.8
 cuda-compat-11-8 : PreDepends: nvidia-l4t-core but it is not installable
E: Unable to correct problems, you have held broken packages.

Further discussion of this issue related to nvidia-l4t-core (while not directly on point) can be found here: https://forums.developer.nvidia.com/t/installing-nvidia-l4t-core-package-in-a-docker-layer/153412
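
The manual walk through the dependency chain above could also be scripted. This is only a sketch: blocking_deps and trace_blockers are hypothetical helper names, the parsing assumes apt keeps the "Depends:" / "PreDepends:" wording shown in the output above, and apt-get install -s only simulates the install, so nothing is changed.

```shell
#!/bin/sh
# Print the package named after "Depends:" or "PreDepends:" in apt's
# unmet-dependency lines read from stdin.
blocking_deps() {
    sed -n 's/.*[Dd]epends: \([^ ]*\).*/\1/p'
}

# Follow the chain from a starting package until apt reports no further
# blocker. Capped at 10 hops so a circular dependency cannot loop forever.
trace_blockers() {
    pkg="$1"
    depth=0
    while [ -n "$pkg" ] && [ "$depth" -lt 10 ]; do
        echo "$pkg"
        pkg="$(apt-get install -s "$pkg" 2>&1 | blocking_deps | head -n 1)"
        depth=$((depth + 1))
    done
}
```

Inside the container, trace_blockers cuda-11-8 would be expected to print one package per line, with the last line being the package apt cannot install.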

johnnynunez commented 1 year ago

https://hackmd.io/ZmWQz8azTdWNVoCc9Bf3QA If not, wait for JetPack 6 at the end of the month.

hillct commented 1 year ago

https://hackmd.io/ZmWQz8azTdWNVoCc9Bf3QA

@johnnynunez Congratulations on your article, but it doesn't seem to address the issue at hand - that being deploying CUDA 11.8 INSIDE a container.

If not, wait for JetPack 6 at the end of the month.

I'm also a bit baffled by your assertion that the release of Jetpack 6 might include recompilation and correction of the dependency flaw, especially since no such recompilation was completed as part of the Jetpack 5.x roadmap. If you have information that this differs for the 6.0 release, please share that documented roadmap.

johnnynunez commented 1 year ago

Only @dusty-nv or @tokk-nv can confirm some things here.

  1. The problem is that the driver is old.
  2. CUDA 12.3 is not compatible with Jetson.
  3. Even if you upgrade, the problem still exists, because libraries like cuDNN are proprietary and can only be downloaded pre-compiled, and there are no download URLs for the latest cuDNN version for Jetson.

So we can only wait for JetPack 6.0, because:

  1. Linux is decoupled from JetPack (you can install any distro).
  2. You can install any Linux kernel.
  3. JetPack 6 comes with CUDA 12.2.

I do not work at NVIDIA, but I think NVIDIA's idea is to treat the Jetson as if it were a regular GPU, able to use the open DKMS kernel modules and to get up-to-date prebuilt cuDNN and other libraries, as other devices such as Grace Hopper (ARM-based) do.

dusty-nv commented 1 year ago

@johnnynunez @hillct here is another thread to keep an eye on: https://forums.developer.nvidia.com/t/use-cuda-12-2-in-a-container/271600

dusty-nv commented 1 year ago

OK, I found a workaround for this by manually extracting the cuda-compat deb inside the container, and then installing the cuda-toolkit or cuda-libraries package instead (only cuda and cuda-runtime depend on cuda-compat/nvidia-l4t-core):

#
# sudo docker build --network=host --tag cuda:12.2 .
# sudo docker run --runtime nvidia -it --rm --network host cuda:12.2 cuda-samples/bin/aarch64/linux/release/deviceQuery
#
FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
            wget \
            git \
            binutils \
            xz-utils \
            ca-certificates \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# download the CUDA Toolkit local installer
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/arm64/cuda-ubuntu2004.pin -O /etc/apt/preferences.d/cuda-repository-pin-600 && \
    wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda-tegra-repo-ubuntu2004-12-2-local_12.2.2-1_arm64.deb && \
    dpkg -i cuda-tegra-repo-*.deb && \
    rm cuda-tegra-repo-*.deb 

# add the signed keys
RUN cp /var/cuda-tegra-repo-*/cuda-tegra-*-keyring.gpg /usr/share/keyrings/

# manually extract cuda-compat
RUN mkdir /var/cuda-compat && \
    cd /var/cuda-compat && \
    ar x ../cuda-tegra-repo-*/cuda-compat-*.deb && \
    tar xvf data.tar.xz -C / && \
    rm -rf /var/cuda-compat

# install cuda-toolkit (doesn't depend on cuda-compat/nvidia-l4t-core)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
            cuda-toolkit-* \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# environment variables 
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=all

ENV CUDA_HOME="/usr/local/cuda"
ENV PATH="/usr/local/cuda/bin:${PATH}"
ENV LD_LIBRARY_PATH="/usr/local/cuda/compat:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"

# build cuda samples
RUN git clone --branch=v12.2 https://github.com/NVIDIA/cuda-samples && \
    cd cuda-samples/Samples/1_Utilities/deviceQuery && \
    make

WORKDIR /

Tried this on a board running JetPack 5.1.2 / L4T R35.4.1, which did not have CUDA 12.2 installed outside the container - and it worked (YMMV)
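
The Dockerfile above lists /usr/local/cuda/compat first in LD_LIBRARY_PATH. To check which directory on that path would supply libcuda in a running container, something like the following sketch may help. first_libcuda_dir is a hypothetical helper, and it assumes the extracted cuda-compat package places a libcuda.so* file in the compat directory.

```shell
#!/bin/sh
# Print the first directory in a colon-separated search path that holds a
# libcuda.so* file. Uses the current LD_LIBRARY_PATH when no argument is given.
first_libcuda_dir() {
    search_path="${1:-${LD_LIBRARY_PATH:-}}"
    old_ifs="$IFS"
    IFS=:
    for dir in $search_path; do
        IFS="$old_ifs"
        for f in "$dir"/libcuda.so*; do
            if [ -e "$f" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    IFS="$old_ifs"
    return 1
}
```

Run with no arguments inside the container; if the compat directory is not the one printed, a different libcuda is ahead of it on the search path. This only inspects LD_LIBRARY_PATH and does not cover libraries found through the system linker cache.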

0Unkn0wn commented 1 year ago

Thank you very much @dusty-nv it worked on the first try, and no problems were encountered.

hillct commented 1 year ago

Just for completeness, in case others come across this issue in the future, the alternate approach is to force the installation of the dependency as in this example. It should be noted you can specify CUDA=11-8 or CUDA=12-2 to get the desired results at build time.

ARG BASE_IMAGE=nvcr.io/nvidia/l4t-base:35.3.1
FROM ${BASE_IMAGE} as base
ARG DEBIAN_FRONTEND=noninteractive
ARG sm=87
# skip setting this if you want to enable OpenMPI backend
ARG USE_DISTRIBUTED=1
ARG USE_QNNPACK=0
ARG CUDA=11-8
# nvidia-l4t-core is a dependency for the rest
# of the packages, and is designed to be installed directly
# on the target device. This because it parses /proc/device-tree
# in the deb's .preinst script. Looks like we can bypass it though:
RUN \
    echo "deb https://repo.download.nvidia.com/jetson/common r35.3 main" >> /etc/apt/sources.list && \
    echo "deb https://repo.download.nvidia.com/jetson/t194 r35.3 main" >> /etc/apt/sources.list && \
    apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc && \
    mkdir -p /opt/nvidia/l4t-packages/ && \
    touch /opt/nvidia/l4t-packages/.nv-l4t-disable-boot-fw-update-in-preinstall && \
    rm -f /etc/ld.so.conf.d/nvidia-tegra.conf && apt-get update && \
    apt-get install -y --no-install-recommends nvidia-l4t-core && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/arm64/cuda-keyring_1.0-1_all.deb && \
    dpkg -i cuda-keyring_1.0-1_all.deb && apt-get update && apt-get install -y --no-install-recommends cuda-${CUDA} && \
    apt-get -y upgrade &&  apt-get clean && rm -rf /var/lib/apt/lists/* cuda-keyring_1.0-1_all.deb

I've not yet done a comparison of the final images, but given the methodology, it's likely that @dusty-nv's approach would be slower to build (owing to the large download requirement) but of similar final size.
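
For anyone who wants to make that size comparison, a rough sketch follows. It relies on docker image inspect with a --format template to read the image size in bytes. The helper names are illustrative, and the second image tag in the usage note is a placeholder, not something built in this thread.

```shell
#!/bin/sh
# Convert a byte count to whole MiB (integer division).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print the size of a local Docker image in MiB.
image_size_mib() {
    bytes_to_mib "$(docker image inspect --format '{{.Size}}' "$1")"
}
```

For example, image_size_mib cuda:12.2 next to image_size_mib my-forced-cuda:11-8 (hypothetical tag) would give two numbers to compare once both images are built.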

Vektor284 commented 1 month ago

Hello,

I have a board running JetPack 5.1.4 / L4T R35.4.1. I am working on a project that requires Python 3.9 and CUDA 12.2. I can get @dusty-nv's solution working and I can get PyTorch installed. However, when I check for the presence of the GPU using torch.cuda.is_available(), it returns None. The same is true when the setup script checks for $CUDA_HOME. Some of my dependencies require these to compile.

So far, my steps have been to create the container image as indicated by dusty and then use this image to create the container with Python 3.9 and the rest of my project.

Any help or advice is greatly appreciated.
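
A few checks that might help narrow this down, offered as a sketch and not a diagnosis. check_cuda_home and report_torch_cuda are hypothetical helper names. The first confirms that CUDA_HOME points at a directory containing bin/nvcc in the final container. The second prints torch.version.cuda, which is None for CPU-only PyTorch builds, alongside torch.cuda.is_available().

```shell
#!/bin/sh
# Succeed only if the given directory (default: $CUDA_HOME) contains an
# executable bin/nvcc; print the directory on success.
check_cuda_home() {
    home="${1:-${CUDA_HOME:-}}"
    if [ -z "$home" ]; then
        echo "CUDA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$home/bin/nvcc" ]; then
        echo "no nvcc under $home/bin" >&2
        return 1
    fi
    echo "$home"
}

# Print the CUDA version PyTorch was built against and whether it sees a GPU.
# Assumes python3 is the interpreter that has torch installed.
report_torch_cuda() {
    python3 -c 'import torch; print(torch.version.cuda, torch.cuda.is_available())'
}
```

Both should be run inside the final project container, started with --runtime nvidia as in the run command from the Dockerfile earlier in the thread.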