dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

High level guidance/status on CUDA 12.6 and L4T 36.4.0 #663

Open TangmereCottage opened 1 month ago

TangmereCottage commented 1 month ago

Struggling here with NanoLLM, MLC LLM, torch, and torchvision on CUDA 12.6 and L4T 36.4.0.

Ask: I would be grateful for high-level status info - will CUDA 12.6 be broadly supported soon, or should I downgrade to 12.2 (for example) and wait a few weeks for the dust to settle?

Main issue - trying out various "hello world" type commands across jetson-containers results in error messages and little else.

BTW thanks @dusty-nv for everything you are doing and this heroic effort, especially during the CUDA 12.2->12.6 transition. Seems like step 1 is to:

$ git pull # to get all the most recent bugfixes
$ CUDA_VERSION=12.6 jetson-containers build transformers
# wait an hour
Done building container transformers:r36.4.0-cu126
# woohoo!

$ jetson-containers run $(autotag l4t-pytorch)
Namespace(packages=['l4t-pytorch'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.4.0  JETPACK_VERSION=6.1  CUDA_VERSION=12.6
-- Finding compatible container image for ['l4t-pytorch']
[sudo] password: 
Found compatible container dustynv/l4t-pytorch:r36.4.0 (2024-09-30, 6.3GB) - would you like to pull it? [Y/n] Y
dustynv/l4t-pytorch:r36.4.0
V4L2_DEVICES: 
csi_indexes: 
localuser:root being added to access control list
+ docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/jan/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-3 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-6 --device /dev/i2c-7 --device /dev/i2c-8 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock --name my_jetson_container dustynv/l4t-pytorch:r36.4.0

# dies with 
docker: unknown server OS: .

Additional info - this is a super-boring, no-modifications, bare-metal fresh install using the NVIDIA SDK, with all Docker data moved to the SSD (mnt/docker) per your instructions.

-- L4T_VERSION=36.4.0
-- JETPACK_VERSION=6.1
-- CUDA_VERSION=12.6
-- PYTHON_VERSION=3.10
-- LSB_RELEASE=22.04 (jammy)
TangmereCottage commented 1 month ago

Ah, so, the docker: unknown server OS: . error was my fault - I forgot to run newgrp docker after sudo usermod -aG docker $USER. After that, jetson-containers run $(autotag l4t-pytorch) works on 12.6! Yay.
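
For reference, the full group setup is just the standard Docker post-install steps (the docker info call is only a sanity check):

$ sudo usermod -aG docker $USER   # add your user to the docker group
$ newgrp docker                   # pick up the new group in the current shell
$ docker info                     # should now work without sudo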

High-level skinny for everyone:

$ git pull # to get all the most recent bugfixes
$ CUDA_VERSION=12.6 jetson-containers build transformers
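
Then a quick smoke test of the freshly built image (tag taken from the build output above - adjust if yours differs):

$ jetson-containers run transformers:r36.4.0-cu126 \
    python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"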
dusty-nv commented 1 month ago

Hi @TangmereCottage, for ROS I pushed dustynv/ros:humble-desktop-l4t-r36.4.0 a while ago, and the ROS builds seem to be working fine after we got OpenCV squared away. For MLC I have pushed the wheels for 0.1.0 so far, so if you try to build that container it should install them from my pip server instead of needing to compile it all.
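
Something like this should confirm the ROS environment comes up once you pull it (the check command is just a suggestion; the image's entrypoint should source ROS for you):

$ jetson-containers run dustynv/ros:humble-desktop-l4t-r36.4.0 \
    bash -c "printenv ROS_DISTRO && ros2 pkg list | wc -l"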

NanoLLM I won't be able to fully check out until next week due to some other obligations - sorry about that, and thanks for your understanding and support. You could try building it and giving it a go; most things have been working fine so far. llama.cpp and ollama are up, and both of those and MLC have OpenAI-compatible servers. Good luck!
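
As a quick sketch, once one of those servers is up you can hit it with a standard OpenAI-style request (the port and model name below are placeholders - use whatever your server actually reports):

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Hello from Jetson"}]}'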

smbunn commented 1 month ago

Thanks for the rapid rollout, Dustin. I suggest adding the newgrp docker step to the setup page, as I hit the same issue and only saw it resolved here.

Fibo27 commented 1 month ago

@dusty-nv I echo the previous comments on the awesome work you have been doing, and I understand it will take some time before everything normalizes. While creating the image for ROS 2 with VLM, I have been able to build the following images (screenshots attached).

The build process stops at jetson-inference - see the attached build log (nano_llm_iron-r36.4.0-cu126-jetson-inference_main.txt). The error "opt/jetson-inference/c/tensorNet.cpp:29:10: fatal error: NvCaffeParser.h: No such file or directory" is, I presume, due to TensorRT 10.
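
For what it's worth, a quick way to check which TensorRT the JetPack 6.1 stack actually ships (the Python one-liner assumes the tensorrt bindings are installed):

$ dpkg -l | grep -E 'nvinfer|tensorrt'
$ python3 -c "import tensorrt; print(tensorrt.__version__)"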

Best

dusty-nv commented 1 month ago

Yea, jetson-inference still needs to be updated for TensorRT 10 - I am most of the way there with it (hopefully) but had to leave for a trip. Should be back on it later this week. In this case, jetson-inference itself doesn't actually get used by NanoLLM (but jetson_utils does), so you might be able to rig it to pass the build for now.

TangmereCottage commented 1 month ago

$ jetson-containers run $(autotag nano_llm) \
    python3 -m nano_llm.vision.video --model Efficient-Large-Model/VILA1.5-3b

All nano_llm runs with VILA models seem to die with some version of this (corrupted size vs. prev_size):

Finish exporting to ...
corrupted size vs. prev_size
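
If it helps with debugging, one way to get a backtrace out of that glibc heap-corruption abort (assuming gdb is available inside the container - apt-get install gdb if it isn't):

$ jetson-containers run $(autotag nano_llm)
# then, inside the container:
$ gdb -batch -ex run -ex bt --args \
    python3 -m nano_llm.vision.video --model Efficient-Large-Model/VILA1.5-3b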
Fibo27 commented 1 month ago

> Yea, jetson-inference still needs to be updated for TensorRT 10 - I am most of the way there with it (hopefully) but had to leave for a trip. Should be back on it later this week. In this case, jetson-inference itself doesn't actually get used by NanoLLM (but jetson_utils does), so you might be able to rig it to pass the build for now.

I tried building after changing line 33 in config.py to replace jetson-inference with jetson-utils. The build process started and went on to build the transformers image, but the jetson-inference build then kicked off again. I presume there is more rigging to be done - I couldn't figure it out, though.
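
One sanity check worth trying first is confirming that jetson-utils by itself builds cleanly on CUDA 12.6 before it gets wired into the NanoLLM chain (package name as it appears in the repo):

$ CUDA_VERSION=12.6 jetson-containers build jetson-utils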

johnnynunez commented 1 month ago

> Yea, jetson-inference still needs to be updated for TensorRT 10 - I am most of the way there with it (hopefully) but had to leave for a trip. Should be back on it later this week. In this case, jetson-inference itself doesn't actually get used by NanoLLM (but jetson_utils does), so you might be able to rig it to pass the build for now.

> I tried building after changing line 33 in config.py to replace jetson-inference with jetson-utils. The build process started and went on to build the transformers image, but the jetson-inference build then kicked off again. I presume there is more rigging to be done - I couldn't figure it out, though.

I will check it

Fibo27 commented 1 month ago

@johnnynunez Any update on this?

johnnynunez commented 1 month ago

> @johnnynunez Any update on this?

I'm still checking

dusty-nv commented 1 month ago

My next idea was to change the jetson-inference dockerfile to depend on tensorrt:8.6 instead and see if that could work in the meantime... I just got back from a conference and will be looking into this soon, but am underwater - sorry to keep you guys waiting.

johnnynunez commented 1 month ago

> My next idea was to change the jetson-inference dockerfile to depend on tensorrt:8.6 instead and see if that could work in the meantime... I just got back from a conference and will be looking into this soon, but am underwater - sorry to keep you guys waiting.

I'm building the container with jetson-utils instead of jetson-inference on CUDA 12.6 and checking whether everything works. I'm having problems with onnxruntime.

johnnynunez commented 1 month ago

@dusty-nv onnxruntime-gpu is not compiling. I don't know if it is TensorRT or something else - onnxruntime 1.19.2 is pointing to TensorRT 10.2.

johnnynunez commented 1 month ago

@dusty-nv jax is also not building

Fibo27 commented 1 month ago

@dusty-nv jetson-inference, torch2trt, and onnxruntime are not building - I presume there is a connection with TensorRT.

johnnynunez commented 1 month ago

> @dusty-nv jax is also not building

This is fixed with hermetic CUDA.

johnnynunez commented 1 month ago

> @dusty-nv jetson-inference, torch2trt, and onnxruntime are not building - I presume there is a connection with TensorRT.

jetson-inference fails because it is still using TensorRT 8. torch2trt is building for me with the latest 36.4. onnxruntime I still have to check why - I think the Jetson CUDA stack is now more recent than the onnxruntime stack, but onnxruntime will match versions with Jetson in 1.20.0: https://onnxruntime.ai/roadmap

Fibo27 commented 1 month ago

OK, should I start the build by deleting the previous containers? I have already deleted the cache.

dusty-nv commented 1 month ago

Here is the initial build of the NanoLLM container for JetPack 6.1 - let me know if it works for you: dustynv/nano_llm:r36.4.0
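
A minimal way to try it (the chat example follows the NanoLLM docs; swap in whichever model you want):

$ jetson-containers run dustynv/nano_llm:r36.4.0 \
    python3 -m nano_llm.chat --api mlc --model meta-llama/Meta-Llama-3-8B-Instruct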

Fibo27 commented 1 month ago

Thanks, it does. I notice that ROS is not included - I presume I can use this as the base image and build on top of it; is that correct?
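
Something like this, perhaps (assuming the build script's --base option works the way the build docs describe - the package name here is just an example):

$ jetson-containers build --base=dustynv/nano_llm:r36.4.0 \
    --name=nano_llm_ros ros:humble-desktop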