High level guidance/status on CUDA 12.6 and L4T 36.4.0

TangmereCottage commented 1 month ago

Struggling here with NanoLLM, mlc llm, torch, and torchvision on CUDA 12.6 and 36.4.0.

Ask: I would be grateful for high level status info - will 12.6 be broadly supported soon, or, shall I downgrade to 12.2 (e.g.) and wait a few weeks for the dust to settle?

Main issue - trying out various "hello world" type commands across jetson-containers results in error messages and little else.

BTW thanks @dusty-nv for everything you are doing and this heroic effort, especially during the CUDA 12.2->12.6 transition. Seems like step 1 is to:

$ git pull # to get all the most recent bugfixes
$ CUDA_VERSION=12.6 jetson-containers build transformers
# wait an hour
Done building container transformers:r36.4.0-cu126
# woohoo!

$ jetson-containers run $(autotag l4t-pytorch)
Namespace(packages=['l4t-pytorch'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.4.0  JETPACK_VERSION=6.1  CUDA_VERSION=12.6
-- Finding compatible container image for ['l4t-pytorch']
[sudo] password: 
Found compatible container dustynv/l4t-pytorch:r36.4.0 (2024-09-30, 6.3GB) - would you like to pull it? [Y/n] Y
dustynv/l4t-pytorch:r36.4.0
V4L2_DEVICES: 
csi_indexes: 
localuser:root being added to access control list
+ docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/jan/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-3 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-6 --device /dev/i2c-7 --device /dev/i2c-8 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock --name my_jetson_container dustynv/l4t-pytorch:r36.4.0

# dies with 
docker: unknown server OS: .

Additional info - this is a super-boring, no modifications, bare metal fresh install using the Nvidia SDK, with everything docker moved to SSD mnt/docker per your instructions.

-- L4T_VERSION=36.4.0
-- JETPACK_VERSION=6.1
-- CUDA_VERSION=12.6
-- PYTHON_VERSION=3.10
-- LSB_RELEASE=22.04 (jammy)

TangmereCottage commented 1 month ago

Ah, so, the docker: unknown server OS: . was my fault - forgot to run newgrp docker after sudo usermod -aG docker $USER. After that jetson-containers run $(autotag l4t-pytorch) works on 12.6! Yay.

Hi-level skinny for everyone:

$ git pull # to get all the most recent bugfixes
$ CUDA_VERSION=12.6 jetson-containers build transformers

dusty-nv commented 1 month ago

Hi @TangmereCottage, for ROS I have pushed dustynv/ros:humble-desktop-l4t-r36.4.0 awhile, the ROS builds seem to be working fine after we got OpenCV squared. MLC I have pushed the wheels for 0.1.0 so far, so if you try to build that container, it should install them from my pip server as opposed to needing to compile it all.

NanoLLM I won't be able to fully check out until next week due to some other obligations I have, sorry about that and thanks for your understanding and support. You could try building it and giving it a go, most things have been working fine so far. Llama.cpp and ollama are up, both those and MLC have OpenAI-compliant servers. Good luck!

smbunn commented 1 month ago

Thanks for the rapid roll out Dustin. I suggest you add the newgrp docker information to the setup page information as I had the same issue and saw it resolved here.

Fibo27 commented 1 month ago

@dusty-nv I echo the previous comments on the awesome work that you have been doing and understand that it will take sometime before everything normalizes. While creating the image for ros2 with vlm, i have been able to build the following images

The build process stops at Jetson-Inference - see attached build log. The error "opt/jetson-inference/c/tensorNet.cpp:29:10: fatal error: NvCaffeParser.h: No such file or directory" i presume is due to Tensorrt10 nano_llm_iron-r36.4.0-cu126-jetson-inference_main.txt

Best

dusty-nv commented 1 month ago

Yea, jetson-inference is still needing updated for TRT10, I am most of the way there with it (hopefully) but had to leave for a trip. Should be back on it later this week. In this case... jetson-inference itself doesn't actually get used in the NanoLLM software (but jetson_utils does). So you might be able to rig it to pass the build for now.

TangmereCottage commented 1 month ago

jetson-containers run $(autotag nano_llm)   python3 -m nano_llm.vision.video     --model Efficient-Large-Model/VILA1.5-3b

All nano_llm with VILA models seem to die in some version of this (corrupted size vs. prev_size):

Finish exporting to ...
corrupted size vs. prev_size

Fibo27 commented 1 month ago

Yea, jetson-inference is still needing updated for TRT10, I am most of the way there with it (hopefully) but had to leave for a trip. Should be back on it later this week. In this case... jetson-inference itself doesn't actually get used in the NanoLLM software (but jetson_utils does). So you might be able to rig it to pass the build for now.

I tried building by changing line 33 in config.py and replacing Jetson-inference with jetson-utils. The build process started and it went on to build the transformers image but the Jetson-inference build process restarted. I presume there is more rigging to be done - I couldn't figure that out though