dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License
7.75k stars, 2.97k forks

Docker fails after reboot. #1795

Open smbunn opened 7 months ago

smbunn commented 7 months ago

I did a clean install of JetPack 6.0DP on my Jetson Orin Nano, onto a 500 GB NVMe drive. All good and everything installed. Then I installed jetson-inference and jetson-containers from this site. Everything ran perfectly. I tested almost all the examples in jetson-inference and all were good.

Then I rebooted my Jetson. Now Docker fails to start. If I run sudo systemctl status docker, it just says it failed to load. There is no real error message beyond docker.service: Failed with result 'error code'. The process is ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE).

Any ideas?

dusty-nv commented 7 months ago

Hi @smbunn, can you run sudo journalctl -u docker.service, which will hopefully give more detailed logs? If you followed the steps from the System Setup, had you previously rebooted since changing your /etc/docker/daemon.json? Had you run apt-get upgrade?
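The journal for a failing service can be long, so it may help to pull out only the lines that look like errors. A minimal sketch, assuming bash and grep; the helper name docker_failure_lines and its match pattern are illustrative choices, not part of Docker or jetson-containers:

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only lines that look like errors from log text on stdin.
# The pattern is an assumption; widen it if it misses the relevant message.
docker_failure_lines() {
    grep -iE 'failed|error|fatal' || true   # "|| true": no match is not a failure
}

# Intended use on the Jetson (current boot only, no pager):
#   sudo journalctl -u docker.service -b --no-pager | docker_failure_lines
# Check that the daemon config edited during System Setup is still valid JSON:
#   python3 -m json.tool /etc/docker/daemon.json
```

The function only filters text, so it can also be run over a saved log file such as the journalctl.txt attachment.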

smbunn commented 7 months ago

I followed all the steps, including cutting and pasting into the daemon.json file. I have accepted the software updates prompted by Ubuntu 22.04 on my Jetson Orin Nano. Attached is the output from the journalctl request: journalctl.txt

smbunn commented 7 months ago

Is it the "failed to register bridge" that is the issue?

dusty-nv commented 7 months ago

https://forums.docker.com/t/docker-service-failed-job-for-docker-service-failed-because-the-control-process-exited-with-error-code/139843/3

Yes. The jetson-inference and jetson-containers install steps do not run apt upgrade.

smbunn commented 7 months ago

So never use apt-get update or apt-get upgrade, nor accept Software Updates? I tried the sudo apt reinstall docker-ce suggested by the link you sent but at the end I get Could not execute systemctl: at /usr/bin/deb-systemd-invoke line 142.

Do I really have to start again from scratch and keep my Jetson Nano completely un-updated forever?

dusty-nv commented 7 months ago

I'm not sure which packages are affected by that bug, but you could use apt-mark hold on them until it's resolved.
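Holding packages could look roughly like the sketch below. The helper name hold_packages and the DRY_RUN switch are hypothetical. docker-ce is used only because it is the package named in this thread; the full list of affected packages is not known.

```shell
#!/usr/bin/env bash
# Hypothetical helper: hold the given packages at their current version.
# With DRY_RUN=1 (the default here) it only prints the commands it would run.
hold_packages() {
    local pkg
    for pkg in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo apt-mark hold $pkg"
        else
            sudo apt-mark hold "$pkg"
        fi
    done
}

# Example (package list is an assumption, see above):
#   hold_packages docker-ce              # preview
#   DRY_RUN=0 hold_packages docker-ce    # apply
# Review and undo later:
#   apt-mark showhold
#   sudo apt-mark unhold docker-ce
```

A held package is skipped by apt upgrade until it is released with apt-mark unhold, so the hold is easy to reverse once a fix is available.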

smbunn commented 7 months ago

The only thing I did that I will really miss is the long process to have OpenCV with CUDA enabled. That took quite a few steps and quite a bit of effort. Hours of effort. I imaged my drive after I did this but overwrote that image with the working copy with jetson-inference installed. Now that image is not working as of course I hadn't rebooted it :-(

dusty-nv commented 7 months ago

You can find links to my OpenCV+CUDA binaries here - they are tarballs of opencv deb's that get installed in the containers by opencv_install.sh

https://github.com/dusty-nv/jetson-containers/blob/d8992335108db11b4e003db0d4cf03cf2a1cb5b6/packages/opencv/config.py#L25

Sorry, if you can't fix what the upgrade did to docker, then yea I would re-flash one more time and then not upgrade it until issue is resolved or you know what packages to pin.

smbunn commented 7 months ago

Can I use SDK Manager on my other Ubuntu machine to just re-install from fresh or do I have to boot with the jumper installed and format the NVMe drive?

dusty-nv commented 7 months ago

I just flash the OS to eMMC or SD card, and put all my projects/data/containers on the NVME. Did you have the OS installed to the NVME?

Regardless, you would need to boot into recovery mode using the jumper, while the USB-C cable is attached to your other Ubuntu machine, where SDK Manager will flash it from.

smbunn commented 7 months ago

I have the jumper installed and am using lsusb on my other Ubuntu machine, but I don't see the Jetson at all. Looks like I might have to remove the NVMe drive, install it on my Ubuntu server, format it, re-install it on the Jetson and try again. No idea why the Jetson is not responding to a cold re-initialize.

dusty-nv commented 7 months ago

The NVMe drive shouldn't affect whether it goes into recovery mode. Double check that the jumper location and USB connection are correct, and if you are having issues flashing the device I would recommend you post it to the forums. You could quickly remove the NVMe drive from your Jetson just to eliminate any possibility of that, but I don't believe it would prevent going into recovery mode.
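One way to check the USB connection from the host PC is to look for NVIDIA's USB vendor ID (0955) in the lsusb output. A minimal sketch, assuming bash and grep; the helper name nvidia_usb_present is hypothetical, and it matches any NVIDIA USB device, not only a Jetson in recovery mode:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if lsusb output on stdin lists an NVIDIA device.
# 0955 is NVIDIA's USB vendor ID; the product ID differs between Jetson modules,
# so only the vendor part is matched here.
nvidia_usb_present() {
    grep -qiE 'ID 0955:[0-9a-f]{4}'
}

# Intended use on the host PC, with the recovery jumper fitted and USB-C connected:
#   lsusb | nvidia_usb_present && echo "NVIDIA device detected" || echo "no NVIDIA USB device"
```

If nothing is reported, the cable, the port, or the jumper position are the first things to recheck.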

smbunn commented 7 months ago

Nowhere have I ever seen a diagram of which two pins to jumper. I followed the video on YouTube https://www.youtube.com/watch?v=Ucg5Zqm9ZMk and guessed from the camera angle that they were the last but two pins at the right-hand end of the header under the fan unit. The YouTube clip calls them 9 and 10 but they appear to be labelled FC Rec and GND. Right?

smbunn commented 7 months ago

My USB-C socket needed crimping a bit with needle nose pliers. Now the connection is secure and I have everything back to base. I am using Jetpack 6.0DP. This is the latest and I think it is Ubuntu Pro, can anyone confirm this?

dusty-nv commented 7 months ago

@smbunn JetPack 6.0 comes with Ubuntu 22.04

smbunn commented 7 months ago

Does everything have to be in a separate container? What do I do if I just want OpenCV with CUDA support in the main environment? You mention deb files above but I couldn't find them.

smbunn commented 7 months ago

Note above where I asked whether "failed to register bridge" is the issue. It looks like this is the case, according to this thread on the NVIDIA site: https://forums.developer.nvidia.com/t/docker-gives-error-after-upgrading-ubuntu/283563

smbunn commented 7 months ago

I looked at run.sh and it has --network host, which to me suggests it never uses the docker0 bridge. But Docker itself cannot start up and fails on the bridge connection. I have tried the repair indicated in the NVIDIA developer forum but, like many others, have had to report that it does not work.

So at present I maintain a clean "just installed" image of my Jetson Orin Nano with JetPack 6.0 and try to make sure no updates ever occur. This gets tricky as some installs want you to run update and upgrade. Hopefully someone will figure out how to repair the bridge in Docker so we can resume the normal Ubuntu update process. When I accidentally allow a script to run that updates, I clone my NVMe drive back to the original and start again. I have now done this so often that I keep 2 NVMe drives on the go so I can always swap to the clean one when Docker fails.

I will add that this is still worthwhile as the dusty-nv containers are awesome! I am really enjoying using them.

nlitz88 commented 7 months ago

@smbunn We ran into this same issue on our AGX Orin running JetPack 6. I'm not sure what we did differently, but we ran the exact commands mentioned in the post you linked above (after first running the commands specified in the issue they link to) and that seemed to do the trick. We are now able to run the image using the run_dev.sh script.

AZSupra commented 7 months ago

I just tried this and it worked for me. Someone posted it in the NVIDIA forum thread mentioned above:

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo apt reinstall docker-ce

smbunn commented 7 months ago

I also ran the commands to set iptables to legacy and this worked perfectly. It was discussed on https://forums.developer.nvidia.com/t/docker-gives-error-after-upgrading-ubuntu/283563/11
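The workaround reported in this thread can be wrapped so that it is previewed before it is applied. This is a sketch only: the run and switch_iptables_to_legacy helpers and the DRY_RUN switch are hypothetical, while the two commands themselves are the ones quoted above.

```shell
#!/usr/bin/env bash
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

switch_iptables_to_legacy() {
    # Point the iptables alternative at the legacy backend.
    run sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
    # Reinstall docker-ce, the second step reported in the thread.
    run sudo apt reinstall docker-ce
}

# Usage:
#   switch_iptables_to_legacy              # preview
#   DRY_RUN=0 switch_iptables_to_legacy    # apply
# Afterwards, check the result:
#   update-alternatives --query iptables
#   sudo systemctl status docker --no-pager
```

To revert the alternative later, sudo update-alternatives --auto iptables returns it to automatic selection.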

smbunn commented 7 months ago

Figured out how to install OpenCV with CUDA in the main environment. Ubuntu is still fairly new to me. I found your gz files and the deb name from the link you gave above, then ran:

sudo ./jetson-containers/packages/opencv/opencv_install.sh https://nvidia.box.com/shared/static/ngp26xb9hb7dqbu6pbs7cs9flztmqwg0.gz OpenCV-4.8.1-aarch64.tar.gz

Worked a treat, and now I have OpenCV 4.8.1 installed in the base environment with CUDA support.
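To confirm that an install like this really has CUDA enabled, the OpenCV build information can be checked. A minimal sketch, assuming bash, grep, and that the Python cv2 bindings were installed with the debs; the helper name opencv_build_has_cuda is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if OpenCV build information on stdin reports CUDA.
# Assumes the usual "NVIDIA CUDA: YES (...)" line printed by cv2.getBuildInformation().
opencv_build_has_cuda() {
    grep -qE 'NVIDIA CUDA:[[:space:]]*YES'
}

# Intended use on the Jetson:
#   python3 -c 'import cv2; print(cv2.getBuildInformation())' | opencv_build_has_cuda \
#       && echo "CUDA enabled" || echo "CUDA not reported"
# A second check: number of CUDA devices OpenCV can see (0 means none).
#   python3 -c 'import cv2; print(cv2.cuda.getCudaEnabledDeviceCount())'
```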

Thanks Dusty!