OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License

Steps to build a yocto image with nvidia-container-tools #448

Closed: mjemv closed this issue 3 years ago

mjemv commented 4 years ago

Hi,

I am trying to build a Yocto image for the Jetson Nano with docker-ce and support for nvidia-container-tools. I stumbled upon this guide: https://blogs.windriver.com/wind_river_blog/2020/05/nvidia-container-runtime-for-wind-river-linux/

I am not using Wind River Linux, and my bblayers.conf looks like this:

```
/media/ubuntu/Data1/yocto/poky/meta \
/media/ubuntu/Data1/yocto/poky/meta-poky \
/media/ubuntu/Data1/yocto/poky/meta-yocto-bsp \
/media/ubuntu/Data1/yocto/meta-tegra \
/media/ubuntu/Data1/yocto/meta-openembedded/meta-oe \
/media/ubuntu/Data1/yocto/meta-openembedded/meta-multimedia \
/media/ubuntu/Data1/yocto/meta-openembedded/meta-networking \
/media/ubuntu/Data1/yocto/meta-openembedded/meta-filesystems \
/media/ubuntu/Data1/yocto/meta-virtualization \
/media/ubuntu/Data1/yocto/meta-openembedded/meta-python \
```
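For anyone reproducing this layer setup from scratch, the same set of layers can be added with bitbake-layers; the relative paths below are only an illustration and assume the layers are cloned next to the poky checkout. Listing the meta-openembedded layers before meta-virtualization matters, since add-layer checks layer dependencies as it goes.

```
# Run from an initialized build directory (after sourcing oe-init-build-env).
# Paths are examples only; adjust them to wherever the layers were cloned.
bitbake-layers add-layer ../meta-openembedded/meta-oe \
                         ../meta-openembedded/meta-python \
                         ../meta-openembedded/meta-networking \
                         ../meta-openembedded/meta-filesystems \
                         ../meta-openembedded/meta-multimedia \
                         ../meta-virtualization \
                         ../meta-tegra
```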

local.conf:

```
CONF_VERSION = "1"
MACHINE = "jetson-nano-qspi-sd"

LICENSE_FLAGS_WHITELIST = "commercial"

IMAGE_CLASSES += "image_types_tegra"

IMAGE_FSTYPES = "tegraflash"

GCCVERSION = "7.%"
DISTRO_FEATURES_append = " virtualization"
ENABLE_UART = "1"
IMAGE_INSTALL_append = " docker-ce"
NVIDIA_DEVNET_MIRROR = "file:///home/ubuntu/Downloads/nvidia/sdkm_downloads"
CUDA_VERSION = "10.0"
PARALLEL_MAKE = "-j 32"
BB_NUMBER_THREADS = "32"
IMAGE_INSTALL_append = " nvidia-docker nvidia-container-runtime cudnn tensorrt libvisionworks libvisionworks-sfm libvisionworks-tracking cuda-container-csv cudnn-container-csv tensorrt-container-csv libvisionworks-container-csv libvisionworks-sfm-container-csv libvisionworks-tracking-container-csv"
```

The build fails with: `Nothing RPROVIDES 'cuda-container-csv'`

I'm not sure what I am missing. If there are any additional steps, please let me know.
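For readers hitting the same `Nothing RPROVIDES` error, a quick sanity check (a sketch, using the layer path from the bblayers list above) is to grep the meta-tegra tree for whatever generates the *-container-csv packages:

```
# Look for the recipes/classes that produce the *-container-csv packages
grep -rn "container-csv" /media/ubuntu/Data1/yocto/meta-tegra \
    --include="*.bb" --include="*.bbappend" --include="*.bbclass" --include="*.inc"
```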

madisongh commented 4 years ago

The setup is a bit simpler now than it was when Pablo wrote that blog post. You don't have to explicitly include the -container-csv packages; they'll be pulled in automatically by the packages they're associated with. So something like this:

IMAGE_INSTALL_append = " nvidia-docker cudnn tensorrt libvisionworks libvisionworks-sfm libvisionworks-tracking cuda-libraries"

should work better. The exact set of packages you need to add will depend on which NGC container you intend to run.
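One way to confirm that the CSV fragments really do ride along with the main packages is to look at the resolved runtime dependencies of one of those recipes; this is only a sketch, and whether meta-tegra wires the -container-csv packages in through RDEPENDS or RRECOMMENDS is an assumption here.

```
# Inspect the resolved runtime dependency/recommendation variables for, e.g., cudnn
bitbake -e cudnn | grep -E "^(RDEPENDS|RRECOMMENDS)"
```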

mjemv commented 4 years ago

With these, I am getting fetch errors for a lot of the CUDA packages; it seems the path has moved, since https://repo.download.nvidia.com/jetson/common/pool/main/c/cuda/ does not exist.

I am using the branch dunfell-l4t-r32.4.3

ERROR: cuda-cuobjdump-10.2.89-1-r0 do_fetch: Fetcher failure for URL: 'https://repo.download.nvidia.com/jetson/common/pool/main/c/cuda/cuda-cuobjdump-10-0_10.2.89-1_arm64.deb;name=main;subdir=cuda-cuobjdump-10.2.89-1'

Can you please check?

madisongh commented 4 years ago

I saw the same, but now it's working for me. I suspect this was a problem at NVIDIA's end, probably due to their pushing out a new release. Give it another try.
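If it was indeed a transient server-side problem, clearing the failed recipe's state and re-running the fetch is usually enough; the recipe name below is the one from the error message above.

```
# Wipe the failed recipe's download/work state, re-fetch it, then resume the image build
bitbake -c cleanall cuda-cuobjdump
bitbake -c fetch cuda-cuobjdump
bitbake core-image-minimal
```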

mjemv commented 4 years ago

Should I set CUDA_VERSION to 10.2?

mjemv commented 4 years ago

So I was able to build. But do the NVIDIA packages really take up so much space?

```
build_tegra$ ls tmp/deploy/images/jetson-nano-qspi-sd/ -hl | grep flash
-rw-r--r-- 1 ubuntu ubuntu 655M Oct 24 14:05 core-image-minimal-jetson-nano-qspi-sd-20201024081103.tegraflash.tar.gz
-rw-r--r-- 2 ubuntu ubuntu  83M Oct 24 14:36 core-image-minimal-jetson-nano-qspi-sd-20201024090633.tegraflash.tar.gz
```

The latter is the one built without the `IMAGE_INSTALL_append = " nvidia-docker cudnn tensorrt libvisionworks libvisionworks-sfm libvisionworks-tracking cuda-libraries"` line.

madisongh commented 4 years ago

> Should I set CUDA_VERSION to 10.2?

You shouldn't have to. It will get set automatically.
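If you want to double-check what the layer actually selected, the effective value can be read out of the parsed environment; a minimal sketch:

```
# Show the CUDA version the build will actually use
bitbake -e core-image-minimal | grep "^CUDA_VERSION="
```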

> But do the NVIDIA packages really take up so much space?

Yes, they run quite large.
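If you want to see exactly where that extra ~570 MB goes, buildhistory gives a per-package size breakdown with very little effort; a sketch of the local.conf additions (the report path shown is the usual buildhistory layout and may differ slightly):

```
# local.conf: record image contents and per-package sizes for each build
INHERIT += "buildhistory"
BUILDHISTORY_COMMIT = "1"

# After a build, inspect for example:
#   buildhistory/images/jetson_nano_qspi_sd/glibc/core-image-minimal/installed-package-sizes.txt
```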

mjemv commented 4 years ago

I'm getting an error with libnvinfer:

https://forums.developer.nvidia.com/t/imagenet-error-while-loading-shared-libraries-usr-lib-aarch64-linux-gnu-libnvinfer-so-7-file-too-short/158543

```
root@1fce794aad39:/jetson-inference/build/aarch64/bin# ls -l /usr/lib/aarch64-linux-gnu/libnvinfer*
lrwxrwxrwx 1 root root 19 Oct 27 19:46 /usr/lib/aarch64-linux-gnu/libnvinfer.so -> libnvinfer.so.7.1.3
lrwxrwxrwx 1 root root 19 Oct 27 19:46 /usr/lib/aarch64-linux-gnu/libnvinfer.so.7 -> libnvinfer.so.7.1.3
-rw-r--r-- 1 root root  0 Jul  1 20:05 /usr/lib/aarch64-linux-gnu/libnvinfer.so.7.1.3
lrwxrwxrwx 1 root root 26 Oct 27 19:46 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so -> libnvinfer_plugin.so.7.1.3
lrwxrwxrwx 1 root root 26 Oct 27 19:46 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.7 -> libnvinfer_plugin.so.7.1.3
-rw-r--r-- 1 root root  0 Jul  1 20:05 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.7.1.3
```

madisongh commented 4 years ago

The zero length for the libraries certainly doesn't look right.

What does `ls /usr/lib/libnvinfer*` look like outside the container? Are the libraries and symlinks present? Since there isn't any documentation for that container on what its specific dependencies are, you're going to have to track them down yourself and make sure all of the packages (including -dev packages) are installed in your image. Enabling debug logging for the container runtime could help with this: try uncommenting the debug lines in /etc/nvidia-container-runtime/config.toml and see if the log files it generates help with identifying missing mappings.
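For reference, the debug settings being referred to look roughly like this once uncommented; the log file paths are whatever the packaged config.toml ships with, so treat the ones below as illustrative rather than exact.

```
# /etc/nvidia-container-runtime/config.toml (excerpt, illustrative)
[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
```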

mjemv commented 4 years ago

Is there a container that you are able to run on Yocto which uses the NVIDIA GPU? I am really trying to find a sample I can test, to verify that my headless Yocto image can be used for some kind of object recognition / AI.

madisongh commented 4 years ago

The ones I typically test with are the L4T-Base and DeepStream-L4T containers from NVIDIA's NGC catalog, running them in a demo-image-full image built from our reference distro.
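As a concrete smoke test along those lines, the L4T-Base container can be started with the NVIDIA runtime; the tag should match the L4T release the image was built against (r32.4.3 in this thread).

```
# Pull and run the L4T-Base container from NGC with GPU passthrough
docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-base:r32.4.3
```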

dwalkes commented 4 years ago

It's not what you are asking for, but we do have some example containers using L4T for image-sensor access inside a container at https://gitlab.com/boulderai/bai-edge-sdk, in case that's useful. We don't have object recognition/AI examples yet, but we plan to add those in the future.

mjemv commented 4 years ago

I was trying to run l4t-tensorflow with demo-image-full, but I am getting this error:

```
root@jetson-nano-qspi-sd:~# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-tensorflow:r32.4.3-tf1.15-py3
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/c221fd0c4c06bd899c4650f35d4578556a5500fff952a3956b954948ab1dae27/merged/etc/vulkan/icd.d/nvidia_icd.json: file exists\"\"": unknown.
```

Any idea what could be wrong here? Is there a way to enable more verbose logging?

mjemv commented 4 years ago

Surprisingly, if I build the same container from source, it works:

```
docker images
REPOSITORY                TAG                  IMAGE ID       CREATED          SIZE
l4t-tensorflow            r32.4.3-tf2.2-py3    07efcbb28832   51 minutes ago   2.45GB
l4t-tensorflow            r32.4.3-tf1.15-py3   f1f1704425aa   2 hours ago      2.09GB
                                               aae1c98335f1   2 hours ago      2.17GB
nvcr.io/nvidia/l4t-base   r32.4.3              c93fc89026d9   4 months ago     631MB

root@jetson-nano-qspi-sd:~# nvidia-docker run -it l4t-tensorflow:r32.4.3-tf1.5-py3
Unable to find image 'l4t-tensorflow:r32.4.3-tf1.5-py3' locally
docker: Error response from daemon: pull access denied for l4t-tensorflow, repository does not exist or may require 'docker login': denied: re. See 'docker run --help'.
root@jetson-nano-qspi-sd:~# nvidia-docker run -it l4t-tensorflow:r32.4.3-tf1.15-py3
...
root@de1c662da643:/# cat test.py
import tensorflow as tf
print('Num GPUs Available: ', len(tf.config.experimental.list_physical_devices('GPU')))
root@de1c662da643:/# python3 test.py
2020-11-05 10:31:26.751191: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2020-11-05 10:31:36.239577: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-05 10:31:36.251839: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-11-05 10:31:36.251990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2020-11-05 10:31:36.252069: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-11-05 10:31:36.255843: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-05 10:31:36.258868: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-05 10:31:36.259881: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-05 10:31:36.266370: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-05 10:31:36.269457: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-05 10:31:36.270180: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-11-05 10:31:36.270393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-11-05 10:31:36.270597: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] ARM64 does not support NUMA - returning NUMA node zero
2020-11-05 10:31:36.270678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
Num GPUs Available:  1
```

What could change when building locally vs running the stock NGC container?

madisongh commented 4 years ago

> What could change when building locally vs running the stock NGC container?

The filesystem layout is different between stock L4T and OE/Yocto builds, and the error you're seeing with the problematic containers is due to L4T using a symlink for /etc/vulkan/icd.d/nvidia_icd.json, whereas in our builds we just drop the actual JSON file in that location. The containers also include the L4T-style symlink for some reason, rather than taking advantage of the runtime's automatic passthrough, and that symlink conflicts with the actual file we're passing through when the overlay filesystem is composed for the container.

I think the least awful fix for this is to match the symlink setup for that file in our builds to make them compatible. That will add a /usr/lib/aarch64-linux-gnu/tegra directory in the root filesystem, which is kind of ugly, but it appears the Vulkan loader can't handle the zero-length /etc/vulkan/icd.d JSON file that would still be present in the container if we were to relocate our copy of the file to a different directory in its search path.

There's a similar issue with the libglvnd config files, but that library searches for its config files in multiple locations, and we install our config in a different place than L4T does, so there's no conflict - just some symlinks and 0-length files visible in the container that don't cause any errors.
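A quick way to observe the difference described above on a running system is to compare the ICD file on the host with what the runtime maps into a container; a sketch, assuming the r32.4.3 l4t-base container is available:

```
# On the Yocto host: in meta-tegra builds this is a regular file; on stock L4T it is a symlink
ls -l /etc/vulkan/icd.d/nvidia_icd.json

# Inside a container started with the NVIDIA runtime: see what the prestart hook mapped in
docker run --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.4.3 ls -l /etc/vulkan/icd.d/
```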