OE4T / tegra-demo-distro

Reference/demonstration distro for meta-tegra
MIT License

random build failures on kirkstone with custom TX2 hardware #261

Closed sarnold closed 1 year ago

sarnold commented 1 year ago

We have ported the vc_mipi_camera bits to the current meta-tegra kernel on the latest kirkstone branch (kirkstone-l4t-r32.7.x).

The test device has 2 cameras on the CapableRobots baseboard, with the Yocto machine set to jetson-xavier-nx-devkit-tx2-nx. Most of the time the demo image builds fine, except when it doesn't:

| aarch64-oe4t-linux-gcc-ar -rv egldevice/libnvgldemo.a egldevice/nvgldemo_main.o
| r - egldevice/nvgldemo_main.o
| aarch64-oe4t-linux-gcc-ar -rv egldevice/libnvgldemo.a egldevice/nvgldemo_parse.o
| r - egldevice/nvgldemo_parse.o
| aarch64-oe4t-linux-gcc-ar -rv egldevice/libnvgldemo.a egldevice/nvgldemo_shader.o
| aarch64-oe4t-linux-gcc-ar -rv egldevice/libnvgldemo.a egldevice/nvgldemo_texture.o
| r - egldevice/nvgldemo_shader.o
| aarch64-oe4t-linux-gcc-ar -rv egldevice/libnvgldemo.a egldevice/nvgldemo_socket.o
| r - egldevice/nvgldemo_socket.o
| /home/nerdboy/my_stuff/home/hardware/tegra-demo-distro/build/tmp/work/armv8a_tegra-oe4t-linux/l4t-graphics-demos/32.7.3-20221122092958-r0/recipe-sysroot-native/usr/bin/aarch64-oe4t-linux/../../libexec/aarch64-oe4t-linux/gcc/aarch64-oe4t-linux/11.3.0/ar: egldevice/libnvgldemo.a: error reading nvgldemo_shader.o: file truncated
| r - egldevice/nvgldemo_texture.o
| make[1]: *** [<builtin>: egldevice/libnvgldemo.a(egldevice/nvgldemo_socket.o)] Error 1
| make[1]: *** Waiting for unfinished jobs....
| rm egldevice/nvgldemo_os_posix.o egldevice/nvgldemo_texture.o egldevice/nvgldemo_cqueue.o egldevice/nvgldemo_preswap.o egldevice/nvgldemo_main.o egldevice/nvgldemo_shader.o egldevice/nvgldemo_win_egldevice.o egldevice/nvgldemo_parse.o egldevice/nvgldemo_socket.o egldevice/nvgldemo_math.o
| make[1]: Leaving directory '/home/nerdboy/my_stuff/home/hardware/tegra-demo-distro/build/tmp/work/armv8a_tegra-oe4t-linux/l4t-graphics-demos/32.7.3-20221122092958-r0/l4t-graphics-demos/usr/src/nvidia/graphics_demos/nvgldemo'
| make: *** [Makefile:118: ../nvgldemo/egldevice/libnvgldemo.a] Error 2
| make: *** Waiting for unfinished jobs....
| make: Leaving directory '/home/nerdboy/my_stuff/home/hardware/tegra-demo-distro/build/tmp/work/armv8a_tegra-oe4t-linux/l4t-graphics-demos/32.7.3-20221122092958-r0/l4t-graphics-demos/usr/src/nvidia/graphics_demos/ctree'
| ERROR: oe_runmake failed

Once that ^^ happens, no amount of cleaning/rebuilding seems to help, and I have no idea what the root cause is. Help??

sarnold commented 1 year ago

I moved that ^^ build dir out of the way, ran setup-env, and copied local.conf/bblayers.conf into the new, clean build dir. Building the same image (demo-image-sato) again then works fine.

dwalkes commented 1 year ago

@sarnold in my experience, these are typically the kinds of errors you see when your build host runs out of memory. Cleaning the specific recipe that is failing usually resolves them.
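One way to test the out-of-memory theory is to look for OOM-killer messages in the kernel log right after a failure. The following is a hedged Python sketch, not part of any Yocto tooling: `find_oom_lines`, `read_kernel_log` and `report_oom` are hypothetical helpers, and it assumes `dmesg` is readable by the build user (some distros restrict it to root) and that the kernel uses its usual `invoked oom-killer` / `Out of memory: Kill(ed) process` wording.

```python
import re
import subprocess

# Message fragments the Linux kernel typically logs when the OOM killer runs.
# Older kernels print "Kill process", newer ones "Killed process".
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill(?:ed)? process"),
]


def find_oom_lines(log_text):
    """Return the kernel-log lines that look like OOM-killer activity."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in OOM_PATTERNS)
    ]


def read_kernel_log():
    """Return `dmesg` output, or "" if dmesg is missing or could not be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    # If dmesg is restricted, stdout is empty and the error goes to stderr.
    return result.stdout


def report_oom():
    """Print any OOM-killer lines found in the current kernel log."""
    hits = find_oom_lines(read_kernel_log())
    if not hits:
        print("No OOM-killer messages found (or dmesg was not readable).")
    for line in hits:
        print("possible OOM:", line)
```

Calling `report_oom()` immediately after a failed `bitbake` run shows whether the kernel killed any compiler processes. An empty result does not rule out memory pressure, but it makes a simple OOM kill less likely.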

sarnold commented 1 year ago

Sorry, I thought that was clearer: I couldn't clean my way out of that one; only a fresh build dir helped.

madisongh commented 1 year ago

It sounds environmental... what are you using for a build host (OS, CPU count, RAM size) and for storage of your build tree (device type, filesystem type, size/occupied/free)? Are there any kernel errors logged when you run into the problem?
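The host details asked for here can be gathered in one go. This is a hedged Python sketch with hypothetical helper names (`mem_total_kib`, `host_report`); it assumes a Linux host with `/proc/meminfo`, and it does not report device or filesystem type (`df -T` on the build directory covers that).

```python
import os
import platform
import shutil

GIB = 1024 ** 3


def mem_total_kib(meminfo_path="/proc/meminfo"):
    """Return MemTotal (in KiB) from /proc/meminfo, or None if unreadable."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def host_report(build_dir):
    """Summarize OS, CPU count, RAM and disk space for the build tree's filesystem."""
    usage = shutil.disk_usage(build_dir)
    mem_kib = mem_total_kib()
    return {
        "os": platform.platform(),
        "cpu_count": os.cpu_count(),
        # MemTotal is in KiB; convert to GiB.
        "ram_gib": round(mem_kib * 1024 / GIB, 1) if mem_kib else None,
        "disk_total_gib": round(usage.total / GIB, 1),
        "disk_used_gib": round(usage.used / GIB, 1),
        "disk_free_gib": round(usage.free / GIB, 1),
    }
```

For example, `for key, value in host_report("build").items(): print(f"{key}: {value}")` prints one line per item, where `"build"` stands in for whatever directory holds the build tree.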

sarnold commented 1 year ago

The build host is an Intel Core i7 with 8 CPU threads and 32 GB RAM, plus ~130 GB free (for tegra) on an NVMe SSD. I have several build trees for rpi, pine64, rockchip and marvell, and I don't normally have issues. Most of them build directly under a plain ("bare") user account, but xilinx needs to build in a VM running bionic, as the project is stuck on the 2020.x release of petalinux.

$ free
               total        used        free      shared  buff/cache   available
Mem:        32732296     4214848     2811928         348    25705520    28041948
Swap:        4194300       41472     4152828

and

$ df -h
Filesystem                                       Size  Used Avail Use% Mounted on
devtmpfs                                          10M     0   10M   0% /dev
tmpfs                                             16G   68K   16G   1% /dev/shm
tmpfs                                             16G  708K   16G   1% /run
/dev/nvme0n1p3                                   464G  345G   96G  79% /
cgroup_root                                       10M     0   10M   0% /sys/fs/cgroup
/dev/nvme0n1p1                                   973M  383M  536M  42% /boot
/dev/sda1                                        110G  612M  104G   1% /public
tmpfs                                            3.2G  5.7M  3.2G   1% /run/user/1000

The above was captured after the graphics-demos error, while rust-native was still building.

madisongh commented 1 year ago

The RAM and free disk space should be enough, I think, although more free disk space would probably be safer. No kernel messages about either OOM or filesystem errors on whichever drive you're running the build from? Does the filesystem ever hit usage of 90% or more?
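To answer the 90% question during a long build, the filesystem holding the build tree can be polled while bitbake runs. This is a hedged Python sketch; `df_use_percent` and `watch` are hypothetical helpers. It assumes GNU df's convention of computing Use% as used / (used + available), rounded up, so blocks reserved for root are excluded and the figure can be higher than a naive used / size.

```python
import math
import shutil
import time


def df_use_percent(used, avail):
    """Use% the way df reports it: used / (used + available), rounded up."""
    if used + avail == 0:
        return 0
    return math.ceil(100 * used / (used + avail))


def watch(path, threshold=90, interval_s=30, rounds=None):
    """Poll the filesystem holding `path`; warn when Use% reaches `threshold`.

    `rounds=None` polls forever; a number stops after that many checks.
    """
    n = 0
    while rounds is None or n < rounds:
        usage = shutil.disk_usage(path)
        # shutil.disk_usage reports `free` as space available to non-root users.
        pct = df_use_percent(usage.used, usage.free)
        if pct >= threshold:
            print(f"WARNING: {path} is at {pct}% (>= {threshold}%)")
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval_s)
```

With the rounded figures from the `df -h` output above (345G used, 96G available), `df_use_percent(345, 96)` gives 79, matching the reported Use% even though 345G is only about 74% of the 464G size. Running `watch("build")` in a second terminal during the build would flag any excursion to 90% or more.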

The Rust compiler build is pretty resource-intensive, and CUDA application builds are, too. I wouldn't expect the graphics demos to be that bad, though - no CUDA there, just EGL.

ichergui commented 1 year ago

Hey @sarnold, any updates on this issue?

sarnold commented 1 year ago

Sorry, I got busy with xilinx again, and we had already given the tegra customer a workable meta layer for his camera stuff. I suspect our build issue was caused by too much distro-feature hacking; I ended up doing less of that and using a really light openbox config, which seemed stable enough for dev work.

ichergui commented 1 year ago

I will close this ticket. Feel free to open a new one if you run into any further issues.