Do you have a serial console attached to the device? Capturing the serial console output (and saving the boot logs/journal) when this occurs should help in figuring out what's going on.
Also, which branch are you using?
@madisongh I do have a console. Would dmesg suffice, or should I just capture everything that goes to the console from power-on? I can also provide any other specific files if you can think of what would help. I'll do this next time it happens.
@arrow53 in case it's helpful, there are some tests at https://github.com/mendersoftware/meta-mender-community/tree/dunfell/meta-mender-tegra/scripts/test which I've run successfully on Xavier NX in the past. These do 100 mender updates, verifying partition swaps between them.
Thanks @dwalkes I'll take a look. That lends some confidence that I'm doing something dumb. Appreciate that you've done that level of testing.
To get messages from the bootloader(s), you need to capture everything from power-on. If you're logging OS startup messages to the serial console, that's good too - you'd be looking for the time where either a bootloader failed to boot the OS, or there was an OS startup failure before the system completely booted.
As best I can tell, my image just got too big. The actual mmc partition didn't run out of room, but when I removed dev-pkgs from my image the upgrade started working again. I'm going to close this and I'll comment later if I discover anything definitive.
@madisongh
rather than dumping the whole boot output, here is a selection showing before and after
[0000.085] I> Active Boot chain : 0
...
[0002.261] I> first bootslot SLOT A:
[0002.261] I> bootslot full_suffix false and slot is A
[0002.266] I> Active slot suffix:
[0002.269] I> bootslot error:
[0002.272] I> boot-order :-
[0000.085] I> Active Boot chain : 1
...
[0002.274] I> first bootslot SLOT B:
[0002.274] I> bootslot SLOT B:
[0002.276] I> Active slot suffix: _b
[0002.279] I> bootslot error: _b
[0002.282] I> boot-order :-
If I pop out the SD card I can see differences in the filesystem. The upgrade only seems to affect one partition, and regardless of what I do with nvbootctrl set-active-boot-slot I always see the same filesystem.
At the moment it seems no matter what I do it's stuck pointing to /dev/mmcblk0p11 and never uses /dev/mmcblk0p1
If I'm on B and I issue a mender -install I see
INFO[0000] Mender running on partition: /dev/mmcblk0p1
INFO[0000] Opening device "/dev/mmcblk0p11" for writing
But, I'm pretty sure it should be the other way around. If I can figure out what controls this maybe I can look there to see what is wrong?
I don't recognize some of those cboot messages, and can't find them in the code - do you have the right cboot? It should be reporting Cboot Version: 32.04.04-oe4t-t194-583676d8 when it starts. Despite the odd messages, it looks like cboot is identifying the slots correctly. What does the kernel command line look like? You should see a boot.slot_suffix= entry in it.
Here's an example, showing cboot loading the kernel and DTB from the corresponding slot-suffixed partition and showing the command line with the slot suffix.
[0004.126] I> ########## SD boot ##########
[0004.126] I> Found sdcard
[0004.128] I> regulator 'vdd-sdmmc1-sw' already enabled
[0004.131] I> regulator 'vdd-sdmmc1-sw' already enabled
[0004.158] I> sdmmc SDR mode
[0004.173] I> -0 params source =
[0004.173] I> Already published: 00060000
[0004.173] I> Look for boot partition
[0004.173] I> Fallback: assuming 0th partition is boot partition
[0004.174] I> Detect filesystem
[0004.194] I> Loading extlinux.conf ...
[0004.195] I> rootfs path: /sd/boot/extlinux/extlinux.conf
[0008.332] I> lookup_linear_dir:441: Invalid file block num
[0008.333] I> ext2_walk:142: 'extlinux' lookup failed
[0008.333] I> ext4_open_file:647: '/boot/extlinux/extlinux.conf' lookup failed
[0008.334] E> file /sd/boot/extlinux/extlinux.conf open failed!!
[0008.334] E> Failed to find/load /boot/extlinux/extlinux.conf
[0008.336] I> Fallback: Load binaries from partition
[0008.341] I> Active slot suffix: _b
[0008.344] I> Loading kernel_b ...
[0013.805] I> Loading kernel-dtb_b ...
[0013.844] I> Validate kernel ...
[0013.845] I> T19x: Authenticate kernel (bin_type: 37), max size 0x5000000
[0014.241] I> Validate kernel-dtb ...
[0014.242] I> T19x: Authenticate kernel-dtb (bin_type: 38), max size 0x400000
[0014.245] I> Checking boot.img header magic ... [0014.245] I> [OK]
[0014.245] I> Kernel hdr @0xa42b0000
[0014.246] I> Kernel dtb @0x90000000
[0014.246] I> decompressor handler not found
[0014.246] I> Copying kernel image (40024072 bytes) from 0xa42b0800 to 0x80080000 ... [0014.260] I> Done
[0014.261] I> Move ramdisk (len: 0) from 0xa68dc800 to 0x91000000
[0014.262] I> Updated bpmp info to DTB
[0014.263] I> Ramdisk: Base: 0x91000000; Size: 0x0
[0014.267] I> Updated initrd info to DTB
[0014.270] E> tegrabl_linuxboot_add_disp_param, du 0 failed to get display params
[0014.278] E> tegrabl_linuxboot_add_disp_param, du 0 failed to get display params
[0014.285] E> tegrabl_linuxboot_add_disp_param, du 0 failed to get display params
[0014.292] I> Active slot suffix: _b
[0014.295] I> add_boot_slot_suffix: slot_suffix = _b
[0014.301] I> Linux Cmdline: console=ttyTCU0,115200 console=tty0 fbcon=map:0 video=tegrafb no_console_suspend=1 earlycon=tegra_comb_uart,mmio32,0x0c168000 gpt usbcore.old_scheme_first=1 tegraid=19.1.2.0.0 maxcpus=6 boot.slot_suffix=_b boot.ratchetvalues=0.4.2 vpr_resize sdhci_tegra.en_boot_part_access=1
[0014.327] I> Updated bootarg info to DTB
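For checking a captured serial log against the example above, a minimal Python sketch along these lines could pull out the last "Linux Cmdline:" entry and report whether boot.slot_suffix= and root= are present. The function names are illustrative, not part of cboot or meta-tegra, and the sample line is shortened from the log above.

```python
import re


def cmdline_from_cboot_log(log_text):
    """Return the last 'Linux Cmdline:' payload in a serial capture, or None."""
    matches = re.findall(r"Linux Cmdline:\s*(.*)", log_text)
    return matches[-1].strip() if matches else None


def parse_cmdline(cmdline):
    """Split a kernel command line into a dict.

    Bare flags (e.g. 'gpt') map to None. Repeated keys such as
    'console=' keep only the last value, which is fine for this check.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Shortened sample taken from the cboot output shown above.
sample_log = (
    "[0014.295] I> add_boot_slot_suffix: slot_suffix = _b\n"
    "[0014.301] I> Linux Cmdline: console=ttyTCU0,115200 console=tty0 gpt "
    "boot.slot_suffix=_b vpr_resize\n"
)

params = parse_cmdline(cmdline_from_cboot_log(sample_log))
print("slot suffix:", params.get("boot.slot_suffix"))
print("hard-coded root= present:", "root" in params)
```

On a running system the same parse_cmdline helper could be fed the contents of /proc/cmdline instead of a log capture.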
One more thing... do you have a custom initramfs, by any chance? The logic for extracting the boot slot suffix from the kernel command line (which cboot is responsible for adding), and mounting the rootfs based on that, is in the init scripts (provided by tegra-initrdscripts) that go into the tegra-minimal-initramfs image. If you have a custom initramfs, it, too, needs to do the same thing.
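As a rough illustration of what such an init step has to do, here is a Python sketch of the selection logic. It is not the tegra-initrdscripts implementation (those are shell scripts whose details are not shown in this thread). The suffix-to-partition map is hypothetical, using the two device names mentioned earlier in this issue, and the rule that an explicit root= wins is an assumption for the sketch.

```python
# Hypothetical slot-to-partition map; the real layout depends on how the
# device was flashed. The device names are the two seen in this thread.
ROOTFS_BY_SUFFIX = {
    "": "/dev/mmcblk0p1",     # slot A (empty suffix), assumed
    "_b": "/dev/mmcblk0p11",  # slot B, assumed
}


def pick_rootfs(cmdline):
    """Choose the rootfs device from a kernel command line.

    Assumption for this sketch: an explicit root= argument takes
    precedence, otherwise the device follows boot.slot_suffix=.
    """
    root = None
    suffix = ""
    for token in cmdline.split():
        if token.startswith("root="):
            root = token[len("root="):]
        elif token.startswith("boot.slot_suffix="):
            suffix = token[len("boot.slot_suffix="):]
    if root:
        return root
    return ROOTFS_BY_SUFFIX[suffix]


print(pick_rootfs("console=tty0 boot.slot_suffix=_b"))
print(pick_rootfs("console=tty0 boot.slot_suffix=_b root=/dev/mmcblk0p1"))
```

The second call shows the failure mode to watch for: with a fixed root= on the command line, the slot suffix no longer has any effect on which filesystem gets mounted.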
@madisongh sorry, I added in some comments to cboot in order to try to get a better idea what it was doing.
The logic for extracting the boot slot suffix from the kernel command line (which cboot is responsible for adding)
Oh, man, that may be it. I haven't mucked with the initramfs directory. But I have a cboot/extlinux.conf setup where I load different dtb files based on camera discovery in cboot. Within my /boot/extlinux/extlinux.conf I have
APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0
this is probably overwriting the root partition name.
I probably just copied this line when I first made my extlinux file from an example and forgot about it. Maybe I can just remove this whole line from the extlinux.conf selections?
this is probably overwriting the root partition name.
It is indeed.
I probably just copied this line when I first made my extlinux file from an example and forgot about it. Maybe I can just remove this whole line from the extlinux.conf selections?
Yes.
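For anyone auditing their own configuration for the same problem, a small Python sketch like the following could flag APPEND lines in an extlinux.conf that carry a hard-coded root= argument. The function names are made up for this example, and strip_root shows only one possible cleanup (dropping just the root= token). Removing the whole APPEND line, as discussed above, is the other option.

```python
import re


def find_hardcoded_root(conf_text):
    """Return (line_number, line) pairs for APPEND lines containing root=."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.upper().startswith("APPEND") and re.search(r"(^|\s)root=\S+", stripped):
            hits.append((lineno, stripped))
    return hits


def strip_root(line):
    """Drop the root=... token from a line; 'rootwait' and 'rootfstype=' are untouched."""
    return re.sub(r"\s*\broot=\S+", "", line)


# Sample entry; only the APPEND line is taken from this thread, the
# LABEL line is a placeholder.
sample_conf = (
    "LABEL primary\n"
    "    APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4\n"
)

for lineno, line in find_hardcoded_root(sample_conf):
    print(f"line {lineno}: {line}")
    print("without root=:", strip_root(line))
```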
@arrow53 Did removing the root setting from your extlinux.conf file fix the problem?
Oh, I'm sorry. I forgot to close this. Yes, it did.
I'm using this repo with a Xavier NX.
I'd say maybe 1 out of 5-10 times my mender -install update won't switch partitions. I can pop out my SD card and see the rootfs A/B images, and I can tell they are different because I keep a file in the filesystem that holds the version of my image. Even manually changing the slots with nvbootctrl set-active-boot-slot doesn't seem to do anything.
Any tips on how to debug this? Or, things I could be doing incorrectly?