Closed rbuchmann77 closed 1 year ago
Which of the Jetsons are you using? If it's the TX2, which bootloader?
We are using TX2NX. We use the default bootloader: cboot.
OK. And you're saying that while tegra-boot-control -s
reports slot 0 as current, the kernel command line is showing boot.slot_suffix=_b
(or vice-versa, slot 1 as current, and null string as the slot suffix)? Or is it just your mounted rootfs is on the wrong partition, and if so, how is that rootfs partition determined during boot? Does your kernel command line have an explicit root=
setting?
The default setup here, for cboot-only (no U-boot) builds, should have no root=
setting in the kernel command line, and the initrd should be locating the rootfs based on the boot.slot_suffix
kernel arg (appending the value of that to APP
, then locating the partition with the resulting name). So the rootfs location should always be derived from the boot slot reported by cboot. Since cboot adds the boot.slot_suffix
kernel arg based on the current boot slot,
Are you using the default setup from the layer, or a custom setup of some kind?
I asked about the bootloader because the U-Boot case works differently, making it easier for the boot slot and rootfs selection to get out of sync.
I would say: When tegra-boot-control -s reports slot 0, we got boot.slot_suffix=. The files I checked where come from slot 1. I guess mounting point was wrong but I did not think about this checking.
One sample: console=ttyS0,115200 console=tty0 fbcon=map:0 isolcpus=1-2 rootwait usbcore.autosuspend=-1 ro video=tegrafb earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x1772e0000 gpt rootfs.slot_suffix= tegra_fbmem=0x1c0000@0x96088000 lut_mem=0x2008@0x96084000 usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 no_console_suspend boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x175840000 sdhci_tegra.en_boot_part_access=1 systemd.machine_id=9fa6f1e781f649019827180de5b897e8
So we have no root= setting.
We are using meta-tegra, meta-tegra-community and our custom layer on top of this. There is no changes about cboot or bootloader, except a few variables like rootwait and usbcore.
OK, then you should zero in on the logic in the initrd for choosing the partition to mount as the rootfs. If you have a serial console attached, you could add set -x
to the init script in the initrd (e.g., the init-boot.sh
script installed by this recipe unless you have a custom one in your layer). Capture the serial console output to a file through your multiple reboots, and when the problem occurs, take a look to see what happened to cause the incorrect mount.
Hi,
Thank you for the reply. I got logs from serial console and I guess it fails to boot on B so, after multiple retries, it boot on A.
[0000.136] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-e75fdd51)
[0000.145] I> Boot-device: eMMC
[0000.152] I> sdmmc bdev is already initialized
[0000.157] I> pmic: reset reason (nverc) : 0x0
[0000.190] I> Found 18 partitions in SDMMC_BOOT (instance 3)
[0000.210] I> Found 34 partitions in SDMMC_USER (instance 3)
[0000.216] I> A/B: bin_type (16) slot 1
[0000.219] I> Loading partition bpmp-fw_b at 0xd7800000
[0000.224] I> Reading two headers - addr:0xd7800000 blocks:1
[0000.230] I> Addr: 0xd7800000, start-block: 13294784, num_blocks: 1
[0000.255] I> Binary(16) of size 534416 is loaded @ 0xd7800000
[0000.260] I> A/B: bin_type (17) slot 1
[0000.264] I> Loading partition bpmp-fw-dtb_b at 0xd79f0000
[0000.269] I> Reading two headers - addr:0xd79f0000 blocks:1
[0000.275] I> Addr: 0xd79f0000, start-block: 13298040, num_blocks: 1
[0000.294] I> Binary(17) of size 113200 is loaded @ 0xd79e4400
[0000.322] I> Loading SCE-FW ...
[0000.325] I> A/B: bin_type (12) slot 1
[0000.329] I> Loading partition sce-fw_b at 0xd7300000
[0000.334] I> Reading two headers - addr:0xd7300000 blocks:1
[0000.339] I> Addr: 0xd7300000, start-block: 13304184, num_blocks: 1
[0000.358] I> Binary(12) of size 125632 is loaded @ 0xd7300000
[0000.364] I> Init SCE
[0000.366] I> Loading APE-FW ...
[0000.369] I> A/B: bin_type (11) slot 1
[0000.373] I> Loading partition adsp-fw_b at 0xd7400000
[0000.378] I> Reading two headers - addr:0xd7400000 blocks:1
[0000.383] I> Addr: 0xd7400000, start-block: 13285384, num_blocks: 1
[0000.402] I> Binary(11) of size 77216 is loaded @ 0xd7400000
[0000.408] I> Copy BTCM section
[0000.411] I> A/B: bin_type (13) slot 1
[0000.415] I> Loading partition cpu-bootloader_b at 0x96000000
[0000.420] I> Reading two headers - addr:0x96000000 blocks:1
[0000.426] I> Addr: 0x96000000, start-block: 13257736, num_blocks: 1
[0000.447] I> Binary(13) of size 319920 is loaded @ 0x96000000
[0000.453] I> A/B: bin_type (20) slot 1
[0000.456] I> Loading partition bootloader-dtb_b at 0x85205400
[0000.462] I> Reading two headers - addr:0x85205400 blocks:1
[0000.467] I> Addr: 0x85205400, start-block: 13259784, num_blocks: 1
[0000.487] I> Binary(20) of size 195712 is loaded @ 0x85205400
[0000.493] I> A/B: bin_type (14) slot 1
[0000.496] I> Loading partition secure-os_b at 0x85305600
[0000.501] I> Reading two headers - addr:0x85305600 blocks:1
[0000.507] I> Addr: 0x85305600, start-block: 13266952, num_blocks: 1
[0000.529] I> Binary(14) of size 402864 is loaded @ 0x85305600
[0000.537] I> TOS boot-params @ 0x85000000
[0000.541] I> TOS params prepared
[0000.544] I> Loading EKS ...
[0000.546] I> A/B: bin_type (15) slot 0
[0000.550] I> Loading partition eks at 0x85905800
[0000.554] I> Reading two headers - addr:0x85905800 blocks:1
[0000.560] I> Addr: 0x85905800, start-block: 13273096, num_blocks: 1
[0000.579] I> Binary(15) of size 1040 is loaded @ 0x85905800
[0000.584] I> EKB detected (length: 0x400) @ 0x85905800
[0000.589] I> Copied encrypted keys
[0000.593] I> boot profiler @ 0x175844000
[0000.597] I> boot profiler for TOS @ 0x175844000
[0000.602] I> Unhalting SCE
[0000.604] I> Primary Memory Start:80000000 Size:70000000
[0000.609] I> Extended Memory Start:f0110000 Size:856f0000
[0000.616] I> MB2(TBoot-BPMP) done
Is that correct?
That doesn't look like a failover; if the bootloaders decide to fail over, you should see messages about the slot switch happening, and all of the binaries should get loaded out of the other slot. It's definitely odd that nvtboot is loading the slot A EKB, though (I'm assuming you have a populated eks_b
partition)... can you share more of the console logs?
Hi,
Slot B was valid and working.
When the problem occurs (eq. system booted on slot A instead of slot B), I do the following: 1) I do no changes on rootfs A or B. 2) I run "reboot" 3) Next reboot, the system booted on slot B as expected.
I did the following test to get more logs: 1) I select slot A using tegra-boot-control. 2) I reboot 100 times: the problem did not happen. 3) I select slot B using tegra-boot-control 4) I reboot 21 times and the problem occured.
Here test scripts:
while ssh tx2-yocto mount | grep " / " | grep "0p1 " ; do ssh tx2-yocto systemctl reboot -ff ; sleep 30 ; done
while ssh tx2-yocto mount | grep " / " | grep "0p33 " ; do ssh tx2-yocto systemctl reboot -ff ; sleep 30 ; done
In the following logs, there are several boots. Only the last one is wrong.
Unfortunately, the critical part of the log that would show that that last one is wrong is missing. The final kernel command line is correct (boot.slot_suffx=_b
), there are some early kernel logs, then ...
and a login prompt. The Mounting /dev/XXXXX
that is printed out by the init script in the initrd wasn't included. So it's still not clear what the actual problem is here.
Another bug seems to be related. When the slot problem occurs, mender install fails. If I reboot (so boot on the right slot as expected), mender install succeed.
Logs of mender install:
mount: /tmp/nvbup.C71Hre: /dev/mmcblk0p1 already mounted on /.
ERR: Missing /opt/ota_package directory in installed rootfs
---------- end of script output
ERRO[0078] ArtifactInstall_Leave script failed: statescript: error executing 'ArtifactInstall_Leave_80_bl-update': 1 : exit status 1
Rolling back Artifact...
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary
INFO[0078] Rolling back to the active partition: (1).
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary
ERRO[0078] statescript: error executing 'ArtifactInstall_Leave_80_bl-update': 1 : exit status 1
I will provide more logs on the main problem tomorrow.
mount: /tmp/nvbup.C71Hre: /dev/mmcblk0p1 already mounted on /.
ERR: Missing /opt/ota_package directory in installed rootfs
I think that's the expected failure if your mender uboot vars are out of sync with reality. See https://github.com/OE4T/tegra-demo-distro/issues/244#issuecomment-1402683719
Here logs of the last occurence.
That's a good capture, thanks. From this rather scrambled-looking output
_open: B[ 8.196377] mmcblk0boot0: mmc0:0001 DG4016 partition 1 4.00 MiB
ad file [ 8.200488] mmcblk0boot1: mmc0:0001 DG4016 partition 2 4.00 MiB
descript[ 8.204585] mmcblk0rpmb: mmc0:0001 DG4016 partition 3 4.00 MiB
or
gpt_init: No such file or direct[ 8.230187] mmcblk1: mmc1:0001 DG4008 7.28 GiB
ory
[ 8.232838] mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21 p22 p23 p24 p25 p26 p27 p28 p29 p30 p31 p32 p33 p34
it looks like we have a race between the init script executing and the kernel finishing its probing of the block storage devices. That's something we would not see in a stock configuration, but it looks like you have an SDcard on mmc1, and a USB flash drive as well, which is adding some delay to the probe completion.
So my guess is that the init script in the initrd is running before the mmcblk0, mmcblk0boot0, and mmcblk0boot1 devices are ready for use, once in a while. That's would explain why the bootcountcheck
script issued that (broken up) error message, and is likely the reason why the slot check is failing, causing the fallback to the mmcblk0p1
default. Customizing the initrd script to wait until those devices are present before sourcing the platform-preboot
script should take care of the problem.
You are right: I have both addionnal eMMC and a USB stick. You gave a very useful hint to solve this issue.
Thank you Madisongh. This analysis is very clear.
Wrong slot A/B This bug is random at a low occurence (i don't know the rate). Sometimes we boot on slot A but the rootfs is from the slot B (and vice versa).
The following commands display wrong informations: tegra-boot-control -s cat /proc/cmdline
In other words, it tells we booted on slot A as expected but... the rootfs is from the slot B. If we reboot (with no changes), we boot on slot A as expected and the rootfs is correct.
Impacted versions:
To Reproduce Steps to reproduce the behavior: