OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License
394 stars 220 forks source link

Wrong slot A/B #1142

Closed rbuchmann77 closed 1 year ago

rbuchmann77 commented 1 year ago

Wrong slot A/B This bug is random at a low occurence (i don't know the rate). Sometimes we boot on slot A but the rootfs is from the slot B (and vice versa).

The following commands display wrong informations: tegra-boot-control -s cat /proc/cmdline

In other words, it tells we booted on slot A as expected but... the rootfs is from the slot B. If we reboot (with no changes), we boot on slot A as expected and the rootfs is correct.

Impacted versions:

To Reproduce Steps to reproduce the behavior:

  1. Install 1 different version for the 2 slots
  2. Boot
  3. Check a file that is different for the 2 versions.
  4. Repeat steps 2-3 until the bug appears.
madisongh commented 1 year ago

Which of the Jetsons are you using? If it's the TX2, which bootloader?

rbuchmann77 commented 1 year ago

We are using TX2NX. We use the default bootloader: cboot.

madisongh commented 1 year ago

OK. And you're saying that while tegra-boot-control -s reports slot 0 as current, the kernel command line is showing boot.slot_suffix=_b (or vice-versa, slot 1 as current, and null string as the slot suffix)? Or is it just your mounted rootfs is on the wrong partition, and if so, how is that rootfs partition determined during boot? Does your kernel command line have an explicit root= setting?

The default setup here, for cboot-only (no U-boot) builds, should have no root= setting in the kernel command line, and the initrd should be locating the rootfs based on the boot.slot_suffix kernel arg (appending the value of that to APP, then locating the partition with the resulting name). So the rootfs location should always be derived from the boot slot reported by cboot. Since cboot adds the boot.slot_suffix kernel arg based on the current boot slot,

Are you using the default setup from the layer, or a custom setup of some kind?

I asked about the bootloader because the U-Boot case works differently, making it easier for the boot slot and rootfs selection to get out of sync.

rbuchmann77 commented 1 year ago

I would say: When tegra-boot-control -s reports slot 0, we got boot.slot_suffix=. The files I checked where come from slot 1. I guess mounting point was wrong but I did not think about this checking.

One sample: console=ttyS0,115200 console=tty0 fbcon=map:0 isolcpus=1-2 rootwait usbcore.autosuspend=-1 ro video=tegrafb earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x1772e0000 gpt rootfs.slot_suffix= tegra_fbmem=0x1c0000@0x96088000 lut_mem=0x2008@0x96084000 usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 no_console_suspend boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x175840000 sdhci_tegra.en_boot_part_access=1 systemd.machine_id=9fa6f1e781f649019827180de5b897e8

So we have no root= setting.

We are using meta-tegra, meta-tegra-community and our custom layer on top of this. There is no changes about cboot or bootloader, except a few variables like rootwait and usbcore.

madisongh commented 1 year ago

OK, then you should zero in on the logic in the initrd for choosing the partition to mount as the rootfs. If you have a serial console attached, you could add set -x to the init script in the initrd (e.g., the init-boot.sh script installed by this recipe unless you have a custom one in your layer). Capture the serial console output to a file through your multiple reboots, and when the problem occurs, take a look to see what happened to cause the incorrect mount.

rbuchmann77 commented 1 year ago

Hi,

Thank you for the reply. I got logs from serial console and I guess it fails to boot on B so, after multiple retries, it boot on A.

[0000.136] I> Welcome to MB2(TBoot-BPMP)(version: 01.00.160913-t186-M-00.00-mobile-e75fdd51)
[0000.145] I> Boot-device: eMMC
[0000.152] I> sdmmc bdev is already initialized
[0000.157] I> pmic: reset reason (nverc)        : 0x0
[0000.190] I> Found 18 partitions in SDMMC_BOOT (instance 3)
[0000.210] I> Found 34 partitions in SDMMC_USER (instance 3)
[0000.216] I> A/B: bin_type (16) slot 1
[0000.219] I> Loading partition bpmp-fw_b at 0xd7800000
[0000.224] I> Reading two headers - addr:0xd7800000 blocks:1
[0000.230] I> Addr: 0xd7800000, start-block: 13294784, num_blocks: 1
[0000.255] I> Binary(16) of size 534416 is loaded @ 0xd7800000
[0000.260] I> A/B: bin_type (17) slot 1
[0000.264] I> Loading partition bpmp-fw-dtb_b at 0xd79f0000
[0000.269] I> Reading two headers - addr:0xd79f0000 blocks:1
[0000.275] I> Addr: 0xd79f0000, start-block: 13298040, num_blocks: 1
[0000.294] I> Binary(17) of size 113200 is loaded @ 0xd79e4400
[0000.322] I> Loading SCE-FW ...
[0000.325] I> A/B: bin_type (12) slot 1
[0000.329] I> Loading partition sce-fw_b at 0xd7300000
[0000.334] I> Reading two headers - addr:0xd7300000 blocks:1
[0000.339] I> Addr: 0xd7300000, start-block: 13304184, num_blocks: 1
[0000.358] I> Binary(12) of size 125632 is loaded @ 0xd7300000
[0000.364] I> Init SCE
[0000.366] I> Loading APE-FW ...
[0000.369] I> A/B: bin_type (11) slot 1
[0000.373] I> Loading partition adsp-fw_b at 0xd7400000
[0000.378] I> Reading two headers - addr:0xd7400000 blocks:1
[0000.383] I> Addr: 0xd7400000, start-block: 13285384, num_blocks: 1
[0000.402] I> Binary(11) of size 77216 is loaded @ 0xd7400000
[0000.408] I> Copy BTCM section
[0000.411] I> A/B: bin_type (13) slot 1
[0000.415] I> Loading partition cpu-bootloader_b at 0x96000000
[0000.420] I> Reading two headers - addr:0x96000000 blocks:1
[0000.426] I> Addr: 0x96000000, start-block: 13257736, num_blocks: 1
[0000.447] I> Binary(13) of size 319920 is loaded @ 0x96000000
[0000.453] I> A/B: bin_type (20) slot 1
[0000.456] I> Loading partition bootloader-dtb_b at 0x85205400
[0000.462] I> Reading two headers - addr:0x85205400 blocks:1
[0000.467] I> Addr: 0x85205400, start-block: 13259784, num_blocks: 1
[0000.487] I> Binary(20) of size 195712 is loaded @ 0x85205400
[0000.493] I> A/B: bin_type (14) slot 1
[0000.496] I> Loading partition secure-os_b at 0x85305600
[0000.501] I> Reading two headers - addr:0x85305600 blocks:1
[0000.507] I> Addr: 0x85305600, start-block: 13266952, num_blocks: 1
[0000.529] I> Binary(14) of size 402864 is loaded @ 0x85305600
[0000.537] I> TOS boot-params @ 0x85000000
[0000.541] I> TOS params prepared
[0000.544] I> Loading EKS ...
[0000.546] I> A/B: bin_type (15) slot 0
[0000.550] I> Loading partition eks at 0x85905800
[0000.554] I> Reading two headers - addr:0x85905800 blocks:1
[0000.560] I> Addr: 0x85905800, start-block: 13273096, num_blocks: 1
[0000.579] I> Binary(15) of size 1040 is loaded @ 0x85905800
[0000.584] I> EKB detected (length: 0x400) @ 0x85905800
[0000.589] I> Copied encrypted keys
[0000.593] I> boot profiler @ 0x175844000
[0000.597] I> boot profiler for TOS @ 0x175844000
[0000.602] I> Unhalting SCE
[0000.604] I> Primary Memory Start:80000000 Size:70000000
[0000.609] I> Extended Memory Start:f0110000 Size:856f0000
[0000.616] I> MB2(TBoot-BPMP) done

Is that correct?

madisongh commented 1 year ago

That doesn't look like a failover; if the bootloaders decide to fail over, you should see messages about the slot switch happening, and all of the binaries should get loaded out of the other slot. It's definitely odd that nvtboot is loading the slot A EKB, though (I'm assuming you have a populated eks_b partition)... can you share more of the console logs?

rbuchmann77 commented 1 year ago

Hi,

Slot B was valid and working.

When the problem occurs (eq. system booted on slot A instead of slot B), I do the following: 1) I do no changes on rootfs A or B. 2) I run "reboot" 3) Next reboot, the system booted on slot B as expected.

I did the following test to get more logs: 1) I select slot A using tegra-boot-control. 2) I reboot 100 times: the problem did not happen. 3) I select slot B using tegra-boot-control 4) I reboot 21 times and the problem occured.

Here test scripts:

while ssh tx2-yocto mount | grep " / " | grep "0p1 " ; do ssh tx2-yocto systemctl reboot -ff ; sleep 30 ; done
while ssh tx2-yocto mount | grep " / " | grep "0p33 " ; do ssh tx2-yocto systemctl reboot -ff ; sleep 30 ; done

In the following logs, there are several boots. Only the last one is wrong.

rbuchmann77 commented 1 year ago

minicom_slotA_instead_of_B.log

madisongh commented 1 year ago

Unfortunately, the critical part of the log that would show that that last one is wrong is missing. The final kernel command line is correct (boot.slot_suffx=_b), there are some early kernel logs, then ... and a login prompt. The Mounting /dev/XXXXX that is printed out by the init script in the initrd wasn't included. So it's still not clear what the actual problem is here.

rbuchmann77 commented 1 year ago

Another bug seems to be related. When the slot problem occurs, mender install fails. If I reboot (so boot on the right slot as expected), mender install succeed.

Logs of mender install:

mount: /tmp/nvbup.C71Hre: /dev/mmcblk0p1 already mounted on /.
ERR: Missing /opt/ota_package directory in installed rootfs

---------- end of script output 
ERRO[0078] ArtifactInstall_Leave script failed: statescript: error executing 'ArtifactInstall_Leave_80_bl-update': 1 : exit status 1 
Rolling back Artifact...
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary 
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary 
INFO[0078] Rolling back to the active partition: (1).   
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary 
INFO[0078] Output (stderr) from command "fw_printenv": ERR: no such variable: mender_check_saveenv_canary 
ERRO[0078] statescript: error executing 'ArtifactInstall_Leave_80_bl-update': 1 : exit status 1 

I will provide more logs on the main problem tomorrow.

dwalkes commented 1 year ago
mount: /tmp/nvbup.C71Hre: /dev/mmcblk0p1 already mounted on /.
ERR: Missing /opt/ota_package directory in installed rootfs

I think that's the expected failure if your mender uboot vars are out of sync with reality. See https://github.com/OE4T/tegra-demo-distro/issues/244#issuecomment-1402683719

rbuchmann77 commented 1 year ago

minicom.2023.01.31.log

Here logs of the last occurence.

madisongh commented 1 year ago

That's a good capture, thanks. From this rather scrambled-looking output

_open: B[    8.196377] mmcblk0boot0: mmc0:0001 DG4016 partition 1 4.00 MiB
ad file [    8.200488] mmcblk0boot1: mmc0:0001 DG4016 partition 2 4.00 MiB
descript[    8.204585] mmcblk0rpmb: mmc0:0001 DG4016 partition 3 4.00 MiB
or
gpt_init: No such file or direct[    8.230187] mmcblk1: mmc1:0001 DG4008 7.28 GiB 
ory
[    8.232838]  mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21 p22 p23 p24 p25 p26 p27 p28 p29 p30 p31 p32 p33 p34

it looks like we have a race between the init script executing and the kernel finishing its probing of the block storage devices. That's something we would not see in a stock configuration, but it looks like you have an SDcard on mmc1, and a USB flash drive as well, which is adding some delay to the probe completion.

So my guess is that the init script in the initrd is running before the mmcblk0, mmcblk0boot0, and mmcblk0boot1 devices are ready for use, once in a while. That's would explain why the bootcountcheck script issued that (broken up) error message, and is likely the reason why the slot check is failing, causing the fallback to the mmcblk0p1 default. Customizing the initrd script to wait until those devices are present before sourcing the platform-preboot script should take care of the problem.

rbuchmann77 commented 1 year ago

You are right: I have both addionnal eMMC and a USB stick. You gave a very useful hint to solve this issue.

Thank you Madisongh. This analysis is very clear.