OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License

MB2 missing partition table #1255

Closed oscarthorn closed 1 year ago

oscarthorn commented 1 year ago

Hi!

I'm not sure whether this is a bug in meta-tegra or something hardware-related, but you seem quite knowledgeable, so I figured you might have an idea.

We have some Xavier NX devices running dunfell and meta-tegra @ fd63b94. Some of them, 4 so far, have suddenly stopped booting. As far as we can tell, they had been running fine, and then on a reboot we get the error log included below and they won't start. We tried flashing one of them again and it has been working fine for weeks now, so there does not seem to be any permanent hardware damage or fault to explain it.

We have no idea what's causing it, so it is hard to reproduce, and we are wondering how it can happen at all. The log seems to refer to the QSPI flash, but our assumption was that it would be mostly read-only. Since it is four devices (out of 50), it does not seem to be a one-off fluke.

[0000.438] I> Welcome to MB2(TBoot-BPMP) (version: 00.00.2018.32-mobile-6fc80c72)
[...]
[0000.473] I> Boot-device: QSPI
[0000.476] I> Boot_device: QSPI_FLASH instance: 0
[0000.481] I> QSPI Flash Size = 32 MB
[0000.486] I> Qspi initialized successfully
[0000.487] I> qspi flash-0 params source = boot args
[0000.493] W> Cannot find any partition table for 00030000
[0000.498] I> Active Boot chain : 0
[0000.501] E> Cannot find partition bpmp-fw
[0000.505] E> Partition bpmp-fw not found
[0000.508] I> load/auth: execution failed
[0000.512] E> Top caller module: LOADER, error module: PARTITION_MANAGER, reason: 0x0d, aux_info: 0x00
[0000.521] I> AB warm reset

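
With several units affected, it may help to scan captured serial logs for this failure signature automatically. The following is only a rough sketch: the function names are made up for illustration, and the line format is inferred solely from the log excerpt above, so other MB2 versions may print something different.

```python
import re

# Matches lines such as "[0000.493] W> Cannot find any partition table for 00030000".
# The format is inferred from the excerpt above (an assumption, not a documented spec).
LOG_LINE = re.compile(r"^\[(?P<ts>\d+\.\d+)\]\s+(?P<level>[IWE])>\s+(?P<msg>.*)$")


def parse_boot_log(text):
    """Return (timestamp, level, message) tuples for each recognizable log line."""
    entries = []
    for raw in text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            entries.append((float(match["ts"]), match["level"], match["msg"]))
    return entries


def has_missing_partition_table(entries):
    """True if the partition-table warning is later followed by an error line."""
    warned = False
    for _ts, level, msg in entries:
        if level == "W" and msg.startswith("Cannot find any partition table"):
            warned = True
        elif warned and level == "E":
            return True
    return False


SAMPLE = """\
[0000.486] I> Qspi initialized successfully
[0000.493] W> Cannot find any partition table for 00030000
[0000.498] I> Active Boot chain : 0
[0000.501] E> Cannot find partition bpmp-fw
"""

entries = parse_boot_log(SAMPLE)
print(has_missing_partition_table(entries))
```

Lines that do not match the pattern (such as the `[...]` elision) are skipped rather than treated as errors, so the same function can be run over a full console capture.
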
ichergui commented 1 year ago

Hey @oscarthorn, are you sure that the flashing process went well? Please share the host and target (serial UART) logs.

I'm seeing the following logs:

[0000.501] E> Cannot find partition bpmp-fw
[0000.505] E> Partition bpmp-fw not found
[0000.508] I> load/auth: execution failed

BPMP is a key component. Also, is secure boot enabled on your device?

oscarthorn commented 1 year ago

Hi!

Thanks for the response!

Yes, to clarify, this specific device had been running without issue for several weeks. So unfortunately I can't share the flashing logs, we don't have them saved since there did not seem to be an issue with the device.

Yes, we have both secure boot and encryption as well as a/b partitions enabled.

ichergui commented 1 year ago

Please make sure that you are using the right SBK and PKC keys. I don't have secure boot enabled on my Jetson Xavier NX, but I will try the branch you mentioned to double-check that everything is working as expected.

ichergui commented 1 year ago

@oscarthorn Is this a new hardware module? If so, please check the FAB and BoardSKU.

oscarthorn commented 1 year ago

Would wrong keys not manifest immediately on first boot? How do I verify the keys on a system like this that does not boot? Can I read them out somehow? I think the keys are correct: according to our logs, it was flashed with the correct keys (and flashing it again works fine).

Thanks, though I'm not sure you will get any error. We have several dozen more units that all work fine, and even the faulty units worked fine after flashing and only later ended up in this state, after roughly 1-8 weeks of use.

I don't think so. I would have to check which one it is exactly, but it would be one of these. The board is with another developer right now, so I'll ask him to check.

TEGRA_BUPGEN_SPECS ?= " \
                fab=100;boardsku=0000;boardrev= \
                fab=200;boardsku=0000;boardrev= \
                fab=300;boardsku=0000;boardrev= \
                fab=301;boardsku=0000;boardrev= \
                fab=100;boardsku=0001;boardrev= \
                fab=200;boardsku=0001;boardrev= \
                fab=300;boardsku=0001;boardrev= \
                fab=301;boardsku=0001;boardrev= \
                fab=200;boardsku=0003;boardrev= \
                fab=300;boardsku=0003;boardrev= \
                fab=301;boardsku=0003;boardrev= \
"

ichergui commented 1 year ago

Yes, please do and let me know

oscarthorn commented 1 year ago

Thanks, will do! I'll get back to you in a couple of days; he was not at home (we have a long weekend in Sweden).

oscarthorn commented 1 year ago

@ichergui This is the boardspec for one of the faulty modules: 3668-301-0003-B.0-1-2
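
For reference, a quick way to check that boardspec against the `TEGRA_BUPGEN_SPECS` list posted earlier might look like the sketch below. It rests on two assumptions: that the boardspec fields are ordered board ID, FAB, board SKU, board revision, fuse level, chip revision, and that an empty `boardrev=` in a spec entry matches any revision. The helper names are hypothetical and not part of meta-tegra.

```python
from typing import NamedTuple


class BoardSpec(NamedTuple):
    # Field order is an assumption about how the boardspec string is laid out.
    boardid: str
    fab: str
    boardsku: str
    boardrev: str
    fuselevel: str
    chiprev: str


def parse_boardspec(spec):
    """Split a string like '3668-301-0003-B.0-1-2' into named fields."""
    parts = spec.split("-")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dash-separated fields, got {len(parts)}")
    return BoardSpec(*parts)


def parse_bupgen_specs(value):
    """Turn whitespace-separated 'key=value;key=value' entries into dicts."""
    return [
        dict(item.split("=", 1) for item in entry.split(";"))
        for entry in value.split()
    ]


def is_covered(board, specs):
    """True if some entry matches FAB and SKU; empty boardrev acts as a wildcard (assumed)."""
    return any(
        s.get("fab") == board.fab
        and s.get("boardsku") == board.boardsku
        and s.get("boardrev", "") in ("", board.boardrev)
        for s in specs
    )


# The entries from the TEGRA_BUPGEN_SPECS value quoted earlier in this thread.
BUPGEN_SPECS = """
fab=100;boardsku=0000;boardrev= fab=200;boardsku=0000;boardrev=
fab=300;boardsku=0000;boardrev= fab=301;boardsku=0000;boardrev=
fab=100;boardsku=0001;boardrev= fab=200;boardsku=0001;boardrev=
fab=300;boardsku=0001;boardrev= fab=301;boardsku=0001;boardrev=
fab=200;boardsku=0003;boardrev= fab=300;boardsku=0003;boardrev=
fab=301;boardsku=0003;boardrev=
"""

board = parse_boardspec("3668-301-0003-B.0-1-2")
specs = parse_bupgen_specs(BUPGEN_SPECS)
print(board.fab, board.boardsku, is_covered(board, specs))
```

Under those assumptions the module parses to FAB 301 and SKU 0003, which corresponds to the last entry in the list.
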

madisongh commented 1 year ago

@oscarthorn Did you figure this out?

oscarthorn commented 1 year ago

@madisongh Yes, it turns out it was this issue: https://github.com/OE4T/tegra-boot-tools/issues/20. At least we think that's the cause; it's a bit hard to be 100% sure. We only use an M.2 SSD and have disabled the eMMC, so it was falling back to QSPI for boot-related storage. We have updated tegra-boot-tools and are hoping that solves the issue.