Open acostach opened 2 years ago
That it's triggering an exception like that is probably a bug - maybe not properly handling the case where no boot device is found? Considering how complicated the t19x boot sequence has become as they've tacked on support for NVMe, extlinux, CBO, etc., it wouldn't surprise me that some infrequently-exercised code path is broken.
Still, bugs happen, and cboot should have a watchdog timer enabled by default, unless you've configured otherwise. I don't remember the default timeout off the top of my head, but it's on the order of minutes.
Thanks Matt, I left it for about 5 minutes and nothing happened but I'll reproduce this again and leave it for longer. I was reading in the docs that the PMIC WDT is disabled by default from odmdata on the Xavier NX and that's why I was asking. I'll get back with an update.
@madisongh Unfortunately the device did not restart after 40 minutes in this state. Perhaps that watchdog is not enabled by default?
Sure enough, the default ODMDATA (0xB8190000) disables the PMIC WDT. I thought they at least enabled the internal WDT in the processor, but perhaps not. Try flashing with ODMDATA set to 0xB81A0000; that should enable the PMIC WDT, if I'm reading the cboot sources correctly.
Hi @madisongh, unfortunately the PMIC WDT doesn't appear to work with systemd: the device reboots as soon as it finishes booting. We're using systemd to kick the watchdog with RuntimeWatchdogSec set to 10s. This doesn't happen with the PMIC WDT disabled in the previous ODMDATA though.
Ah, right, you may need to modify your device tree and/or kernel config and/or systemd configuration to switch to using the PMIC WDT from Linux, too. I believe the driver is enabled by default in the default kernel config (CONFIG_MAX77620_WATCHDOG), but it may not be enabled by default in the device tree. Even if it is, if there are multiple watchdog devices, systemd uses /dev/watchdog0 by default, which could be the wrong one.
@madisongh yes, the config is enabled and from what I see the plugin manager also enables it in the device tree based on the selected odmdata.
There are no other watchdog devices created (only watchdog and watchdog0), and if I remove the watchdog.conf file, systemd no longer opens the device and the board doesn't reboot. If, however, I do a cat on /dev/watchdog, the board is rebooted immediately.
Now that you mention it, I vaguely remember running into a similar problem on the TX2 a while back. IIRC there was a bug somewhere that kept the SoC internal WDT around even when I chose the PMIC WDT. Looking at my kernel config, I think the solution was to just disable the support for the other WDTs there - here's my pmic-watchdog-only.cfg fragment:
# CONFIG_TEGRA21X_WATCHDOG is not set
# CONFIG_TEGRA18X_WATCHDOG is not set
# CONFIG_SOFT_PLATFORM_WATCHDOG is not set
CONFIG_MAX77620_WATCHDOG=y
I don't remember now whether the bug was in cboot or the kernel, but there could be something similar going on here.
Many thanks @madisongh! Indeed, disabling the other tegra watchdog modules made the max77620-watchdog work with systemd.
Hi @madisongh, unfortunately enabling the PMIC WDT did not solve the original cboot problem. I managed to reproduce the issue once more and left the board running for 30 minutes, but the PMIC did not reset it.
The easiest way to reproduce the watchdog not resetting the device is to flash the NX SD with that odmdata and then remove the sd-card or any other medium from which it could boot before powering it on.
I guess that maybe cboot doesn't actually start this watchdog?
I see it's overriding some nodes in the dtb but am not sure if they have anything to do with this:
[0001.653] I> Plugin-manager override starting
[0001.658] I> node /plugin-manager/fragment-pcie-c5-rp matches
[0001.666] I> node /plugin-manager/fragement-pmic-wdt-en matches
[0001.670] I> node /plugin-manager/fragement-tegra-wdt-dis matches
[0001.676] I> node /plugin-manager/fragement-tegra-sdhci-emmc-dis matches
[0001.684] I> Disable plugin-manager status in FDT
[0001.686] I> Plugin-manager override finished successfully
Hmm. I took a closer look at the cboot code, and sure enough, it disables the WDT as part of the kernel handoff. I don't know if that's a recent change, or whether I was misremembering how it worked, but that surprised me. :(
Also, it turns out that cboot always uses the internal WDT, even if you've chosen the PMIC WDT via ODMDATA.
Some additional patches to cboot will be needed to address these shortcomings.
Thanks again for looking into this @madisongh ! I see that tegrabl_reset() makes the boot sequence restart from MB1, at least on the NX devkit, and am wondering if resetting instead of halting is a good idea. I'll try to reproduce the problem to see if cboot is able to load the kernel and dtb from the raw partitions after reset with this:
+++ b/bootloader/partner/t18x/cboot/platform/tegra_shared/debug.c
@@ -15,6 +15,7 @@
#include <platform_c.h>
#include <printf.h>
#include <tegrabl_timer.h>
+#include <tegrabl_exit.h>
#include <tegrabl_debug.h>
#if defined(CONFIG_DEBUG_TIMESTAMP)
@@ -91,7 +92,9 @@ int platform_dgetc(char *c, bool wait)
void platform_halt(void)
{
- dprintf(ALWAYS, "HALT: spinning forever...\n");
+ dprintf(ALWAYS, "Will reset in 10 seconds...\n");
+ tegrabl_mdelay(10 * 1000);
+ tegrabl_reset();
Yes, looks like doing the tegrabl_reset() after the panic caused by sd-card partitions read failure makes the nx boot normally:
[0006.571] x24 0x 0 x25 0x 0 x26 0x 0 x27 0x 0
[0006.580] x28 0x 0 x29 0x a06a8160 lr 0x a060f7f4 sp 0x a06a7f30
[0006.589] elr 0x 0
[0006.592] spsr 0x 400003c9
[0006.595] -----------------------------------------------
[0006.600] panic (caller 0xa0601238): die
[0006.604] Will reset in 10 seconds...
[0016.607] E> tegrabl_display_shutdown: display is not initialized
Shutdown state requested 1
Rebooting system ...
[0000.024] W> RATCHET: MB1 binary ratchet value 4 is too large than ratchet level 2 from HW fuses.
[0000.033] I> MB1 (prd-version: 1.5.1.7-t194-41334769-98030a79)
[0000.038] I> Boot-mode: Coldboot
[0000.041] I> Chip revision : A02P
[0000.044] I> Bootrom patch version : 15 (correctly patched)
[0000.049] I> ATE fuse revision : 0x200
[0000.053] I> Ram repair fuse : 0x0
[0000.056] I> Ram Code : 0x0
[0000.058] I> rst_source : 0xb
[0000.061] I> rst_level : 0x1
[0000.065] I> Boot-device: QSPI
[0000.067] I> Qspi flash params source = brbct
@acostach Great, that looks like a much simpler way to solve the problem you're seeing. Please leave this issue open, though, as I will try and solve the problems with cboot's WDT handling, when I get a chance.
Not sure if this has been happening with older L4Ts, but I've noticed this sporadic panic in cboot 32.6.1; it happens occasionally after rebooting the device multiple times:
I guess the sd-card is sometimes in a bad state, so I'm wondering if it would be possible to configure cboot to simply reboot in this case instead of halting? Thank you