espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.7k stars 7.3k forks source link

Add the possibility to register a custom panic handler callback to intercept app crashes (IDFGH-3146) #5163

Closed leofabri closed 4 years ago

leofabri commented 4 years ago

Feature request:

I wish I could set a custom panic handler callback from my main application in order to get some additional info and intercept crashes from my main component.

What I'd like to obtain

I'm not sure if there is another and better way to do what I'm trying to achieve. I'm currently working on esp-idf v4.0 and it looks like that this feature is still missing even in the future releases currently under development.

Applying these changes to each update is time-consuming, and maybe other users could benefit from this. I would love to have it out of the box.

My current solution. Tested and working

I have edited the esp32 and freertos components in order to get the described feature, however, I wish you could consider adding something like this as part of the framework.


Changes (additions) made to esp-idf v4.0 to get the feature in question:

Notice: the links pointing to the files I've changed are an indicative reference of where the change was applied.

esp-idf/components/esp32/Kconfig:

config ESP32_PANIC_CALLBACK
        bool "Support registration of a user defined callback for the panic handler"
        default y
        help
            Use xt_set_error_handler_callback() to register a custom callback.
            The callback is called by the common error handler so catches exceptions,
            panics and abort() calls.

esp-idf/components/esp32/panic.c:

#if CONFIG_ESP32_PANIC_CALLBACK
/*
* Custom error handler callback registration.
*/
xt_error_handler_callback customErrorHandler = NULL;
xt_error_handler_callback xt_set_error_handler_callback(xt_error_handler_callback f)
{
  xt_error_handler_callback old = customErrorHandler;
  customErrorHandler = f;
  return old;
}
#endif //CONFIG_ESP32_PANIC_CALLBACK

esp-idf/components/esp32/panic.c commonErrorHandler():

    #if CONFIG_ESP32_PANIC_CALLBACK
    if (customErrorHandler) {
        disableAllWdts();
        customErrorHandler(frame, core_id, abort_called);
        reconfigureAllWdts();
    }
    #endif

components/freertos/include/freertos/xtensa_api.h:

/* Typedef for C-callable error handler callback function */
typedef void (*xt_error_handler_callback)(XtExcFrame *, int core_id, bool is_abort);

/*
-------------------------------------------------------------------------------
Call this function to set a callback for the standard error handler.
The callback will be called by the commonErrorHandler on all errors.

    f        - Callback function address, NULL to uninstall callback.

The callback will be passed a pointer to the exception frame, which is created
on the stack of the thread that caused the exception, the core id and
a bool to signal if abort() has been called.

The callback is called with watchdogs disabled.
-------------------------------------------------------------------------------
*/
extern xt_error_handler_callback xt_set_error_handler_callback(xt_error_handler_callback f);

Thank you in advance.

Alvin1Zhang commented 4 years ago

@leofabri Thanks for raising the feature request, we will evaluate.

igrr commented 4 years ago

@leofabri Is it important to you at which stage in the panic handler your callback is executed? For example, if we add the custom handler at the same stage as GDB Stub / Core dump, would that work for you?

The other thing, I wouldn't suggest running the callback with watchdogs disabled, since at this point the watchdog remains the only thing that prevents us from getting stuck inside the panic handler. We also need to add some logic for not re-entering the custom callback if it crashes and we re-enter the panic handler.

Finally, if we implement this change, we will only do this on the current master branch, where the panic handler code is quite different from v4.0. We usually don't backport such features to older releases. If you do need the feature specifically on v4.0, I'm afraid you will have to keep using your local patch.

leofabri commented 4 years ago

Hello @igrr first of all, thank you for getting back to me. I'm currently using this patch but I wasn't the one that first thought of it; therefore, the code I showed above is basically the original one and may not be perfect.

Regarding the things you pointed out:

  1. I think that adding the custom error handler between GDB Stub and Core dump would be totally fine because I don't need to do anything with a higher priority.

My intention is to use a callback like this:

void Boot::MyErrorCallback(XtExcFrame *frame, int core_id, bool is_abort)
{
  boot_data.crash_data.core_id = core_id;
  boot_data.crash_data.is_abort = is_abort;

  // Save registers:
  for (int i = 0; i < 24; i++)
    boot_data.crash_data.reg[i] = ((uint32_t *)frame)[i + 1];

// Save backtrace:
// (see panic.c::doBacktrace() for code template)
#define _adjusted_pc(pc) (((pc)&0x80000000) ? (((pc)&0x3fffffff) | 0x40000000) : (pc))
  uint32_t i = 0, pc = frame->pc, sp = frame->a1;
  boot_data.crash_data.bt[0].pc = _adjusted_pc(pc);
  pc = frame->a0;
  while (++i < APP_BT_LEVELS)
  {
    uint32_t psp = sp;
    if (!esp_stack_ptr_is_sane(sp))
      break;
    sp = *((uint32_t *)(sp - 0x10 + 4));
    boot_data.crash_data.bt[i].pc = _adjusted_pc(pc - 3);
    pc = *((uint32_t *)(psp - 0x10));
    if (pc < 0x40000000)
      break;
  }
}

  1. I do agree that disabling the watchdogs could potentially break the entire execution and I think that having a system that is reliable has the highest priority here. Maybe disabling them could be added as a "dangerous" option but at the current state of the art, I doubt I would want to do that. Adding some more logic to avoid re-entering the callback was something I didn't think of, maybe we should flag it as already executed so that it doesn't get called again.

  2. Regarding the fact of not adding this feature to the version of ESP-IDF I'm working with, I think it's obviously right. I referenced this version because it's the one I'm currently working with and I wasn't sure if other development versions have something different in them. After all, this is a new feature, so I'll be probably updating it to a development branch if I want to have it. It would be great to have in future releases.
igrr commented 4 years ago

Overall, if you had a way to add custom data to Core Dumps, would that satisfy your use case? Because Core Dumps can actually be uploaded to remote server after restart, and used to diagnose crashes.

leofabri commented 4 years ago

Overall, if you had a way to add custom data to Core Dumps, would that satisfy your use case? Because Core Dumps can actually be uploaded to remote server after restart, and used to diagnose crashes.

Core Dumps are definitely the solution to this, and adding some more data to it could actually satisfy my current use case. I think though that it would also be interesting to have that custom callback to handle crashes with a custom logic.

igrr commented 4 years ago

Yes, will look into adding custom callback as an option. I would still prefer to understand what requirements this satisfies, because if we understand the requirements we would rather prefer to implement the missing features in Core dump / GDB stub. Regarding adding custom logic, this is currently possible without modifications to IDF code, using linker wrapping feature. For instance, Memfault SDK uses it to override IDF Core dump implementation with their custom core dumps: https://github.com/memfault/memfault-firmware-sdk/blob/70f2d4ed6884d907d4b39da99434b762209ba568/ports/esp_idf/memfault/CMakeLists.txt#L79.

leofabri commented 4 years ago

Yes, I understand. I'm probably asking for a feature that I'm not going to be using immediately, therefore I'm unable to give specific requirements. Thank you so much for your availability and support. I'm going to use the linker feature to get what I need.

dexterbg commented 4 years ago

@igrr This feature was actually invented by me for the OVMS project: https://github.com/openvehicles/esp-idf/commit/daef4b5c11a646b7149bf3534e338f3070ae3abf

I think it was included in a pull request of ours at that time, but can't find that now. Sorry if we missed that.

It's meant to provide crash reasons and backtraces from user devices in the field. Uploading core dumps is not an option as most users are on mobile plans. My code copies just the essential crash info into RTC RAM. After reboot the info is sent as a crash debug record to the OVMS server. We keep the .elf files for all user builds and can feed the backtrace to gdb/addr2line to get an idea what happened.

See the whole OVMS crash debug code here: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/main/ovms_boot.cpp

I agree the current implementation may miss handling some edge cases, but it's been working very well for us and continues to provide crash backtraces (we're still on our idf 3 fork). We're now beginning migration to v4, so if you could add this feature officially to v4, that would be very appreciated.

Thanks & regards, Michael

igrr commented 4 years ago

Hi Michael, thanks for the extra background! I think for us it is much easier to support a concrete story of saving minimal information about the crash than to support arbitrary user code to be run from the panic handler. We already have an internal feature request for saving this additional minimal description of the crash, for the same reason as you mention (limiting data traffic to the cloud). It will likely be implemented in IDF 4.3, since 4.2 is now in feature freeze. Arbitrary user callbacks are still possible through the linker wrap feature, and I think IDF 4.2 makes overriding the panic handling logic somewhat easier by separating the target-specific code from the common one.

RoderickJDunn commented 4 years ago

@igrr Hi, I'm also interested in this feature (saving a minimal description of a crash, so it can be uploaded after reset). Is there a public issue where I can track its progress? If it's still only being tracked internally, do you have an update on when it will be available? Thanks

igrr commented 4 years ago

@RoderickJDunn it is indeed only tracked internally at the moment. I see some progress being made, and very likely the implementation will be available publicly before the end of the year. If you create a new issue here on Github for this topic, I will then link it internally so that you get notified once the feature is published.

crisreimberg commented 3 years ago

@igrr Is the feature of "save the crash info to be send after the reset" available? I am trying to implement this function using an ESP32 but I couldn't do it yet. Thanks.

igrr commented 3 years ago

@crisreimberg The feature to obtain the summary from the core dump is implemented, you need to call esp_core_dump_image_get on startup to check if the core dump image exists (optionally after checking the reset reason), and then call esp_core_dump_get_summary to obtain the summary information structure.

https://github.com/espressif/esp-idf/blob/c13afea635adec735435961270d0894ff46eef85/components/espcoredump/include/esp_core_dump.h#L140-L147

chipweinberger commented 2 years ago

This is the first google result for "esp32 custom panic handler".

After all the other panic handlers run, I want to:

It seems this is still not easily possible? The memfault trick might cover the LED and piezo, but it's pretty involved to have to create a new component with special linking.

edit (March 2024): I just run all my "panic" code after reboot. Simpler. You can still access the panic info

dizcza commented 1 year ago

This is the first google result for "esp32 custom panic handler".

+1

I'd like to turn the heater off if something wrong happens. As simple as GPIO set 0.

ftab commented 1 year ago

Similarly, I have a custom board which needs to be shut off in a certain sequence. Namely, an I2C command, a 50ms wait, and a few GPIOs enabled/disabled. Otherwise, there is a very loud pop when the panic handler resets. A way to add code to the panic handler would be great for this.

chipweinberger commented 1 year ago

I ended up doing my "panic" code at boot time, depending on the esp_reset_reason.

might be a viable option for some.

boarchuz commented 1 year ago

It only takes a couple lines to do this with linker wrapping. IDF even includes a test case that does this exact thing, for reference:

Custom handler: https://github.com/espressif/esp-idf/blob/3befd5fff72aa6980514454a50233037718b611f/tools/test_apps/system/panic/main/panic_utils/memprot_panic_utils_xtensa.c#L47 And this line in CMakeLists: https://github.com/espressif/esp-idf/blob/3befd5fff72aa6980514454a50233037718b611f/tools/test_apps/system/panic/CMakeLists.txt#L13

nicklasb commented 1 year ago

@igrr I am not sure what happened to this issues, it seems like a bot made a decision to not to it? :-)

To add a possibility to intervene after a panic is essential to smoothly handling error states, like reporting problems and dynamically resolve them.

Another example, instead of having a reset, going through a short deep sleep does the same thing and allows you easily store information using RTC_DATA_ATTR to orderly avoid ending up in the same situation again.

In many situations it is way better to get stuck in a panic handler than in a constant reset loop.

igrr commented 1 year ago

it seems like a bot made a decision to not to it? :-)

The bot simply syncs some properties of the issue from our internal issue tracker. In this case I've marked the issue as "won't do" there and the bot has applied the label here on Github.

I didn't mean it as: "we will never support custom panic handler callbacks", though. I meant that for the requirements of the author of the issue, I think the custom panic handler callback is not the right solution: it would be more complex, and the use case is common enough anyway to not require a custom callback. We have implemented a different solution: as part of another ticket we have added an API to extract information from core dumps, and closed this one as "won't do".

If you can describe your use case in more detail, I can give feedback whether we would likely support it via custom panic handler callbacks or not. Just a note, entering deep sleep, despite the seemingly simple behavior, ends up using a fair number of system-level features. Because of this, I don't think that calling esp_deep_sleep_start from the panic handler is safe, in general.

In the meantime, if you feel confident about it, the comment just above shows a very simple way of overriding the panic handler.

nicklasb commented 1 year ago

@igrr I suspected the bot wasn't that smart. :-)

Basically it comes down to me using the ESP32 more as a "real" computer or server than an MCU. I am developing a framework on top of ESP-IDF (and Arduino) that provides a lot of abstraction for communication, error reporting and other things, so if something goes wrong in all that code, which it does, a reset is a very unproductive and uniforming way to handle the situation. Imagine if your Linux machine only did that.

I have code that can recover from many of these states, and if a process (ok, task) crashes, there is no reason to reset the whole thing down as a first measure. Sure, at some point a reset might be necessary, but then in a controlled manner.

To me, resetting using deep sleep actually is pretty sufficient, and especially since I get many of the nice things of the reset but gets to keep information that enables the unit to get going again quickly. The reason for the problem is likely to be in my code rather than the system level features anyway, :-)

WRT the comment above, it is not a very portable solution.

And to be quite honest i am not sure why one would not want to have the possibility of a custom error handler in a computer system? It is such an enabling feature?

igrr commented 1 year ago

Thanks for explaining your use case!

To me, resetting using deep sleep actually is pretty sufficient, and especially since I get many of the nice things of the reset but gets to keep information that enables the unit to get going again quickly.

Assuming you keep the state in RTC memory, this is possible with a reset (not just deep sleep) as well, if you use RTC_NOINIT_ATTR.

WRT the comment above, it is not a very portable solution.

Could you please expand on this a bit? By portable do you mean "portable to another RTOS"? Seems like the panic handler callback wouldn't be portable, either?

And to be quite honest i am not sure why one would not want to have the possibility of a custom error handler in a computer system?

Panic handling is quite a delicate task. Many aspects of the system can not be used or relied upon: heap allocation, C library functions, RTOS functions. Therefore documenting the exact rules for panic handler callbacks — which IDF APIs are allowed to be called and which aren't — is not an easy task. Troubleshooting panic handler issues is often quite tricky, and potential bugs where the panic handler constantly re-enters and never resets the chip are one of the worst classes of bugs to have in a consumer product. Basically, adding this feature is simple but providing support to customers who decide to use it could be quite resource intensive for us.

For users who "know what they are doing" and are okay with taking responsibility for any possible issues there's a way to wrap the panic handler, mentioned above. For example, here's an example of a similar approach in MemFault SDK.

nicklasb commented 1 year ago

Assuming you keep the state in RTC memory, this is possible with a reset (not just deep sleep) as well, if you use RTC_NOINIT_ATTR.

Argh! I mixed those up, and used the RTC_DATA_ATTR instead! I thought it was so weird that it didn't work for me any longer as I had it working before.

A note about that is that the documentation around RTC_DATA_ATTR is a bit confusing, would be better if it was more specific and not so much about stubs and so on. It doesn't say that this data is reset where that information is expected. In myworld, that would be the most usual use case for the RTC memory. And when googling on RTC memory I ended up there.

Could you please expand on this a bit? By portable do you mean "portable to another RTOS"? Seems like the panic handler callback wouldn't be portable, either?

If ESP-IDF had a simple callback, the framework would just need code for that functionality and wouldn't have to require users to create specialized project/linker level settings to get its error handling. Basically the suggested solution makes 3rd party libraries using the callback much more difficult to use and spread.

Panic handling is quite a delicate task. ....Therefore documenting the exact rules for panic handler callbacks — which IDF > APIs are allowed to be called and which aren't — is not an easy task.

Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there. And if you find out that something works reliably you can document that later. Using hardware the way it is not supposed to be used has lead to a ton of great stuff, the entire 64 demo scene and almost all its games, for example.

norventa commented 7 months ago

My use case is thus: a gpio pin turns a relay on/off. The relay turns a kiln on/off. If the kiln is on and a panic happens I would like to insert code to turn the kiln off for safety reasons.

chipweinberger commented 7 months ago

simple solution: turn the gpio off after the device reboots.

I run all my panic handler code after reboot.

norventa commented 7 months ago

@chipweinberger Thanks for that. The problem I have is when I run in debug and there is no reboot. My app runs for upto 12 hours and I am not in attendance all that time. I have a very intermittent bug that I want to catch and I can't use debug for this reason.

nicklasb commented 7 months ago

simple solution Adding an effing callback is a pretty simple solution as well. :-)

It creates a lot of other opportunities, like creating higher-level frameworks that automatically reports issues, which is what I am doing.

TBH, I have not seen much in the way of strong arguments against it here, rather alternative solutions to specific problems, and that is another discussion, helpful as it may be.

igrr commented 7 months ago

@nicklasb I am sorry, I just realized I haven't replied to your previous comment.

If ESP-IDF had a simple callback, the framework would just need code for that functionality and wouldn't have to require users to create specialized project/linker level settings to get its error handling.

Link-time wrapping can be completely encapsulated in your custom component/library, I don't think anything would have to be done by the application developer. Basically, you can specify the additional linker arguments using a target_link_options call in your component CMakeLists file. All the user will have to do is to add your component to their project.

Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there.

How useful would such a restrictive callback be for your use case, and would it enable anything that Core Dump currently doesn't? As I mentioned above to the OP, we would very much rather add the missing data to the Core Dump, and then you can handle the crash after a restart in the sane execution environment of app_main.

(I understand that shutting down GPIOs or peripherals on reset is indeed something we don't support. We'll look into adding an option to do that.)

Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there.

I understand your point. Perhaps not to the point of sueing, but we did face issues with customers' use of dangerous features in the past, hence my hesitation.

norventa commented 7 months ago

@igrr Being sued isn't top of my avoidance list it is being responsible for someone getting hurt. What is the arguement against making the function panic_abort's linkage weak?

igrr commented 7 months ago

Lately we have been trying to avoid weak linkage as the method for overriding functions, where possible. The argument is that when two components try to override the same function using weak linkage, the result is unpredictable. When using callback registration, we can at least return an error (indicating that a callback is already registered) or return the previous callback, allowing the components to "chain" the calls.

I would still love to know what functionality/information you are missing in core dumps to implement your error reporting features based on that...

nicklasb commented 7 months ago

Both an error and a especially a chain solves the problem elegantly (I edited for some reason I didn't see the entire answer before, so I suggested just that).

I understand that you feel you lose control a bit there, but and link-time wrapping is less flexible and more difficult, we the users also want some control. Also, it is not as easy to see that there is a linkage like that for the casual developer. Please with sugar on top. :-)