Closed leofabri closed 4 years ago
@leofabri Thanks for raising the feature request, we will evaluate.
@leofabri Is it important to you at which stage in the panic handler your callback is executed? For example, if we add the custom handler at the same stage as GDB Stub / Core dump, would that work for you?
The other thing, I wouldn't suggest running the callback with watchdogs disabled, since at this point the watchdog remains the only thing that prevents us from getting stuck inside the panic handler. We also need to add some logic for not re-entering the custom callback if it crashes and we re-enter the panic handler.
Finally, if we implement this change, we will only do this on the current master branch, where the panic handler code is quite different from v4.0. We usually don't backport such features to older releases. If you do need the feature specifically on v4.0, I'm afraid you will have to keep using your local patch.
Hello @igrr first of all, thank you for getting back to me. I'm currently using this patch but I wasn't the one that first thought of it; therefore, the code I showed above is basically the original one and may not be perfect.
Regarding the things you pointed out:
void Boot::MyErrorCallback(XtExcFrame *frame, int core_id, bool is_abort)
{
boot_data.crash_data.core_id = core_id;
boot_data.crash_data.is_abort = is_abort;
// Save registers:
for (int i = 0; i < 24; i++)
boot_data.crash_data.reg[i] = ((uint32_t *)frame)[i + 1];
// Save backtrace:
// (see panic.c::doBacktrace() for code template)
#define _adjusted_pc(pc) (((pc)&0x80000000) ? (((pc)&0x3fffffff) | 0x40000000) : (pc))
uint32_t i = 0, pc = frame->pc, sp = frame->a1;
boot_data.crash_data.bt[0].pc = _adjusted_pc(pc);
pc = frame->a0;
while (++i < APP_BT_LEVELS)
{
uint32_t psp = sp;
if (!esp_stack_ptr_is_sane(sp))
break;
sp = *((uint32_t *)(sp - 0x10 + 4));
boot_data.crash_data.bt[i].pc = _adjusted_pc(pc - 3);
pc = *((uint32_t *)(psp - 0x10));
if (pc < 0x40000000)
break;
}
}
Overall, if you had a way to add custom data to Core Dumps, would that satisfy your use case? Because Core Dumps can actually be uploaded to remote server after restart, and used to diagnose crashes.
Overall, if you had a way to add custom data to Core Dumps, would that satisfy your use case? Because Core Dumps can actually be uploaded to remote server after restart, and used to diagnose crashes.
Core Dumps are definitely the solution to this, and adding some more data to it could actually satisfy my current use case. I think though that it would also be interesting to have that custom callback to handle crashes with a custom logic.
Yes, will look into adding custom callback as an option. I would still prefer to understand what requirements this satisfies, because if we understand the requirements we would rather prefer to implement the missing features in Core dump / GDB stub. Regarding adding custom logic, this is currently possible without modifications to IDF code, using linker wrapping feature. For instance, Memfault SDK uses it to override IDF Core dump implementation with their custom core dumps: https://github.com/memfault/memfault-firmware-sdk/blob/70f2d4ed6884d907d4b39da99434b762209ba568/ports/esp_idf/memfault/CMakeLists.txt#L79.
Yes, I understand. I'm probably asking for a feature that I'm not going to be using immediately, therefore I'm unable to give specific requirements. Thank you so much for your availability and support. I'm going to use the linker feature to get what I need.
@igrr This feature was actually invented by me for the OVMS project: https://github.com/openvehicles/esp-idf/commit/daef4b5c11a646b7149bf3534e338f3070ae3abf
I think it was included in a pull request of ours at that time, but can't find that now. Sorry if we missed that.
It's meant to provide crash reasons and backtraces from user devices in the field. Uploading core dumps is not an option as most users are on mobile plans. My code copies just the essential crash info into RTC RAM. After reboot the info is sent as a crash debug record to the OVMS server. We keep the .elf files for all user builds and can feed the backtrace to gdb/addr2line to get an idea what happened.
See the whole OVMS crash debug code here: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/main/ovms_boot.cpp
I agree the current implementation may miss handling some edge cases, but it's been working very well for us and continues to provide crash backtraces (we're still on our idf 3 fork). We're now beginning migration to v4, so if you could add this feature officially to v4, that would be very appreciated.
Thanks & regards, Michael
Hi Michael, thanks for the extra background! I think for us it is much easier to support a concrete story of saving minimal information about the crash than to support arbitrary user code to be run from the panic handler. We already have an internal feature request for saving this additional minimal description of the crash, for the same reason as you mention (limiting data traffic to the cloud). It will likely be implemented in IDF 4.3, since 4.2 is now in feature freeze. Arbitrary user callbacks are still possible through the linker wrap feature, and I think IDF 4.2 makes overriding the panic handling logic somewhat easier by separating the target-specific code from the common one.
@igrr Hi, I'm also interested in this feature (saving a minimal description of a crash, so it can be uploaded after reset). Is there a public issue where I can track its progress? If it's still only being tracked internally, do you have an update on when it will be available? Thanks
@RoderickJDunn it is indeed only tracked internally at the moment. I see some progress being made, and very likely the implementation will be available publicly before the end of the year. If you create a new issue here on Github for this topic, I will then link it internally so that you get notified once the feature is published.
@igrr Is the feature of "save the crash info to be send after the reset" available? I am trying to implement this function using an ESP32 but I couldn't do it yet. Thanks.
@crisreimberg The feature to obtain the summary from the core dump is implemented, you need to call esp_core_dump_image_get
on startup to check if the core dump image exists (optionally after checking the reset reason), and then call esp_core_dump_get_summary
to obtain the summary information structure.
This is the first google result for "esp32 custom panic handler".
After all the other panic handlers run, I want to:
It seems this is still not easily possible? The memfault trick might cover the LED and piezo, but it's pretty involved to have to create a new component with special linking.
edit (March 2024): I just run all my "panic" code after reboot. Simpler. You can still access the panic info
This is the first google result for "esp32 custom panic handler".
+1
I'd like to turn the heater off if something wrong happens. As simple as GPIO set 0.
Similarly, I have a custom board which needs to be shut off in a certain sequence. Namely, an I2C command, a 50ms wait, and a few GPIOs enabled/disabled. Otherwise, there is a very loud pop when the panic handler resets. A way to add code to the panic handler would be great for this.
I ended up doing my "panic" code at boot time, depending on the esp_reset_reason.
might be a viable option for some.
It only takes a couple lines to do this with linker wrapping. IDF even includes a test case that does this exact thing, for reference:
Custom handler: https://github.com/espressif/esp-idf/blob/3befd5fff72aa6980514454a50233037718b611f/tools/test_apps/system/panic/main/panic_utils/memprot_panic_utils_xtensa.c#L47 And this line in CMakeLists: https://github.com/espressif/esp-idf/blob/3befd5fff72aa6980514454a50233037718b611f/tools/test_apps/system/panic/CMakeLists.txt#L13
@igrr I am not sure what happened to this issues, it seems like a bot made a decision to not to it? :-)
To add a possibility to intervene after a panic is essential to smoothly handling error states, like reporting problems and dynamically resolve them.
Another example, instead of having a reset, going through a short deep sleep does the same thing and allows you easily store information using RTC_DATA_ATTR to orderly avoid ending up in the same situation again.
In many situations it is way better to get stuck in a panic handler than in a constant reset loop.
it seems like a bot made a decision to not to it? :-)
The bot simply syncs some properties of the issue from our internal issue tracker. In this case I've marked the issue as "won't do" there and the bot has applied the label here on Github.
I didn't mean it as: "we will never support custom panic handler callbacks", though. I meant that for the requirements of the author of the issue, I think the custom panic handler callback is not the right solution: it would be more complex, and the use case is common enough anyway to not require a custom callback. We have implemented a different solution: as part of another ticket we have added an API to extract information from core dumps, and closed this one as "won't do".
If you can describe your use case in more detail, I can give feedback whether we would likely support it via custom panic handler callbacks or not. Just a note, entering deep sleep, despite the seemingly simple behavior, ends up using a fair number of system-level features. Because of this, I don't think that calling esp_deep_sleep_start from the panic handler is safe, in general.
In the meantime, if you feel confident about it, the comment just above shows a very simple way of overriding the panic handler.
@igrr I suspected the bot wasn't that smart. :-)
Basically it comes down to me using the ESP32 more as a "real" computer or server than an MCU. I am developing a framework on top of ESP-IDF (and Arduino) that provides a lot of abstraction for communication, error reporting and other things, so if something goes wrong in all that code, which it does, a reset is a very unproductive and uniforming way to handle the situation. Imagine if your Linux machine only did that.
I have code that can recover from many of these states, and if a process (ok, task) crashes, there is no reason to reset the whole thing down as a first measure. Sure, at some point a reset might be necessary, but then in a controlled manner.
To me, resetting using deep sleep actually is pretty sufficient, and especially since I get many of the nice things of the reset but gets to keep information that enables the unit to get going again quickly. The reason for the problem is likely to be in my code rather than the system level features anyway, :-)
WRT the comment above, it is not a very portable solution.
And to be quite honest i am not sure why one would not want to have the possibility of a custom error handler in a computer system? It is such an enabling feature?
Thanks for explaining your use case!
To me, resetting using deep sleep actually is pretty sufficient, and especially since I get many of the nice things of the reset but gets to keep information that enables the unit to get going again quickly.
Assuming you keep the state in RTC memory, this is possible with a reset (not just deep sleep) as well, if you use RTC_NOINIT_ATTR.
WRT the comment above, it is not a very portable solution.
Could you please expand on this a bit? By portable do you mean "portable to another RTOS"? Seems like the panic handler callback wouldn't be portable, either?
And to be quite honest i am not sure why one would not want to have the possibility of a custom error handler in a computer system?
Panic handling is quite a delicate task. Many aspects of the system can not be used or relied upon: heap allocation, C library functions, RTOS functions. Therefore documenting the exact rules for panic handler callbacks — which IDF APIs are allowed to be called and which aren't — is not an easy task. Troubleshooting panic handler issues is often quite tricky, and potential bugs where the panic handler constantly re-enters and never resets the chip are one of the worst classes of bugs to have in a consumer product. Basically, adding this feature is simple but providing support to customers who decide to use it could be quite resource intensive for us.
For users who "know what they are doing" and are okay with taking responsibility for any possible issues there's a way to wrap the panic handler, mentioned above. For example, here's an example of a similar approach in MemFault SDK.
Assuming you keep the state in RTC memory, this is possible with a reset (not just deep sleep) as well, if you use RTC_NOINIT_ATTR.
Argh! I mixed those up, and used the RTC_DATA_ATTR instead! I thought it was so weird that it didn't work for me any longer as I had it working before.
A note about that is that the documentation around RTC_DATA_ATTR is a bit confusing, would be better if it was more specific and not so much about stubs and so on. It doesn't say that this data is reset where that information is expected. In myworld, that would be the most usual use case for the RTC memory. And when googling on RTC memory I ended up there.
Could you please expand on this a bit? By portable do you mean "portable to another RTOS"? Seems like the panic handler callback wouldn't be portable, either?
If ESP-IDF had a simple callback, the framework would just need code for that functionality and wouldn't have to require users to create specialized project/linker level settings to get its error handling. Basically the suggested solution makes 3rd party libraries using the callback much more difficult to use and spread.
Panic handling is quite a delicate task. ....Therefore documenting the exact rules for panic handler callbacks — which IDF > APIs are allowed to be called and which aren't — is not an easy task.
Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there. And if you find out that something works reliably you can document that later. Using hardware the way it is not supposed to be used has lead to a ton of great stuff, the entire 64 demo scene and almost all its games, for example.
My use case is thus: a gpio pin turns a relay on/off. The relay turns a kiln on/off. If the kiln is on and a panic happens I would like to insert code to turn the kiln off for safety reasons.
simple solution: turn the gpio off after the device reboots.
I run all my panic handler code after reboot.
@chipweinberger Thanks for that. The problem I have is when I run in debug and there is no reboot. My app runs for upto 12 hours and I am not in attendance all that time. I have a very intermittent bug that I want to catch and I can't use debug for this reason.
simple solution Adding an effing callback is a pretty simple solution as well. :-)
It creates a lot of other opportunities, like creating higher-level frameworks that automatically reports issues, which is what I am doing.
TBH, I have not seen much in the way of strong arguments against it here, rather alternative solutions to specific problems, and that is another discussion, helpful as it may be.
@nicklasb I am sorry, I just realized I haven't replied to your previous comment.
If ESP-IDF had a simple callback, the framework would just need code for that functionality and wouldn't have to require users to create specialized project/linker level settings to get its error handling.
Link-time wrapping can be completely encapsulated in your custom component/library, I don't think anything would have to be done by the application developer. Basically, you can specify the additional linker arguments using a target_link_options
call in your component CMakeLists file. All the user will have to do is to add your component to their project.
Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there.
How useful would such a restrictive callback be for your use case, and would it enable anything that Core Dump currently doesn't? As I mentioned above to the OP, we would very much rather add the missing data to the Core Dump, and then you can handle the crash after a restart in the sane execution environment of app_main.
(I understand that shutting down GPIOs or peripherals on reset is indeed something we don't support. We'll look into adding an option to do that.)
Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there.
I understand your point. Perhaps not to the point of sueing, but we did face issues with customers' use of dangerous features in the past, hence my hesitation.
@igrr Being sued isn't top of my avoidance list it is being responsible for someone getting hurt. What is the arguement against making the function panic_abort's linkage weak?
Lately we have been trying to avoid weak linkage as the method for overriding functions, where possible. The argument is that when two components try to override the same function using weak linkage, the result is unpredictable. When using callback registration, we can at least return an error (indicating that a callback is already registered) or return the previous callback, allowing the components to "chain" the calls.
I would still love to know what functionality/information you are missing in core dumps to implement your error reporting features based on that...
Both an error and a especially a chain solves the problem elegantly (I edited for some reason I didn't see the entire answer before, so I suggested just that).
I understand that you feel you lose control a bit there, but and link-time wrapping is less flexible and more difficult, we the users also want some control. Also, it is not as easy to see that there is a linkage like that for the casual developer. Please with sugar on top. :-)
Feature request:
I wish I could set a custom panic handler callback from my main application in order to get some additional info and intercept crashes from my main component.
What I'd like to obtain
I'm not sure if there is another and better way to do what I'm trying to achieve. I'm currently working on esp-idf v4.0 and it looks like that this feature is still missing even in the future releases currently under development.
Applying these changes to each update is time-consuming, and maybe other users could benefit from this. I would love to have it out of the box.
My current solution. Tested and working
I have edited the esp32 and freertos components in order to get the described feature, however, I wish you could consider adding something like this as part of the framework.
Changes (additions) made to esp-idf v4.0 to get the feature in question:
esp-idf/components/esp32/Kconfig:
esp-idf/components/esp32/panic.c:
esp-idf/components/esp32/panic.c commonErrorHandler():
components/freertos/include/freertos/xtensa_api.h:
Thank you in advance.