Open msxmatt opened 1 year ago
It would be helpful to see the startup messages from before the crash. The post-crash startup messages are useless since they show the results of parsing the built-in null configuration.
Sure, let me get a log of a full run that captures the startup before the crash.
I went ahead and commented out this section of code in Webserver.cpp
as it mentions it can help reproduce IRAM errors:
if (hash.length() && std::string(_webserver->header("If-None-Match").c_str()) == hash) {
    log_debug(path << " is cached");
    _webserver->send(304);
    return true;
}
// If you load or reload WebUI while a program is running, there is a high
// risk of stalling the motion because serving a file from
// the local FLASH filesystem takes away a lot of CPU cycles. If we get
// a request for a file when running, reject it to preserve the motion
// integrity.
// This can make it hard to debug ISR IRAM problems, because the easiest
// way to trigger such problems is to refresh WebUI during motion.
if (http_block_during_motion->get() && inMotionState()) {
    Web_Server::handleReloadBlocked();
    return true;
}
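To make the effect of commenting that block out easier to follow, here is a minimal host-side sketch of the decision order in the snippet above. `ServeDecision` and `decideServe` are hypothetical names for illustration, not FluidNC code; the real handler works on the web server object directly.

```cpp
#include <cassert>
#include <string>

// Hypothetical outcome of the pre-serve checks; not a FluidNC type.
enum class ServeDecision { NotModified, Blocked, ServeFile };

// Mirrors the order of the two checks quoted above:
//  1. cache hit (If-None-Match equals the file hash)  -> answer 304
//  2. blocking enabled and the machine is in motion   -> reject the request
// Anything else falls through to reading the file from flash.
ServeDecision decideServe(const std::string& fileHash,
                          const std::string& ifNoneMatch,
                          bool blockDuringMotion,
                          bool inMotion) {
    if (!fileHash.empty() && ifNoneMatch == fileHash) {
        return ServeDecision::NotModified;
    }
    if (blockDuringMotion && inMotion) {
        return ServeDecision::Blocked;
    }
    return ServeDecision::ServeFile;
}
```

Removing the second check behaves like passing `blockDuringMotion = false`: a request that misses the cache while a job is running gets served from the flash filesystem, which is the situation the code comment describes as the easiest way to trigger ISR IRAM problems.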
I did another run, and once the job started I was able to trigger a reboot right after I closed the WebUI (I didn't even get a chance to refresh):
[MSG:INFO: FluidNC bantam_v0.4.1]
[MSG:INFO: Compiled with ESP32 SDK:v4.4.4]
[MSG:INFO: Local filesystem type is littlefs]
[MSG:INFO: RESET: Reset due to power-on event]
[MSG:INFO: Configuration file:config.yaml]
[MSG:INFO: Machine Custom]
[MSG:INFO: Stepping:Timed Pulse:4us Dsbl Delay:1us Dir Delay:1us Idle Delay:255ms]
[MSG:INFO: User Digital Output:0 on Pin:gpio.5]
[MSG:INFO: User Digital Output:1 on Pin:gpio.43]
[MSG:INFO: User Digital Output:2 on Pin:gpio.44]
[MSG:INFO: User Digital Output:3 on Pin:gpio.42]
[MSG:INFO: User Digital Output:4 on Pin:gpio.6]
[MSG:INFO: Axis count 3]
[MSG:INFO: Axis X (0.000,1275.000)]
[MSG:INFO: Motor0]
[MSG:INFO: stepstick Step:gpio.17 Dir:gpio.16:low Disable:gpio.15:low]
[MSG:INFO: X Neg Limit gpio.1]
[MSG:INFO: Motor1]
[MSG:INFO: Axis Y (-1025.000,-1.000)]
[MSG:INFO: Motor0]
[MSG:INFO: stepstick Step:gpio.26 Dir:gpio.21:low Disable:gpio.18:low]
[MSG:INFO: Y Neg Limit gpio.2]
[MSG:INFO: Motor1]
[MSG:INFO: Axis Z (-25.000,0.000)]
[MSG:INFO: Motor0]
[MSG:INFO: stepstick Step:gpio.34 Dir:gpio.33 Disable:gpio.47:low]
[MSG:INFO: Motor1]
[MSG:INFO: enter_pin gpio.36:low]
[MSG:INFO: Kinematic system: CoreXY]
[MSG:INFO: Using spindle NoSpindle]
[MSG:INFO: Connecting to STA SSID:wifi]
[MSG:INFO: Connecting.]
[MSG:INFO: Connecting..]
[MSG:INFO: Connected - IP is 192.168.86.40]
[MSG:INFO: WiFi on]
[MSG:INFO: Start mDNS with hostname:http://fluidnc.local/]
[MSG:INFO: SSDP Started]
[MSG:INFO: HTTP started on port 80]
[MSG:INFO: Telnet started on port 23]
Grbl bantam_0.4 [FluidNC bantam_v0.4.1 (wifi) '$' for help]
[MSG:INFO: SD Card Detect Event]
[MSG:INFO: Mount_sd]
[MSG:INFO: Unmount_sd]
[MSG:INFO: Channel auto report interval set to 50 ms]
[MSG:INFO: Mount_sd]
Guru Meditation Error: Core 1 panic'ed (Cache disabled but cached memory region accessed).
Core 1 register dump:
PC : 0xbad00bad PS : 0x00060034 A0 : 0x80376312 A1 : 0x3fc9a490
A2 : 0x00000000 A3 : 0x3fcee20c A4 : 0x3fca4424 A5 : 0x00000003
A6 : 0x00000003 A7 : 0x00000000 A8 : 0x8037583e A9 : 0x3fc9a470
A10 : 0x3fcee310 A11 : 0x00ffffff A12 : 0x00000020 A13 : 0x80000000
A14 : 0x00000000 A15 : 0x3fcebd38 SAR : 0x00000004 EXCCAUSE: 0x00000007
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace: 0x7ad00baa:0x3fc9a490 |<-CORRUPTED
ELF file SHA256: 77e0c41ac4c7a9d7
Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x2b (SPI_FAST_FLASH_BOOT)
Saved PC:0x40001c38
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x44c
load:0x403c9700,len:0xbe4
load:0x403cc700,len:0x2a38
entry 0x403c98d4
[MSG:INFO: FluidNC bantam_v0.4.1]
[MSG:INFO: Compiled with ESP32 SDK:v4.4.4]
[MSG:INFO: Local filesystem type is littlefs]
[MSG:INFO: RESET: Software reset due to exception/panic]
[MSG:ERR: Skipping configuration file due to panic]
[MSG:INFO: Using default configuration]
[MSG:INFO: Axes: using defaults]
[MSG:INFO: Machine Default (Test Drive)]
[MSG:INFO: Board None]
[MSG:INFO: Stepping:RMT Pulse:4us Dsbl Delay:0us Dir Delay:0us Idle Delay:255ms]
[MSG:INFO: Axis count 3]
[MSG:INFO: Axis X (-1000.000,0.000)]
[MSG:INFO: Motor0]
[MSG:INFO: Axis Y (-1000.000,0.000)]
[MSG:INFO: Motor0]
[MSG:INFO: Axis Z (-1000.000,0.000)]
[MSG:INFO: Motor0]
[MSG:INFO: Kinematic system: Cartesian]
[MSG:INFO: Using spindle NoSpindle]
[MSG:INFO: Connecting to STA SSID:wifi]
[MSG:INFO: Connecting.]
[MSG:INFO: Connecting..]
[MSG:INFO: Connected - IP is 192.168.86.40]
[MSG:INFO: WiFi on]
[MSG:INFO: Start mDNS with hostname:http://fluidnc.local/]
[MSG:INFO: SSDP Started]
[MSG:INFO: HTTP started on port 80]
[MSG:INFO: Telnet started on port 23]
Grbl bantam_0.4 [FluidNC bantam_v0.4.1 (wifi) '$' for help]
[MSG:INFO: '$H'|'$X' to unlock]
I'm not sure how helpful this is, but I'll use this method to see if I can hunt down the source of the IRAM errors.
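Since the runs are long, it may help to scan the captured serial log for the panic line automatically instead of reading it by eye. This is only a host-side sketch that assumes the log has been saved as text; `PanicInfo` and `findPanic` are made-up names, and the only format it relies on is the "Guru Meditation Error: Core N panic'ed (reason)" line shown above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical result type for illustration.
struct PanicInfo {
    bool        found = false;
    int         core  = -1;
    std::string reason;
};

// Returns the first panic line found in a captured log, if any.
PanicInfo findPanic(const std::string& log) {
    const std::string  marker = "Guru Meditation Error: Core ";
    std::istringstream in(log);
    std::string        line;
    PanicInfo          info;
    while (std::getline(in, line)) {
        std::size_t pos = line.find(marker);
        if (pos == std::string::npos) {
            continue;
        }
        // Parse the core number that follows the marker.
        std::size_t i         = pos + marker.size();
        int         core      = 0;
        bool        haveDigit = false;
        while (i < line.size() && std::isdigit(static_cast<unsigned char>(line[i]))) {
            core      = core * 10 + (line[i] - '0');
            haveDigit = true;
            ++i;
        }
        if (!haveDigit) {
            continue;
        }
        info.found = true;
        info.core  = core;
        // The reason is the text between the first '(' after the core
        // number and the last ')' on the line.
        std::size_t open  = line.find('(', i);
        std::size_t close = line.rfind(')');
        if (open != std::string::npos && close != std::string::npos && close > open) {
            info.reason = line.substr(open + 1, close - open - 1);
        }
        return info;
    }
    return info;
}
```

Fed the log above, this should report core 1 and the "Cache disabled but cached memory region accessed" reason, which makes it easy to compare the reason across runs.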
Turn off autoreporting to see if that stops or reduces the crashes.
I set $Report/Interval=0 and ran again; it still crashes with the same error when I close the WebUI.
We also tried disabling autoreporting in code just for the duration of the job, but the crash still occurs.
I'll leave reporting turned off for now.
Okay, so that rules out websocket close races as the problem, at least for now.
Yeah, I've also tried both RMT and TIMED for our steppers, since our code used TIMED and some of the IRAM errors seemed to come from the step/unstep() functions. Both modes cause the crash.
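For context on the step/unstep() suspicion: the "Cache disabled but cached memory region accessed" panic generally means that code or constant data living in flash was touched while the flash cache was temporarily disabled, for example from an interrupt handler. The usual ESP-IDF remedy is to mark the handler, everything it calls, and any constant data it reads with `IRAM_ATTR` / `DRAM_ATTR`. The sketch below is a generic illustration with made-up names (`onStepTimer`, `stepBitsFor`, `kStepPinBit`), not FluidNC's stepping code; off-target the attributes are defined as no-ops so the logic can be compiled on a host.

```cpp
#include <cassert>
#include <cstdint>

#ifdef ESP_PLATFORM
#include "esp_attr.h"  // IRAM_ATTR / DRAM_ATTR on ESP-IDF based builds
#else
#define IRAM_ATTR      // no-op so the sketch compiles on a host
#define DRAM_ATTR
#endif

// Hypothetical step counter, updated only from the handler.
static volatile uint32_t g_stepCount = 0;

// Constant table read by the handler; DRAM_ATTR keeps it out of flash.
static DRAM_ATTR const uint8_t kStepPinBit[3] = { 0x10, 0x20, 0x40 };

// Helpers called from the handler need IRAM_ATTR as well; a single
// flash-resident callee is enough to cause the panic.
static uint8_t IRAM_ATTR stepBitsFor(uint8_t axesMask) {
    uint8_t bits = 0;
    for (int axis = 0; axis < 3; ++axis) {
        if (axesMask & (1u << axis)) {
            bits |= kStepPinBit[axis];
        }
    }
    return bits;
}

// Hypothetical timer handler: returns the port bits to pulse.
uint8_t IRAM_ATTR onStepTimer(uint8_t axesMask) {
    g_stepCount = g_stepCount + 1;
    return stepBitsFor(axesMask);
}

uint32_t stepCount() {
    return g_stepCount;
}
```

If both RMT and TIMED crash the same way, a handler outside the stepping path that lacks these attributes would also fit the symptom, so it may be worth auditing every ISR and its callees rather than only step/unstep().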
Wiki Search Terms
N/A
Controller Board
Custom ESP32-S3 board
Machine Description
3-axis gantry machine
Input Circuits
No response
Configuration file
Startup Messages
User Interface Software
WebUI, FluidTerm
What happened?
We're running FluidNC on a custom ESP32-S3-based board (thanks for your work on S3 so far). On a long job (typically 18 hours), we're seeing our machine randomly reboot at some point during the run, similar to #936. Typically the error looks like this:
It seemed to be related to Wi-Fi sockets, since we've been able to run the job once or twice with Wi-Fi off. However, it's hard to tell with such a long time to failure. We've merged in the latest (v3.7.4) since it sounds like there's been a lot of improvement in the Wi-Fi code, but the issue persists.
Currently, we're monitoring the heap using heapCheckTask, and we don't seem to have a memory leak, so that's good 👍. I thought it might be an SD card failure, so I'm currently debugging that. Is this still caused by websockets, or potentially something else? Happy to debug as needed.
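For anyone wanting to reproduce the heap check: the issue doesn't show what heapCheckTask does internally, so the following is only a guess at the general idea, a low-water-mark tracker that flags when free heap falls by more than a threshold. `HeapWatch` is a hypothetical class; on the ESP32 the samples would presumably come from something like `esp_get_free_heap_size()`, which is an assumption about the setup rather than something stated here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical low-water-mark tracker for periodic free-heap samples.
class HeapWatch {
public:
    explicit HeapWatch(std::size_t warnDropBytes) : _warnDrop(warnDropBytes) {}

    // Feed one free-heap sample. Returns true when the low-water mark has
    // fallen at least warnDropBytes below the last value that was reported.
    bool sample(std::size_t freeBytes) {
        if (!_hasSample) {
            _low       = freeBytes;
            _reported  = freeBytes;
            _hasSample = true;
            return false;
        }
        if (freeBytes < _low) {
            _low = freeBytes;
        }
        // _reported >= _low always holds, so the subtraction cannot wrap.
        if (_reported - _low >= _warnDrop) {
            _reported = _low;
            return true;
        }
        return false;
    }

    std::size_t lowWater() const { return _low; }

private:
    std::size_t _warnDrop;
    std::size_t _low       = 0;
    std::size_t _reported  = 0;
    bool        _hasSample = false;
};
```

A steadily falling low-water mark over an 18-hour job would point at a leak; a flat one, as reported here, suggests the reboot has a different cause.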
GCode File
(file is too big to attach)
Other Information
No response