Problem: Crash and Reboot on Long Jobs (still occurs on v3.7.4)

msxmatt commented 1 year ago

Wiki Search Terms

N/A

Controller Board

Custom ESP32-S3 board

Machine Description

3-axis gantry machine

Input Circuits

No response

Configuration file

name: "custom"
board: "custom"

stepping:
  engine: TIMED
  idle_ms: 255
  dir_delay_us: 1
  pulse_us: 4
  disable_delay_us: 1

start:
  must_home: false

planner_blocks: 16

control:
    feed_hold_pin: gpio.36:low

# uart1:
#   txd_pin: gpio.22
#   rxd_pin: gpio.21
#   baud: 115200
#   mode: 8N1

kinematics:
  corexy:

# 20t pulley
# steps per mm/microsteps   80/16   40/8    20/4    10/2    5/1
# resolution in mm          0.0125  0.025   0.050   0.100   0.200

# 16t pulley
# steps per mm/microsteps   100/16  50/8    25/4    12.5/2  6.25/1
# resolution in mm          0.010   0.020   0.050   0.080   0.160

# 1/4 microsteps gives best torque for a wide range of speeds

# Board stepper driver address and gpio
# axis  add   step  dir
# x     1     27    26
# y     3     33    32
# z     0     14    12
# e     2     16    17

axes:
  shared_stepper_disable_pin: NO_PIN
  shared_stepper_reset_pin: NO_PIN

  x:
    steps_per_mm: 40
    max_rate_mm_per_min: 10000
    acceleration_mm_per_sec2: 5000
    max_travel_mm: 1275
    soft_limits: false
    homing:
      cycle: 2
      allow_single_axis: true
      mpos_mm: 0
      positive_direction: false
      seek_mm_per_min: 4000
      feed_mm_per_min: 600
      settle_ms: 200
      seek_scaler: 1.100
      feed_scaler: 1.100

    motor0:
      limit_neg_pin: gpio.1:high
      pulloff_mm: 10
      stepstick:
        disable_pin: gpio.15:low
        direction_pin: gpio.16:low
        step_pin: gpio.17
    motor1:
      null_motor:

  y:
    steps_per_mm: 40
    max_rate_mm_per_min: 10000
    acceleration_mm_per_sec2: 5000
    max_travel_mm: 1024
    soft_limits: false
    homing:
      cycle: 1
      allow_single_axis: true
      mpos_mm: -1025
      positive_direction: false
      seek_mm_per_min: 4000
      feed_mm_per_min: 600
      settle_ms: 200
      seek_scaler: 1.100
      feed_scaler: 1.100

    motor0:
      limit_neg_pin: gpio.2:high
      pulloff_mm: 10
      stepstick:
        disable_pin: gpio.18:low
        direction_pin: gpio.21:low
        step_pin: gpio.26
    motor1:
      null_motor:

  z:
    steps_per_mm: 16
    max_rate_mm_per_min: 3000.000
    acceleration_mm_per_sec2: 3000.000
    max_travel_mm: 25
    soft_limits: false
    homing:
      cycle: -1

    motor0:
      #limit_pos_pin: gpio.4:high
      pulloff_mm: 4
      stepstick:
        disable_pin: gpio.47:low
        direction_pin: gpio.33
        step_pin: gpio.34
    motor1:
      null_motor:

spi:
  miso_pin: gpio.8
  mosi_pin: gpio.9
  sck_pin: gpio.10

sdcard:
  cs_pin: gpio.12
  card_detect_pin: gpio.14
  frequency_hz: 8000000

user_outputs:
  digital0_pin: gpio.5
  digital1_pin: gpio.43
  digital2_pin: gpio.44
  digital3_pin: gpio.42
  digital4_pin: gpio.6

# run a fan off the 24V/5V fan header – switch on and off with the WebUI mist button or by issuing an M8 command, turn off with M9
coolant:
  mist_pin: gpio.13
  delay: 0

macros:
  startup_line0:
  startup_line1:
  macro0:
  macro1:
  macro2:
  macro3:

Startup Messages

[MSG:INFO: FluidNC bantam_v0.4.0]
[MSG:INFO: Compiled with ESP32 SDK:v4.4.4]
[MSG:INFO: Local filesystem type is spiffs]
[MSG:INFO: RESET: Software reset due to exception/panic]
[MSG:ERR: Skipping configuration file due to panic]
[MSG:INFO: Using default configuration]
[MSG:INFO: Axes: using defaults]
[MSG:INFO: Machine Default (Test Drive)]
[MSG:INFO: Board None]
[MSG:INFO: Stepping:RMT Pulse:4us Dsbl Delay:0us Dir Delay:0us Idle Delay:255ms]
[MSG:INFO: Axis count 3]
[MSG:INFO: Axis X (-1000.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO: Axis Y (-1000.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO: Axis Z (-1000.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO: Kinematic system: Cartesian]
[MSG:INFO: Using spindle NoSpindle]
[MSG:INFO: Connecting to STA SSID:Workshop]
[MSG:INFO: Connecting.]
[MSG:INFO: Connecting..]
[MSG:INFO: Connected - IP is 192.168.5.55]
[MSG:INFO: WiFi on]
[MSG:INFO: Start mDNS with hostname:http://fluidnc.local/]
[MSG:INFO: SSDP Started]
[MSG:INFO: HTTP started on port 80]
[MSG:INFO: Telnet started on port 23]

User Interface Software

WebUI, FluidTerm

What happened?

We're running a custom ESP32-S3 based board to run FluidNC (thanks for your work on S3 so far). On a long job (typically 18 hours), we're seeing our machine randomly reboot at some point during the run, similar to #936 . Typically the error looks like this:

[MSG:INFO: Mount_sd]
[MSG:INFO: Unmount_sd]
[MSG:INFO: Channel auto report interval set to 50 ms]
[MSG:INFO: Mount_sd]
Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed). 

Core  1 register dump:
PC      : 0xbad00bad  PS      : 0x00060034  A0      : 0x803757e9  A1      : 0x3fc9a2e0
A2      : 0x3fcf7354  A3      : 0x00000001  A4      : 0x00000020  A5      : 0x80000000
A6      : 0x3fca506c  A7      : 0x3fca4a2c  A8      : 0x8037c559  A9      : 0xffffffff
A10     : 0x3fceeb20  A11     : 0x00000001  A12     : 0x00000000  A13     : 0x00000020
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000006  EXCCAUSE: 0x00000007
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000

Backtrace: 0x7ad00baa:0x3fc9a2e0 |<-CORRUPTED

ELF file SHA256: aeff0da83d1ef40a

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x28 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40045c0c
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x44c
load:0x403c9700,len:0xbe4
load:0x403cc700,len:0x2a38
entry 0x403c98d4

It seemed to be related to Wi-Fi sockets since we've been able to run the job once or twice with Wi-Fi off, however it's hard to tell with such a long time to failure. We've merged in the latest (v3.7.4) since it sounds like there's been a lot of improvement in the wifi code, but the issue persists.

Currently, we're monitoring the heap using heapCheckTask and we don't seem to have a memory leak so that's good 👍 . I thought it might be an SD card failure, so I'm currently debugging.

Is this still caused by websockets or potentially something else? Happy to debug as needed.

GCode File

(file is too big to attach)

Other Information

No response

MitchBradley commented 1 year ago

It would be helpful to see the startup messages from before the crash. The post-crash startup messages are useless since they show the results of parsing the built-in null configuration.

msxmatt commented 1 year ago

Sure let me get a log for a full run that captures the startup before the crash.

msxmatt commented 1 year ago

I went ahead and commented out this section of code in Webserver.cpp as it mentions it can help reproduce IRAM errors:

        if (hash.length() && std::string(_webserver->header("If-None-Match").c_str()) == hash) {
            log_debug(path << " is cached");
            _webserver->send(304);
            return true;
        }
        // If you load or reload WebUI while a program is running, there is a high
        // risk of stalling the motion because serving a file from
        // the local FLASH filesystem takes away a lot of CPU cycles.  If we get
        // a request for a file when running, reject it to preserve the motion
        // integrity.
        // This can make it hard to debug ISR IRAM problems, because the easiest
        // way to trigger such problems is to refresh WebUI during motion.
        if (http_block_during_motion->get() && inMotionState()) {
            Web_Server::handleReloadBlocked();
            return true;
        }

I did another run, and once the job started I was able to trigger a reboot right after I closed the webUI (I didn't even get a chance to refresh):

[MSG:INFO: FluidNC bantam_v0.4.1]
[MSG:INFO: Compiled with ESP32 SDK:v4.4.4]
[MSG:INFO: Local filesystem type is littlefs]
[MSG:INFO: RESET: Reset due to power-on event]
[MSG:INFO: Configuration file:config.yaml]
[MSG:INFO: Machine Custom][MSG:INFO: Stepping:Timed Pulse:4us Dsbl Delay:1us Dir Delay:1us Idle Delay:255ms]
[MSG:INFO: User Digital Output:0 on Pin:gpio.5]
[MSG:INFO: User Digital Output:1 on Pin:gpio.43]
[MSG:INFO: User Digital Output:2 on Pin:gpio.44]
[MSG:INFO: User Digital Output:3 on Pin:gpio.42]
[MSG:INFO: User Digital Output:4 on Pin:gpio.6]
[MSG:INFO: Axis count 3]
[MSG:INFO: Axis X (0.000,1275.000)]
[MSG:INFO:   Motor0]
[MSG:INFO:     stepstick Step:gpio.17 Dir:gpio.16:low Disable:gpio.15:low]
[MSG:INFO:  X Neg Limit gpio.1]
[MSG:INFO:   Motor1]
[MSG:INFO: Axis Y (-1025.000,-1.000)]
[MSG:INFO:   Motor0]
[MSG:INFO:     stepstick Step:gpio.26 Dir:gpio.21:low Disable:gpio.18:low]
[MSG:INFO:  Y Neg Limit gpio.2]
[MSG:INFO:   Motor1]
[MSG:INFO: Axis Z (-25.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO:     stepstick Step:gpio.34 Dir:gpio.33 Disable:gpio.47:low]
[MSG:INFO:   Motor1]
[MSG:INFO: enter_pin gpio.36:low]
[MSG:INFO: Kinematic system: CoreXY]
[MSG:INFO: Using spindle NoSpindle]
[MSG:INFO: Connecting to STA SSID:wifi]
[MSG:INFO: Connecting.]
[MSG:INFO: Connecting..]
[MSG:INFO: Connected - IP is 192.168.86.40]
[MSG:INFO: WiFi on]
[MSG:INFO: Start mDNS with hostname:http://fluidnc.local/]
[MSG:INFO: SSDP Started]
[MSG:INFO: HTTP started on port 80]
[MSG:INFO: Telnet started on port 23]

Grbl bantam_0.4 [FluidNC bantam_v0.4.1 (wifi) '$' for help]
[MSG:INFO: SD Card Detect Event]
[MSG:INFO: Mount_sd]
[MSG:INFO: Unmount_sd]
[MSG:INFO: Channel auto report interval set to 50 ms]
[MSG:INFO: Mount_sd]
Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed). 

Core  1 register dump:
PC      : 0xbad00bad  PS      : 0x00060034  A0      : 0x80376312  A1      : 0x3fc9a490  
A2      : 0x00000000  A3      : 0x3fcee20c  A4      : 0x3fca4424  A5      : 0x00000003  
A6      : 0x00000003  A7      : 0x00000000  A8      : 0x8037583e  A9      : 0x3fc9a470  
A10     : 0x3fcee310  A11     : 0x00ffffff  A12     : 0x00000020  A13     : 0x80000000  
A14     : 0x00000000  A15     : 0x3fcebd38  SAR     : 0x00000004  EXCCAUSE: 0x00000007  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace: 0x7ad00baa:0x3fc9a490 |<-CORRUPTED

ELF file SHA256: 77e0c41ac4c7a9d7

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x2b (SPI_FAST_FLASH_BOOT)
Saved PC:0x40001c38
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x44c
load:0x403c9700,len:0xbe4
load:0x403cc700,len:0x2a38
entry 0x403c98d4

[MSG:INFO: FluidNC bantam_v0.4.1]
[MSG:INFO: Compiled with ESP32 SDK:v4.4.4]
[MSG:INFO: Local filesystem type is littlefs]
[MSG:INFO: RESET: Software reset due to exception/panic]
[MSG:ERR: Skipping configuration file due to panic]
[MSG:INFO: Using default configuration]
[MSG:INFO: Axes: using defaults]
[MSG:INFO: Machine Default (Test Drive)]
[MSG:INFO: Board None]
[MSG:INFO: Stepping:RMT Pulse:4us Dsbl Delay:0us Dir Delay:0us Idle Delay:255ms]
[MSG:INFO: Axis count 3]
[MSG:INFO: Axis X (-1000.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO: Axis Y (-1000.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO: Axis Z (-1000.000,0.000)]
[MSG:INFO:   Motor0]
[MSG:INFO: Kinematic system: Cartesian]
[MSG:INFO: Using spindle NoSpindle]
[MSG:INFO: Connecting to STA SSID:wifi]
[MSG:INFO: Connecting.]
[MSG:INFO: Connecting..]
[MSG:INFO: Connected - IP is 192.168.86.40]
[MSG:INFO: WiFi on]
[MSG:INFO: Start mDNS with hostname:http://fluidnc.local/]
[MSG:INFO: SSDP Started]
[MSG:INFO: HTTP started on port 80]
[MSG:INFO: Telnet started on port 23]

Grbl bantam_0.4 [FluidNC bantam_v0.4.1 (wifi) '$' for help]
[MSG:INFO: '$H'|'$X' to unlock]

I'm not sure how helpful this is, but I'll use this method to see if I can hunt down the source of the IRAM errors.

MitchBradley commented 1 year ago

Turn off autoreporting to see if that stops or reduces the crashes

msxmatt commented 1 year ago

I set $Report/Interval=0 and ran again; it still crashes with the same error when I close the webUI

We tried to also disable autoreporting just during the job in the code, but it still has issues.

I'll leave reporting turned off for now.

MitchBradley commented 1 year ago

Okay, so that rules out websocket close races as the problem, at least for now.

msxmatt commented 1 year ago

Yea, I've also tried both RMT and TIMED for our steppers, since our code used TIMED and some of the IRAM errors seemed to come from step/unstep() functions. Both modes cause the crash.

bdring / FluidNC