Duet3D / RepRapFirmware

OO C++ RepRap Firmware
GNU General Public License v3.0

Regular crashes on Duet 3 Mini WiFi #935

Open dc42 opened 8 months ago

dc42 commented 8 months ago

See thread https://forum.duet3d.com/topic/33998/reboots-crashes-rrf-3-5-0-rc1. Only one user is currently reporting this. However, there are similarities in the stack trace each time, which suggest that it may be a firmware issue.

dc42 commented 8 months ago

Stack traces from crashes using 3.5.0-rc.1 or rc.1+ copied from forum thread:

RepRapFirmware for Duet 3 Mini 5+ version 3.5.0-rc.1 (2023-08-31 16:16:56) running on Duet 3 Mini5plus WiFi (standalone mode)
Last software reset at 2023-10-29 11:07, reason: HardFault invState, Gcodes spinning, available RAM 4516, slot 2
Software reset code 0x4063 HFSR 0x40000000 CFSR 0x00020000 ICSR 0x00487803 BFAR 0xe000ed38 SP 0x20011fa8 Task NETW Freestk 482 ok
Stack: 000001ae 20032c0a 0000000a 00000000 20032c0a 0009dff9 00000000 600f0000 00000000 00000000 00000000 00000000 20031a2c 00000800 20034d50 2002bf30 20018668 2002bd9d 20018668 2001e880 0002fedf 00000000 00000000 00000000 20012058 00000014 b5dd8a35
Error status: 0x04
Aux0 errors 0,0,0
MCU revision 3, ADC conversions started 1423060, completed 1423060, timed out 0, errs 0
MCU temperature: min 34.0, current 34.7, max 37.3
Supply voltage: min 22.9, current 24.1, max 26.1, under voltage events: 0, over voltage events: 0, power good: yes
Heap OK, handles allocated/used 99/33, heap memory allocated/used/recyclable 2048/792/356, gc cycles 66
Events: 0 queued, 0 completed

RepRapFirmware for Duet 3 Mini 5+ version 3.5.0-rc.1 (2023-08-31 16:16:56) running on Duet 3 Mini5plus WiFi (standalone mode)
Last software reset at 2023-10-29 18:28, reason: HardFault invState, Gcodes spinning, available RAM 5200, slot 0
Software reset code 0x4063 HFSR 0x40000000 CFSR 0x00020000 ICSR 0x00000803 BFAR 0xe000ed38 SP 0x20011fa8 Task NETW Freestk 482 ok
Stack: 000001b0 00000002 200014ec 00000000 ffffffff 0009df2d 00000000 600f0000 00000000 00000000 00000000 00000000 200301d4 00000800 20035710 2002bf00 20018668 2002bd9d 20018668 2001e880 0002fedf 00000000 00000000 00000000 20012058 00000014 00000000
Error status: 0x00
Aux0 errors 0,0,0
MCU revision 3, ADC conversions started 5068453, completed 5068451, timed out 0, errs 0
MCU temperature: min 34.4, current 35.1, max 37.8
Supply voltage: min 23.1, current 24.1, max 26.0, under voltage events: 0, over voltage events: 0, power good: yes
Heap OK, handles allocated/used 99/33, heap memory allocated/used/recyclable 2048/564/128, gc cycles 236
Events: 0 queued, 0 completed

RepRapFirmware for Duet 3 Mini 5+ version 3.5.0-rc.1+ (2023-11-01 10:29:03) running on Duet 3 Mini5plus WiFi (standalone mode)
Last software reset at 2023-11-03 23:29, reason: HardFault invState, Gcodes spinning, available RAM 10940, slot 2
Software reset code 0x4063 HFSR 0x40000000 CFSR 0x00020000 ICSR 0x00000803 BFAR 0xe000ed38 SP 0x20011f88 Task NETW Freestk 482 ok
Stack: 000001b0 00000002 200014e8 00000000 ffffffff 0009e9cd 00000000 600f0000 00000000 00000000 00000000 00000000 20031c4c 00000800 2002c0e0 2002c0e0 00000001 2002bf7d 20018658 2001e868 0002ff97 00000000 00000000 00000000 20012038 00000014 b5dd8a35
Error status: 0x04
Aux0 errors 0,0,0
MCU revision 3, ADC conversions started 785822, completed 785822, timed out 0, errs 0
MCU temperature: min 37.0, current 37.4, max 40.1
Supply voltage: min 22.5, current 24.1, max 26.5, under voltage events: 0, over voltage events: 0, power good: yes
Heap OK, handles allocated/used 99/33, heap memory allocated/used/recyclable 2048/1856/1420, gc cycles 36
Events: 1 queued, 1 completed
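
For readers unfamiliar with the fault registers in these reports, the following is a minimal sketch of how the HFSR, CFSR and ICSR values decode. The bit positions are those defined by the ARMv7-M architecture; the helper name `DescribeFault` and its output strings are hypothetical and are not taken from RepRapFirmware. All three reports share HFSR 0x40000000 and CFSR 0x00020000, which decode to a forced HardFault escalated from an INVSTATE usage fault. INVSTATE is raised when the core tries to execute with the Thumb state bit clear, for example after returning to an address whose bit 0 is zero.

```cpp
#include <cstdint>
#include <string>

// Bit positions from the ARMv7-M System Control Block definitions.
constexpr uint32_t HFSR_VECTTBL         = 1u << 1;   // fault on vector table read
constexpr uint32_t HFSR_FORCED          = 1u << 30;  // escalated configurable fault
constexpr uint32_t CFSR_UNDEFINSTR      = 1u << 16;  // UFSR bit 0
constexpr uint32_t CFSR_INVSTATE        = 1u << 17;  // UFSR bit 1
constexpr uint32_t CFSR_INVPC           = 1u << 18;  // UFSR bit 2
constexpr uint32_t CFSR_NOCP            = 1u << 19;  // UFSR bit 3
constexpr uint32_t ICSR_VECTACTIVE_MASK = 0x1FFu;    // active exception number

// Hypothetical helper (not part of RepRapFirmware): summarise the fault
// registers printed in a "Software reset code" line.
std::string DescribeFault(uint32_t hfsr, uint32_t cfsr, uint32_t icsr)
{
    std::string s;
    if (hfsr & HFSR_VECTTBL)    { s += "HardFault(vector table) "; }
    if (hfsr & HFSR_FORCED)     { s += "HardFault(forced) "; }
    if (cfsr & CFSR_UNDEFINSTR) { s += "undefInstr "; }
    if (cfsr & CFSR_INVSTATE)   { s += "invState "; }
    if (cfsr & CFSR_INVPC)      { s += "invPC "; }
    if (cfsr & CFSR_NOCP)       { s += "noCoprocessor "; }
    s += "exception " + std::to_string(icsr & ICSR_VECTACTIVE_MASK);
    return s;
}
```

Calling `DescribeFault(0x40000000, 0x00020000, 0x00000803)` with the values from the second and third reports yields `HardFault(forced) invState exception 3`, where exception number 3 is HardFault.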

dc42 commented 8 months ago

Common factors in stack traces:

0009e9cd in the rc.1+ build dated 1 Nov 2023 is xQueueGenericSend + 0x141, which is this code:

 656 013c FFF7FEFF      bl  vPortExitCritical
 657 0140 0120          movs    r0, #1
 658 0142 05B0          add sp, sp, #20
 659                    @ sp needed
 660 0144 F0BD          pop {r4, r5, r6, r7, pc}

0002ff97 in the rc.1+ build dated 1 Nov 2023 is WiFiSocket::Poll + 0x147, which is this code:

 870                .L173:
 871 0136 BDF82630      ldrh    r3, [sp, #38]
 872 013a BDF82410      ldrh    r1, [sp, #36]
 873 013e A384          strh    r3, [r4, #36]   @ movhi
 874 0140 2046          mov r0, r4
 875 0142 FFF7FEFF      bl  _ZN10WiFiSocket11ReceiveDataEt
 876 0146 97E7          b   .L157
 877                .L208:

.L157 is this:

 788                .L157:
 789 0078 0023          movs    r3, #0
 790 007a 84F82830      strb    r3, [r4, #40]
 791 007e 0DB0          add sp, sp, #52
 792                    @ sp needed
 793 0080 30BD          pop {r4, r5, pc}
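
The odd offsets quoted above (+0x141 and +0x147) can be matched to the listings with a little arithmetic. The sketch below assumes the usual Cortex-M convention that a stored return address has bit 0 set to mark Thumb state, and that each `bl` shown is a 4-byte encoding (as the `FFF7FEFF` bytes in the listings indicate). The function names are hypothetical, and the symbol base addresses are derived here by subtraction rather than taken from a map file.

```cpp
#include <cstdint>

// Clear the Thumb bit to get the listing offset a return address points at.
constexpr uint32_t ListingOffset(uint32_t symbolOffsetWithThumbBit)
{
    return symbolOffsetWithThumbBit & ~1u;
}

// A 32-bit BL at blOffset returns to the instruction 4 bytes later.
constexpr uint32_t ReturnOffsetAfterBl(uint32_t blOffset)
{
    return blOffset + 4u;
}

// xQueueGenericSend + 0x141: return from "bl vPortExitCritical" at 0x013c.
static_assert(ListingOffset(0x141u) == 0x140u, "points at movs r0, #1");
static_assert(ReturnOffsetAfterBl(0x13Cu) == ListingOffset(0x141u), "follows the bl");

// WiFiSocket::Poll + 0x147: return from the bl to ReceiveData at 0x0142.
static_assert(ListingOffset(0x147u) == 0x146u, "points at b .L157");
static_assert(ReturnOffsetAfterBl(0x142u) == ListingOffset(0x147u), "follows the bl");

// Derived (not from a map file): implied symbol start addresses in this build.
constexpr uint32_t kQueueSendBase = 0x0009E9CDu - 0x141u;
constexpr uint32_t kPollBase      = 0x0002FF97u - 0x147u;
static_assert(kQueueSendBase == 0x0009E88Cu, "implied base of xQueueGenericSend");
static_assert(kPollBase == 0x0002FE50u, "implied base of WiFiSocket::Poll");
```

Both stack entries are therefore ordinary return addresses: each points at the instruction immediately after a `bl`.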

_ZN10WiFiSocket11ReceiveDataEt is this:

 559 0000 2DE9F041       push    {r4, r5, r6, r7, r8, lr}
 560 0004 0546          mov r5, r0
 561 0006 86B0          sub sp, sp, #24
 562 0008 0C46          mov r4, r1
 563 000a 19B9          cbnz    r1, .L138
 564                .L113:
 565 000c 2C77          strb    r4, [r5, #28]
 566 000e 06B0          add sp, sp, #24
 567                    @ sp needed
 568 0010 BDE8F081      pop {r4, r5, r6, r7, r8, pc}
 569                .L138:
 570 0014 8069          ldr r0, [r0, #24]
 571 0016 FFF7FEFF      bl  _ZN13NetworkBuffer8FindLastEPS_

vPortExitCritical is this:

 258                vPortExitCritical:
 259                    @ args = 0, pretend = 0, frame = 0
 260                    @ frame_needed = 0, uses_anonymous_args = 0
 261 0000 074A          ldr r2, .L38
 262 0002 08B5          push    {r3, lr}
 263 0004 1368          ldr r3, [r2]
 264 0006 2BB1          cbz r3, .L37
 265 0008 013B          subs    r3, r3, #1
 266 000a 1360          str r3, [r2]
 267 000c 0BB9          cbnz    r3, .L33
 268                    .syntax unified
 269                @ 229 "C:\Eclipse\Firmware\FreeRTOS\src\portable\GCC\ARM_CM4F/portmacro.h" 1
 270 000e 83F31188         msr basepri, r3 
 271                @ 0 "" 2
 272                    .thumb
 273                    .syntax unified
 274                .L33:
 275 0012 08BD          pop {r3, pc}
 276                .L37:
 277 0014 0349          ldr r1, .L38+4
 278 0016 4FF4D370      mov r0, #422
 279 001a FFF7FEFF      bl  vAssertCalled
 280                .L39:
 281 001e 00BF          .align  2
 282                .L38:
 283 0020 00000000      .word   uxCriticalNesting

dc42 commented 8 months ago

Added a check for return address corruption in WiFiInterface::SendCommand. Now waiting for the user to test it and report back; see https://forum.duet3d.com/post/327312.

dc42 commented 7 months ago

Return address corruption, when it occurs, happens during the call to DisableSpi(). The data written appears to be a single 32-bit word and is always either zero or what looks like 4 ASCII characters. The character patterns seen are "=0.1" and "=0.6". I have added a debug watchpoint on the stored return address to see whether it is the processor core that writes to it. Now waiting for the user to test it.
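
To illustrate how a corrupted word can be recognised as text, here is a minimal sketch that renders a 32-bit word as four characters when every byte is printable ASCII. It assumes little-endian byte order, as on the Cortex-M4. The function name `WordAsAscii` is hypothetical and this is not the check that was added to the firmware.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: return the four bytes of a little-endian word as text
// if all of them are printable ASCII (0x20..0x7E), otherwise an empty string.
std::string WordAsAscii(uint32_t word)
{
    std::string text;
    for (unsigned int i = 0; i < 4; ++i)
    {
        // Byte 0 is the least significant byte, i.e. the lowest address.
        const unsigned int byte = (word >> (8u * i)) & 0xFFu;
        if (byte < 0x20u || byte > 0x7Eu)
        {
            return std::string();   // not text, e.g. the all-zero case
        }
        text += static_cast<char>(byte);
    }
    return text;
}
```

Under that assumption, the pattern "=0.1" corresponds to the word 0x312E303D and "=0.6" to 0x362E303D, while a zero word is reported as not being text.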

dc42 commented 2 months ago

The debug watchpoint (see previous comment) did not trigger, so it seems that it is not the CPU that is writing the zero word. It must be DMA or the DMA registers write-back function, or possibly a caching issue.

These reports https://forum.duet3d.com/topic/35611/my-printer-reboots-alone-3-5-1/ look to me like instances of the same issue, although the stack location that is overwritten by zeros is different.

dc42 commented 1 month ago

Another likely instance: https://forum.duet3d.com/topic/35833/duet-3-mini5-nightly-restarts/