minosg commented 3 years ago

Bug Description

This ticket is aimed to report a correlation on a common but hard to reproduce issue across a variety of unrelated users and configurations. I have been following older tickets and running experiments on a spare machine of mine using the BTT_SKR_MINI board 1.2 and Malyan_LCD

18117

18315

17161

Possibly related

18356

On STM32F1 platforms the watchdog is disabled which explains why it freezes in that mode, but I do no think the issue is limited to them

18226

My Configurations

configurations.zip

Steps to Reproduce

Download any latest marlin version 2.5.0X or latest bugfix
Use a board with TMC steppers in serial mode
Enable an LCD controller with UART connection, MaylanLCD is perferct because of the high baud-rate 50000 it requires. ( It makes it more likely to occur ). If your display allows you to
change the baud rate increase it to its maximum
Enable Linear advance, classic jerk and set your extruder Ejerk to something low like 2.0. This is aimed at increasing the stress on timing tolerances of planner.cpp
Print a high retraction model (Gecko) with high acceleration settings (la 02, EJerk 2.0, max extruder acceleration 5000+, retract on layer change & wipe)
Keep on printing until it freezes. It should never reach 5 times, most of the times it will happen in the first three prints.

Expected behavior:

I expect all the prints to complete

Actual behavior: At random occurrences the printer freezes with heaters on, and if watchdog is enabled it just resets

Additional Information

As discussed before #18177 , I have confirmed that most of the previous reported workarounds, while they make the issue less frequent, will not fix it. I have confirmed that it freezes when

Try older release tags of marlin.
Reduce the TMC serial speed to 115200. Mini e3 is using software serial which is not ideal.
Enable DISABLE_MULTI_STEPPING
Try with a higher watt power supply ( 220W )
Work with a higher voltage threshold 24V instead of 12V ( that is what you just confirmed )
Disable steathchop and run the drivers in Spreadcycle mode ( less frequent but it happens )
Disable Linear Advance, Hybrid threshold altogether
Set minimum square pulse to 2, and 4
Enable Square Wave Stepping
Enabled velocity ramping by increasing the ihold_delay using tmc_adv()
Reduce the uart baud rate of the TMC steppers.
Move TMC steppers on hardware uart
Enabled disable flow control, and crc on SD card
Changing the priority of the TMC stepper to 0

The only common occurrence across all tickets submitted with related behavior in the last 6 months are:

TMC steppers are running in UART mode
Screen is running in UART mode

I have not been able to locate a single report of it happening with steppers on standalone legacy mode.

What I believe is the actually cause of this issue: Theory 1 An interrupt kicks in when already in exception mode. Lr 0xfffffff9 indicates a EXC_RETURN state. When trying to recover the original LR it has been corrupted and the code lands in an undefined space

Theory 2: It could also be possible that the UartInterrupt flag for the screen uart is not properly cleared and the system locks in this loop of libmaple's usart_private.h

static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */
    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }
}

Consecutive triggering of the interrupt could cause stack corruption, resulting in a improperly formatted LR register

Last debug session call stack when the incident occurred

__irq_usart1@0x080093f0 (c:\Users\Minoas\.platformio\packages\framework-arduinoststm32-maple\STM32F1\cores\maple\libmaple\usart_f1.c:204)
<signal handler called>@0xfffffff9 (Unknown Source:0)
ExtUI::getProgress_percent@0x0801cfd0 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\lcd\extui\ui_api.cpp:809)
ExtUI::onIdle@0x0801dabc (c:\Users\Minoas\bigtree\Marlin\Marlin\src\lcd\extui_malyan_lcd.cpp:476)
MarlinUI::update@0x0801d1c8 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\lcd\extui\ui_api.cpp:1064)
idle@0x0800c7d4 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\MarlinCore.cpp:718)
Planner::get_next_free_block@0x08021b2a (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.h:649)
Planner::_buffer_steps@0x08021b2a (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.cpp:1673)
Planner::buffer_segment@0x08021cf2 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.cpp:2709)
Planner::buffer_segment@0x08021d7c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.h:743)
Planner::buffer_line@0x08021d7c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.cpp:2781)
Planner::buffer_line@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.h:774)
segmented_line_to_destination@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\motion.cpp:874)
line_to_destination_cartesian@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\motion.cpp:899)
prepare_line_to_destination@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\motion.cpp:1087)
GcodeSuite::G0_G1@0x08019e64 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\motion\G0_G1.cpp:105)
GcodeSuite::process_parsed_command@0x08018cdc (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\gcode.cpp:245)
GcodeSuite::process_next_command@0x0801921c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\gcode.cpp:935)
GCodeQueue::advance@0x0801b6b2 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\queue.cpp:649)
loop@0x0800c89c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\MarlinCore.cpp:1226)
main@0x0800ae8a (c:\Users\Minoas\.platformio\packages\framework-arduinoststm32-maple\STM32F1\cores\maple\main.cpp:41)

Registers state

sp = 0x2000be50
pc = 0x080093f0
lr = 0xfffffff9
psp = 0xcc1fd348
mps = 0x2000be50
XpSR = 0x41000035
primask = basepri = faultmask = control = 0x0

Workaround/Patches

The issue has been correctly identified and can be mitigated by patching your local maple libary, usually residing inside the platform folder

In windows the location of the file which is needed to be edited is: C:\Users\YOURUSERNAME.platformio\packages\framework-arduinoststm32-maple\STM32F1\system\libmaple\usart_private.h

In MacOS open a terminal and use the following command to find your platformio folder open ~/.platformio

In Linux it should be in the same location cd~/.platformio

The callback for usart in usart_private.h needs to be replaced by either patch

PATCH 1


static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */
    volatile uint8_t moufa;
    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_ORE)) {
        moufa = (uint8)regs->DR;
    }
}

PATCH 2

static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_ORE)) {
        #ifdef USART_SAFE_INSERT
                /* If the buffer is full and the user defines USART_SAFE_INSERT,
                 * ignore new bytes. */
                rb_safe_insert(rb, (uint8)regs->DR);
        #else
                /* By default, push bytes around in the ring buffer. */
                rb_push_insert(rb, (uint8)regs->DR);
        #endif
    }

}

How to choose:

The first patch has been tested and confirmed to be working by most users. If you are not sure of which to apply choose that
The second patch is what I have been using myself recently, since it has a slightly higher probability to catch a new line character than the first patch. Use if you are adventurous

minosg commented 3 years ago

In addition to that the ExtUI::getProgress_percent() appears to be invoking MarlinUI::progress_t MarlinUI::_get_progress() which when printing from SD will be calling this slow method

        #if ENABLED(SDSUPPORT)
          ?: TERN(HAS_PRINT_PROGRESS_PERMYRIAD, card.permyriadDone(), card.percentDone())
        #endif

I could be wrong since I am not familiar with UI but it appears as if the code is invoking an SD access method which is relatively slow, and hoping that the TMC stepper interrupt will not trigger in between?

boelle commented 3 years ago

@minosg still an issue?

minosg commented 3 years ago

Yes that's still an issue.

The core appears to be dying by recursive interrupt uart calls. It's not something that can be fixed by a config change.

I am using this ticket to provide more technical info of the issue, or we will keep on seeing random my printer freezes bugs.

I will be tracing that behaviour next week. If there is a limited time a ticket is allowed to be open feel free to close it and we can continue to chat on it anyway. Thanks

boelle commented 3 years ago

Please test the bugfix-2.0.x branch to see where it stands.

minosg commented 3 years ago

@boelle I have tested with latest bugfix and the issue is still here.

minosg commented 3 years ago

Theory 3:

I have been testing this board using a debugger. The original theory that the LR Is corrupted seems less likely and I believe it is more of an interrupt firing issue.

I believe it was introduced by this pr https://github.com/MarlinFirmware/Marlin/pull/14030 , but need deeper understanding of the stepper class.

What happens when the printer freezes.

A freeze will become evident when obviously the printer is no longer moving. The logic is NOT stuck in exception mode, and the interrupt will be returning to the main thread. The code snippet that is of interest is this part

https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/planner.h

FORCE_INLINE static uint8_t moves_free() { return BLOCK_BUFFER_SIZE -1 - movesplanned(); }
FORCE_INLINE static uint8_t movesplanned() { return BLOCK_MOD(block_buffer_head - block_buffer_tail); }
FORCE_INLINE static block_t* get_next_free_block(uint8_t &next_buffer_head, const uint8_t count=1) {

  // Wait until there are enough slots free
  while (moves_free() < count) { idle(); }

  // Return the first available block
  next_buffer_head = next_block_index(block_buffer_head);
  return &block_buffer[block_buffer_head];
}

Which causes the UI update loop to trigger, and the lcd uart to keep on firing. The problem appears to be that this while loop is not breaking .

Chasing the logic the block_tail is only updated in the following method loop which appears is no longer firing

Stepper::pulse_phase_isr()-> discard_current_block -> release_current_block Stepper::block_phase_isr()-> discard_current_block -> release_current_block

The interrupt configuration on this platform is the following

TMC_STepper SoftwareSerial Timer4
STEP_TIMER_NUM             Timer5
PULSE_TIMER_NUM            Timer5
TEMP_TIMER_NUM             Timer2
SERVO0_TIMER_NUM           Timer8

NVIC_SetPriorityGrouping(0x3);
#define STEP_TIMER_IRQ_PRIO 1
#define TEMP_TIMER_IRQ_PRIO 2

Which appear to be enabled during the time of the freeze

NVIC-ISER0 0x501001c0
NVIC-ISER1 0x00044160

And the interrupt priorities appear to be in order

IPR7_N0 ->  0x20
IPR12_N2-> 0x10
IPR9_N1 -> 0xd0

I have also checked timer5 and the CC register as well as the count register and the timer appears to be counting as expected.

Further notes

This platform appears to not be using instruction and memory barriers for interrupts. I do not know if there is a design choice behind that, but the pull request I quoted is enabling and disabling timers using CPSID, CPSIE instructions, with no guarantees that they are actually in the state that we want them to.

I am not certain on the priority grouping setting. Considering that I cannot see any use of sub-priorities in Marlin Core, are they being used in Mapple or should it be set to 0x4 anyway? Is that code a snippet of when we were using Autopreload with the timers instead of resetting them?

When the board freezes most interrupts but the lcd and the temperature stop firing. This is also very bizarre behavior which I am trying to track down. This is also a very happy coincidence, since if the temperature interrupts dies the hotend wil cause meltdown/fire so its a considerable safety risk. I have disabled interrupt grouping and set the uasart1 ( lcd) interrupt to have the lowest priority (15) and yet it keeps on firing, which indicates there is no interrupt preemption or starvation happening.

What I suspect happens is because of too many interrupts firing , ( TMC set in UART modes individually and not chained in a single line ) something in the stepper.cpp logic mini scheduler is falling out of its timing bounds. The problem is that when the issue happens is way before it becomes evident in the logic. so a longer trace is needed which is not supported by st-link.

To be continued.

kind3r commented 3 years ago

I think the issue is not STM32 only as I have a LPC1768 and seing similar symptoms since I added a BTT TFT 35 with Marlin mode LCD to my MKS SGEN L board with TMC2209s in UART mode.

My setup is as follows:

Octoprint running on a PI4 connected to TFT 35 (uart5 on PI to uart3 on TFT)
TFT connected to MKS SGEN L board via:
- RS232 port on the TFT to AUX1 port on SGEN L
- EXP3-CR10 port on the TFT to EXP1 port on the SGEN L

The TFT proxies the commands from the PI to the board. I have slightly modified the BTT firmware to parse the M117 commands and display them in TFT mode.

What I have discovered so far in my tests (granted not yet exhaustive):

Higher acceleration produces a more reliable crash
Marlin with LCD support with or without the LCD cable connected crashes (4 prints crashed 1h-2h in)
Marlin without LCD support seems to work fine (1 print so far 3h45m)
When the crash occurs, the serial port on the board becomes unresponsive but if I use the TFT's disconnect/reconnect option it resumes printing (if you are fast enough and Octoprint does not timeout). I am unable yet to determine what happens during the crash (if the watchdog resets the board or not). I guess if serial communication recovery was better handled in the BTT TFT you would only see the print pausing for a few seconds.

minosg commented 3 years ago

When the crash occurs, the serial port on the board becomes unresponsive but if I use the TFT's disconnect/reconnect option it resumes printing

That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts.

But the fact that even slower lcd's cause the issue, could also verify the hypothesis that the added logic on the loop, is what is pushing a timing sensitive task outside of the acceptable boundaries. Compiling out and CLD will significantly decrease the ui.update() loop's time.

Still cannot undestand why acceleration affects it. Are you using Linear advance? High acceleration will triggerthe la_isr() which is right in the heart of the stepper.cpp scheduler

kind3r commented 3 years ago

That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts.

I think this is where the disconnect happens in BTT firmware https://github.com/guruathwal/BIGTREETECH-TouchScreenFirmware/blob/c189809e95c627d1680ef81533b32e8fac56514b/TFT/src/User/Menu/SettingsMenu.c#L53

Still cannot undestand why acceleration affects it. Are you using Linear advance? High acceleration will triggerthe la_isr() which is right in the heart of the stepper.cpp scheduler

LA yes, set to 0.1 (I thought from your tests it did not really matter), acceleration is set to 1000 and speed to 90mm/s.

Starting a second 3h print without LCD support.

taragor commented 3 years ago

I have a very similar setup: SKR mini E3 V2, BTT TFT35. I had frequent crashes with #define CR10_STOCKDISPLAY enabled. From my testing it doesn't matter whether the stock Ender-3 or the TFT35 is connected. I've also had crashes with #define CR10_STOCKDISPLAY disabled, TFT35 connected (running in BTT mode) to the TFT header (SERIAL_PORT 2), and octopi connected via USB (SERIAL_PROT_2 -1). However when using only the TFT35 in BTT mode, with no octoprint I didn't have a single crash no matter how aggressive I set LA, jerk, Accelerations. I've printed 20+ hours with this setup and no issues so far. So I guess it is somewhat related to the number of serial connections aswell as the kind (stepper, display, or printhost)

I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that? Also I'm still waiting for my STLink to get into some proper debugging.

kind3r commented 3 years ago

I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that?

No special configuration just use guruathwal fork as it fixes a number of issues with uart forwarding. The only issue that still stands is the fact that because the lines sent by Octoprint start with line numbers (Nxxx) the TFT does not parse the information from them so for instance M117 texts are not shown on the display, fan speed does not update etc. I am working on a patch for this (seems to be working fine so far so I will publish this week).

kind3r commented 3 years ago

Second print without LCD support froze after 2h :( I was able to resume using the disconnect method.

p3p commented 3 years ago

@kind3r I'm not sure your issue is related to this one, as the board does not freeze (being an LPC176x the watchdog would catch that if it did so there would be no risk of the heaters being left on),

From your description you are not using the normal usb cdc serial port (which is a 10Mbit, lossless, flow controlled connection), but bridging standard uart through a tft display board, it almost seems like the gcode commands to Marlin are just stopping at some point until you reset the tfts serial connection?

do you have any way to check that there is data getting to LP176x uart while in the frozen state?

minosg commented 3 years ago

@p3p THe watchdog is what made this issue visible and this is why it had its own ticket, but inherently we are talking about the same Marlin core cause.

The key element to test for this issue is the following If your printer crashes ( watchdog reset/ freeze ) when printing with high acceleration, and having an LCD compiled.

M201 X9000 Y9000 Z500 E5000 ; sets maximum accelerations, mm/sec^2 M203 X500 Y500 Z12 E80 ; sets maximum feedrates, mm/sec M204 P1000 R1000 T1000 ; sets acceleration (P, T) and retract acceleration (R), mm/sec^2 M205 X10.00 Y10.00 Z0.20 E2.00 ; sets the jerk limits, mm/sec M205 S0 T0 ; sets the minimum extruding and travel feed rate, mm/sec

But does never freeze if there is display compiled, and you are printing from SD using M21, M24 then you are seeing the same issue.

As discussed above I think that the the planner stops discarding blocks, and the firmware freezes waiting for new free blocks. When that happens some interrupts can fire ( temperature, LCD ) while others don't.

kind3r commented 3 years ago

It also hapened while printing from the PI connected via USB (in which case I unplugged the USB cable and reconnected it but it was still frozen). I still need to run more tests to make it repeatable, but it seems similar (something uart related in either case). I'm not sure that heaters are still on as I imediatly reacted every time, but next time it happens I will leave it frozen for longer to check if the heaters are really on or not.

minosg commented 3 years ago

@kind3r Can you please try to print using SD directly and no usb host? Pronterface has a nice UI element ( SD ) to help with that. If you have a freeze, please let me know.

kind3r commented 3 years ago

@minosg Could you clarify a bit the conditions ?

Use board's SD Card
LCD support enabled or disabled ?
Can I use Octoprint's terminal to start the print ?
Do I have to disconnect all serials after starting the print or can they remain connected (so I can monitor) ?

p3p commented 3 years ago

If the issue is happening with the USB CDC connection and the hardware UART along with both STM32 and LPC then I guess I can rule out it being a low level issue (in my LPC Arduino framework) that's getting one of those into a bad state, they are very different peripherals,

@kind3r I'm not sure if your board manufacturer has messed with the default bootloader or not but LPC176x should never boot loop, so if the watchdog triggers the board will be stalled in a safe state until hardware reset, in theory that rules out an actual mcu crash at least.

minosg commented 3 years ago

@kind3r

Enable Linear advance.
Compile out LCD support. This will make the ui.update() loop faster on idle.
Put the test file on sd card on the printer board.
Connect using pronterface and use the SD button to select the file to print or do it manually using M21, M24 and filename (be warned the filesystem many need special characters to the name)
You should disconnect all serials , but I personally have not had any freezes even when connected for debugging using this mode.

@p3p For the STM case, I can confirm it happens regardless if you have a usb connected or not. By connecting the serial, or any other action which that delays the logic it causes it to happen way more often. With the ST-Link attached it happens in the first 10 minutes, while normally it takes a couple of prints to trigger.

minosg commented 3 years ago

Update on the investigation on this issue. I run a tracing session today, and the following race condition was observed.

Looking into the stepper.cpp logic in particular https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/stepper.cpp

When the freeze happens, the interrupts and the timers are enabled and running/firing at the appropriate rate.

The block_phase_isr() called the finalize state right before the printer froze

    .....
    /////
    // If current block is finished, reset pointer and finalize state
    if (step_events_completed >= step_event_count) {
                 discard_current_block();  
    }

discard_current_block() will set the pointer to null

    FORCE_INLINE static void discard_current_block() {
      current_block = nullptr;
      axis_did_move = 0;
      planner.release_current_block();
    }

Then the pulse_base_isr will fire next time and exit

void Stepper::pulse_phase_isr() {
   ......
   ......
  // If there is no current block, do nothing
  if (!current_block) {
      return;
  }

The next time the block_phase_isr() is called it will attempt to retrieve a new block and exit

uint32_t Stepper::block_phase_isr() {
....
  if (!current_block) {

    // Anything in the buffer?
    if ((current_block = planner.get_current_block())) {
    }
    return;
}

Planner uses this logic

block_t* Planner::get_current_block() {
  // Get the number of moves in the planner queue so far
  const uint8_t nr_moves = movesplanned();

  // If there are any moves queued ...
  if (nr_moves) {
   }
   return nullptr;

Moves planned will always return 0 since the head and tail of Planner:block_buffer are the same

  FORCE_INLINE static uint8_t movesplanned() { return BLOCK_MOD(block_buffer_head - block_buffer_tail); }

/* Trace of the public values of block buffer */
      block_buffer:16
      block_buffer_head:15
      block_buffer_nonbusy: 15
      block_buffer_planner: 15
      block_buffer_tail: 15
      cleaning_buffer_counter:0

This deadlocks the stepper logic.

I would assume that step_events_completed >= step_event_count is a normal condition, and should happen in the end of every loop. The question is why it never recovers form it.

kind3r commented 3 years ago

@minosg I ran the test like you said and it worked fine (from SD Card, all serials disconnected after starting the print).

I also ran another test from Octoprint via USB and it also worked fine. I think @p3p is right and it seems that my issue is related to the BTT TFT uart handling and not Marlin's since printing resumes after disconnecting and reconnecting from the TFT and I don't think Marlin restarts, it's just the TFT that stops communication.

minosg commented 3 years ago

A minor update

I have good news and bad news.

The good news. I have identified an actual bug in the planner, which is making this issue worse. The bad news, fixing this bug does not resolve the deadlock situation, just decreases it's frequency.

Planner is slicing moves as a series of micro-moves made to fit in a ring buffer ( Default size of 16 block ) .Ring buffers and interrupts are a nasty combination since a double increment when moving the head/tail can make an empty buffer appear full or the other way around.

From the code

    /**
     * The move buffer, calculated in stepper steps
     *
     * block_buffer is a ring buffer...
     *
     *             head,tail : indexes for write,read
     *            head==tail : the buffer is empty
     *            head!=tail : blocks are in the buffer
     *   head==(tail-1)%size : the buffer is full

The good news is when this was designed a logic was included to make sure that things are delayed a bit when the buffer is working at its boundaries ( ie 15 out of 16 blocks full ). This is done by delay_before_delivering

In the beginning of each move this logic is triggered which is aimed at giving 100ms to the stepper driver to move the tail before the new line starts

 // If this is the first added movement, reload the delay, otherwise, cancel it.
  if (block_buffer_head == block_buffer_tail) {
    // If it was the first queued block, restart the 1st block delivery delay, to
    // give the planner an opportunity to queue more movements and plan them
    // As there are no queued movements, the Stepper ISR will not touch this
    // variable, so there is no risk setting this here (but it MUST be done
    // before the following line!!)
    delay_before_delivering = BLOCK_DELAY_FOR_1ST_MOVE;
  }

The problem is that the stepper will call get_current_block which is implemented as such

block_t* Planner::get_current_block() {
  // Get the number of moves in the planner queue so far
  const uint8_t nr_moves = movesplanned();

  // If there are any moves queued ...
  if (nr_moves) {

    // If there is still delay of delivery of blocks running, decrement it
    if (delay_before_delivering) {
      --delay_before_delivering;
      // If the number of movements queued is less than 3, and there is still time
      //  to wait, do not deliver anything
      if (nr_moves < 3 && delay_before_delivering) return nullptr;
      delay_before_delivering = 0;
    }

The latter is equivalent to

     if (delay_before_delivering) {
           if (nr_moves >= 3) {
              delay_before_delivering = 0;
           } else {
              return nullptr;
           }
      }

Which is effectively cancelling the delay and wasting a cycle of stepper logic if it returns null pointer

I believe it was aimed to be written as

    if (delay_before_delivering) {
      --delay_before_delivering;
      // If the number of movements queued is less than 3, and there is still time
      //  to wait, do not deliver anything
      if (nr_moves < 3 && delay_before_delivering) {
          delay_before_delivering = 0;
          return nullptr;
      }
    }

It is advised that you apply this change to your code. It will not stop the printer freezing deadlock but will make it more smooth.

thisiskeithb commented 3 years ago

It is advised that you apply this change to your code. It will not stop the printer freezing deadlock but will make it more smooth.

We’re accepting PRs, so please feel free to contribute your changes.

minosg commented 3 years ago

This is a minor change, and hardly a critical one. Weather it gets contributed and attributed to me or anyone who reads that comments and adds it to their already pending PR is of little importance compared to fixing the core issue.

I currently am deep down the rabbit hole of investigating this issue which effectively does not allow you to use drivers in uart mode alongside a display or octorpint. When and if that is resolved, all side findings can be merged in a larger PR without blocking valuable maintainers` time.

kind3r commented 3 years ago

@minosg I think you are misreading the code a bit, the ideea is to have a delay of BLOCK_DELAY_FOR_1ST_MOVE cycles if there are less than 3 moves available, else, the delay is reset to 0. Your proposed change waits for only one cycle.

minosg commented 3 years ago

I could really use any help on understanding the planner code so thanks for commenting that.

I could not determine intent of this code to begin with. Why should the delay only apply for when there are less than 3 moves planned?

During a normal operation the HEAD > TAIL and moves planned is the SIZE_OFF_BUFF -1 >= 3 so it will default to zero right after first invocation effectively using no delay and proceed to deliver the block.

If the logic intended is:

If delay is set and less than 3 blocks, do not return anything, and wait for it called again to decrement the counter.
If delay is set and more or equal 3 blocks planned, wait and then deliver ( guaranteed to give you a block )

Then the above code is not doing that What it does is:

If delay is set and less than 3 blocks, do not return anything, and wait for it called again to decrement the counter. -If delay is set and more or equal 3 blocks planned, then do not delay and continue

From the comments in the code about this delay, it seems to indicate it was added to protect you from the ISR messing the head/tail logic.

    // If it was the first queued block, restart the 1st block delivery delay, to
    // give the planner an opportunity to queue more movements and plan them
    // As there are no queued movements, the Stepper ISR will not touch this
    // variable, so there is no risk setting this here (but it MUST be done
    // before the following line

thecynic commented 3 years ago

Is there a reason why the interrupt isn't disabled while non-interrupt code manipulates the ring buffer?

minosg commented 3 years ago

@thecynic I would assume this is happening for safety reasons. Disabling interrupts can be tricky when you are trying to keep some control over pwm for hotend. And not to mention not uniform across platforms ( legacy AVR controllers do it instantaneously, but in Arm you need to use memory and instruction barriers )

I think I may have a new hypothesis the root cause.

Theory 4: Queue Starvation

Currently this is the priority configuration of interrupts for the design

#define STEP_TIMER_IRQ_PRIO 1
#define TEMP_TIMER_IRQ_PRIO 2

The planner.cpp is writing to the ring buffer by accessing the head, while the stepper is consuming that data by accessing the tail. The buffer is full when head = tail -1 and Empty when head == tail

Both producer and consumer rely on get_next_free_block() and get_current_block() which inherently compares the head-tail relationship in order to determine the state of the queue.

But the planner which is the producer of the moves, is running in a lower priority than the stepper which is the consumer.

    /**
     * Planner::get_next_free_block
     *
     * - Get the next head indices (passed by reference)
     * - Wait for the number of spaces to open up in the planner
     * - Return the first head block
     */
    FORCE_INLINE static block_t* get_next_free_block(uint8_t &next_buffer_head, const uint8_t count=1) {

      // Wait until there are enough slots free
      while (moves_free() < count) { idle(); }

Let's assume the following scenario.

The planner printing a full move HEAD +1 == TAIL, of very rapid actions ( retractions, small lines etc)
It requests a get_next_free_block() which compares the queue distance.
The code is running in thread mode, so the priority of this logic is the lowest possible.
Since the queue is full MarlinCore calls idle()
The added number of interrupts of higher priority ( 4x software serial steppers ) combined with the ui.update() and host_keep_alive delay, significantly delay the return of this method,
In that time the stepper (consumer) has used that time to push all the moves out of the ring buffer resulting in HEAD=TAIL
The producer which is the planner, returns back to life after being frozen during the idle update and finds a buffer which is empty, and tries to populate the first block. He is considerably slower though, and constantly interrupted by the stepper ISRs
If the producer manages to move the head and then gets interrupted by block_phase_isr, both sides will be touching the ring buffer, or even worse he can get interrupted when performing memset operations during block_sync.
From this moment on the system is in undefined behavior territory. Two sides race for a common entry on the buffer, but the stepper isr's fire way more often and can preempt the planner since they are called in Handler Mode.

Also there is no telling is that interferes with any of the other interrupts in the system, blocking the idle() of core and effectively killing the thread logic.

This explains more of the issues users has been experiencing as well all the steps to reproduce it.

Enabling a display and host_keepalive causes a freeze ( Adding two non deterministic delays to Idle )
Disabling a display and printing through octoprint with host_keepalive causes a freeze ( Adding two interrupt and a delay to Idle)
Disabling display and printing from SD card with or without host_keeplive causes no freeze ( only adding a slight delay, but disabling the screen reduces overhead significantly. probably making the return faster )
Adding a 100ms delay in the Idle() logic of marlin core causes the freeze
Enabling LA and printing models with small curves causes the freeze more often ( basically playing the ring buffer like a harp )

lexter12 commented 3 years ago

@minosg I have a new SKR mini V2 with the latest bugfix, can't print more than 5 minutes without hitting the freeze (about 5 layers of the XYZ calibration cube). it's VERY consistent, so if you need me to do testing also I can as it's very quick to figure out if it's working or not...

minosg commented 3 years ago

@lexter12 Thanks for offering. I have been looking into it for over a month now, because this issue is pretty much guaranteed to occur with anyone running TMC steppers in UART. It does not matter if your board is brand new or older. I do not have a btt mini v2 but I doubted the new UART chain-loading design they introduced would fix a software issue.

I have a a new theory but before that could you elaborate on your setup? What LCD controller are you using? Are you printing using octoprint? And do you happen to own a st-link for debugging?

taragor commented 3 years ago

@minosg My st-link just arrived yesterday. Hit me up if you need some testing with the skr mini v2.

lexter12 commented 3 years ago

@minosg I just replaced the board on my Anet A3 with an SKR mini E3 V2, and a TFT32 V3, connected with CR10_STOCKDSPLAY and serial. I also have an SKR Turbo 1.4 with TMC2208 in UART mode on a different printer which causes no issues (With a TFT35 E3 V3 connected as REPRAP_DISCOUNT_FULL_GRAPHIC_SMART_CONTROLLER using EXP1 & 2) The only differences between these printers (Aside from the CPU) are the PSUs (the mini is connected to 12V while the Turbo to 24V) and the accelerations and speeds on the turbo are configured much lower. I don't have an st-link... Thanks.

minosg commented 3 years ago

@taragor can you set up your dev environment ready? You need to keep the code size under 256K in size or debug won't start. Disable progmem heavy features for this. One you are set up start a debug session and the print and copy the call stack.

If you see it frozen in a chain of get_next_block - > uart_isr I may have good news for you

minosg commented 3 years ago

@taragor To help you further with that all you need to do is add the following two lines on the platformio.ini

[env:STM32F103RC_btt_512K]
platform          = ${common_stm32f1.platform}
extends           = env:STM32F103RC_btt
board_upload.maximum_size=524288
build_flags       = ${env:STM32F103RC_btt.build_flags} -DSTM32_FLASH_SIZE=512
monitor_speed     = 115200
debug_tool        = stlink
upload_protocol   = stlink

or 
[env:STM32F103RC_btt]
platform          = ${common_stm32f1.platform}
extends           = env:STM32F103RC
extra_scripts     = buildroot/share/PlatformIO/scripts/STM32F103RC_SKR_MINI.py
build_flags       = ${common_stm32f1.build_flags}
  -DDEBUG_LEVEL=0 -DSS_TIMER=4
monitor_speed     = 115200
debug_tool        = stlink
upload_protocol   = stlink

And then set it as default environment

[platformio]
src_dir      = Marlin
boards_dir   = buildroot/share/PlatformIO/boards
default_envs = STM32F103RC_btt_512K
# or
default_envs = STM32F103RC_btt

taragor commented 3 years ago

@minosg Thanks, I'll be home soon and will start testing ASAP

taragor commented 3 years ago

@minosg I've tried starting a debug session. However I get the following error message: hla_swd Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD Info : tcl server disabled Info : telnet server disabled Info : clock speed 1000 kHz Info : STLINK V2J17S4 (API v2) VID:PID 0483:3748 Info : Target voltage: 3.217289 Error: init mode failed (unable to connect to the target)

.pioinit:13: Error in sourced command file: Remote communication error. Target disconnected.: Success.

minosg commented 3 years ago

This is a separate platformio bug which cannot handle the size of the binary. You need to keep the flash size bellow a certain threshold

Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
RAM:   [===       ]  25.8% (used 12704 bytes from 49152 bytes)
Flash: [=====     ]  45.1% (used 236620 bytes from 524288 bytes)

In my example this is 236620. I believe everything bellow 240000 will work. Disabling m503, advanced pause and Marlin and TMC debug in your configuration files will help

taragor commented 3 years ago

I've disabled ARC support and am below you capacity Advanced Memory Usage is available via "PlatformIO Home > Project Inspect" RAM: [==== ] 36.0% (used 17688 bytes from 49152 bytes) Flash: [==== ] 43.7% (used 229332 bytes from 524288 bytes)

minosg commented 3 years ago

Can you verify that you can connect to the chip using the St link utility? You may need drivers or have a connection issue

taragor commented 3 years ago

The St link utility won't connect either, might be due to the fact that I'm using a 5m usb extension from my workstation to the printer. I'll set up a notebook and test again

taragor commented 3 years ago

After further testing with a laptop I still can't connect, neiter using PlatformIO nor ST-Link uility. I'm guessing that that my ST link is broken, since I've got the following behaviour: -The printer resets when I plug the STLink into USB -The blue led on the STLink turns off as soon as I try to connect via ST-Link Utility. -When I connect the STLink to the mainboard with USB connected (connection order: RST, SWCLK, GND, SWDIO) the board starts to boot as soon as I connect SWDIO

I'll just order another STLink except someone can confirm that this is normal behavior

minosg commented 3 years ago

@taragor I would not order a new device, I strongly believe there will be no need for that. Those things wont work when you want them to under pressure. It is most likely a wrong connection in the SWD connector order or no power.

Instead you can be the first to try what I believe is an active workaround for this issue and spend your time testing if that fixes your issue. I will explain the logic later today

Locate this file C:\Users\YOURUSERNAME.platformio\packages\framework-arduinoststm32-maple\STM32F1\system\libmaple\usart_private.h

And change the hardware uart callback to the following:

static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */
    volatile uint8_t moufa;
    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_ORE)) {
        moufa = (uint8)regs->DR;
    }
}

Effectively I want you to add a volatile variable, with any name that you like and add an if statement in the end which uses that random variable.

Please report if that fixed your issues or if you got even a single freeze

lexter12 commented 3 years ago

@minosg , I can confirm after countless xyz cube failures (can't get past layer 5-10) that changing nothing but connecting with repetier host and printing from SD (with screen still connected and no changes in configuration) finishes successfully. I will try the code change you shared above and try to print from the screen again (I have not yet had a single successful print before the test with repetier) if it succeeds, it definitely is the right direction. Will test and get back soon. Thanks.

taragor commented 3 years ago

I'll be sure to try that, but I can't see what you are achieving with that. The only thing I can think of is that the additional operation flushes a pipeline or might be changing something in a speculative execution unit. Do you think the issue is related to that? Yet I might be misinterpreting your Intention here, since I'm not too familiar with M3 architecture

Edit: after reading the explanation it makes sense

minosg commented 3 years ago

Theory 5 and Workaround ( Eureca )

As it turns out, the issue appears to be a combination of Theories 1, 2 and 4.

It is triggered by nested interrupts, it is caused by incorrect handling of the USART IRQ register, and is expressing the failure by causing a planner queue starvation.

As discussed before the planner.cpp is populating the Queue of the movements for the printer in Thread mode, which is lower priority than the Handler mode used by the stepper, in order to consume the blocks. The issue however is caused by a critical section in USART handler in libmaple, which expects to be completed in a short deterministic time, without another interrupt firing in the meantime.

How it breaks

When the planner calls idle(), and the LCD attempts to communicate with the board, or when the host-keepalive message is to be transmitted, the hardware UART in both of those interfaces freezes the thread mode code, and jumps to handler mode.
In the handler mode there is small delay between clearing the interrupt flag, accessing the SR register, and reading the value in the DR ( Data register ).
If at this precise moment, any other interrupt fires ( Stepper, Systick, Linear advance, block, phase ) because they have higher priorities, they will pause the USART handler and do their stuff.
If when the USART handler is frozen and the display tries to send something to the STM32, it will assert the Overrun Flag, since two bytes have been received, but not read from the DR.
Asserting the ORE bit means the USART will fire instantaneously, right after it completes, and the process will repeat, since the channel has fallen out of sync, never returning to thread mode.
Several moves later, the stepper interrupts will have consumed the queue and the printer will freeze, waiting for a frozen planner to clock more blocks in. Because watchdog is disabled #18226 it will remain in printing mode with heaters on

Pointing out the critical section in the handler ore

Page 819 of STM document RM0008 rev 15 referenced by libmaple

ORE: Overrun error
This bit is set by hardware when the word currently being received in the shift register is
ready to be transferred into the RDR register while RXNE=1. An interrupt is generated if
RXNEIE=1 in the USART_CR1 register. It is cleared by a software sequence (an read to the
USART_SR register followed by a read to the USART_DR register).
0: No Overrun error
1: Overrun error is detected
Note: When this bit is set, the RDR register content will not be lost but the shift register will be
overwritten. An interrupt is generated on ORE flag in case of Multi Buffer
communication if the EIE bit is set.

The workaround The idea behind the workaround is primitive, and be warned it is not aimed to be a proper fix. We assume that when a UART overruns, that the BYTE is lost! and clear the interrupt flag. Using a random volatile variable, ensures that the compiler will not optimize the logic out.

Please refer to the previous post for the workaround snippet

A proper fix This is not directly a Marlin bug, and should be fixed by the Libmaple developers. Marlin's logic can be improved to make it more resilient to such issues.

Planner should not be directly calling IDLE in the logic which populates the Queue. Even better Planner()::get_next_block() could be full asynchronous.
The queue producer should be running in higher exception level than the LCD but lower than thermal and stepper. That way even bad EXUI implementations would not block the critical task which is to orchestrate the print.
The stepper and pulse Interrupt frequencies can be adjusted so they are less likely or completely unlikely to clash with common baud rates for displays

All of the above are major tasks, and cannot be done overnight. In the meantime patching the library will allow users to use their brand new boards.

lexter12 commented 3 years ago

@minosg First successful print! Nothing changed except your fix in the libmaple which seems to have solved the problem! I will now continue printing longer prints to make sure it doesn't happen over long periods, but seems your workaround is confirmed - until libmaple is fixed, this workaround should let people use their new boards as you said. Thanks!

minosg commented 3 years ago

@Lexer I have been printing geckos since last night. I have enabled aggressive acceleration, LA, connected a malyanlcd which has the higher baud rate, and added code to spam the serial over usb for stressing. I'm currently at 12 successful prints and no failures.

What we need now is more users testing it with different configurations and someone using octoprint to verify that it's a proper workaround.

taragor commented 3 years ago

I've just finished the first Gecko with high acceleration. Worked flawlessly. I'll test more tomorrow

lexter12 commented 3 years ago

@minosg I have octoprint just haven't been using it since I just replaced motherboards to the SKR mini. I'll connect it and give it a try - anything special I need to be testing?

minosg commented 3 years ago

Printing with octoprint and host keep alive should cause a freeze. If not we are in the good path

MarlinFirmware / Marlin

Nested interrupts corrupt LR ( AKA printer freezes mid print with heaters on ) #18358

Bug Description

18117

18315

17161

18356

18226

My Configurations

Steps to Reproduce

Additional Information

Workaround/Patches

Theory 3:

What happens when the printer freezes.

Further notes

Theory 5 and Workaround ( Eureca )