Closed minosg closed 3 years ago
In addition to that the ExtUI::getProgress_percent()
appears to be invoking MarlinUI::progress_t MarlinUI::_get_progress()
which when printing from SD will be calling this slow method
#if ENABLED(SDSUPPORT)
?: TERN(HAS_PRINT_PROGRESS_PERMYRIAD, card.permyriadDone(), card.percentDone())
#endif
I could be wrong since I am not familiar with UI but it appears as if the code is invoking an SD access method which is relatively slow, and hoping that the TMC stepper interrupt will not trigger in between?
@minosg still an issue?
Yes that's still an issue.
The core appears to be dying by recursive interrupt uart calls. It's not something that can be fixed by a config change.
I am using this ticket to provide more technical info of the issue, or we will keep on seeing random my printer freezes bugs.
I will be tracing that behaviour next week. If there is a limited time a ticket is allowed to be open feel free to close it and we can continue to chat on it anyway. Thanks
Please test the bugfix-2.0.x branch to see where it stands.
@boelle I have tested with latest bugfix and the issue is still here.
I have been testing this board using a debugger. The original theory that the LR Is corrupted seems less likely and I believe it is more of an interrupt firing issue.
I believe it was introduced by this pr https://github.com/MarlinFirmware/Marlin/pull/14030 , but need deeper understanding of the stepper class.
A freeze will become evident when obviously the printer is no longer moving. The logic is NOT stuck in exception mode, and the interrupt will be returning to the main thread. The code snippet that is of interest is this part
https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/planner.h
FORCE_INLINE static uint8_t moves_free() { return BLOCK_BUFFER_SIZE -1 - movesplanned(); }
FORCE_INLINE static uint8_t movesplanned() { return BLOCK_MOD(block_buffer_head - block_buffer_tail); }
FORCE_INLINE static block_t* get_next_free_block(uint8_t &next_buffer_head, const uint8_t count=1) {
// Wait until there are enough slots free
while (moves_free() < count) { idle(); }
// Return the first available block
next_buffer_head = next_block_index(block_buffer_head);
return &block_buffer[block_buffer_head];
}
Which causes the UI update loop to trigger, and the lcd uart to keep on firing. The problem appears to be that this while loop is not breaking .
Chasing the logic the block_tail is only updated in the following method loop which appears is no longer firing
Stepper::pulse_phase_isr()-> discard_current_block -> release_current_block Stepper::block_phase_isr()-> discard_current_block -> release_current_block
The interrupt configuration on this platform is the following
TMC_STepper SoftwareSerial Timer4
STEP_TIMER_NUM Timer5
PULSE_TIMER_NUM Timer5
TEMP_TIMER_NUM Timer2
SERVO0_TIMER_NUM Timer8
NVIC_SetPriorityGrouping(0x3);
#define STEP_TIMER_IRQ_PRIO 1
#define TEMP_TIMER_IRQ_PRIO 2
Which appear to be enabled during the time of the freeze
NVIC-ISER0 0x501001c0
NVIC-ISER1 0x00044160
And the interrupt priorities appear to be in order
IPR7_N0 -> 0x20
IPR12_N2-> 0x10
IPR9_N1 -> 0xd0
I have also checked timer5 and the CC register as well as the count register and the timer appears to be counting as expected.
This platform appears to not be using instruction and memory barriers for interrupts. I do not know if there is a design choice behind that, but the pull request I quoted is enabling and disabling timers using CPSID, CPSIE instructions, with no guarantees that they are actually in the state that we want them to.
I am not certain on the priority grouping setting. Considering that I cannot see any use of sub-priorities in Marlin Core, are they being used in Mapple or should it be set to 0x4 anyway? Is that code a snippet of when we were using Autopreload with the timers instead of resetting them?
When the board freezes most interrupts but the lcd and the temperature stop firing. This is also very bizarre behavior which I am trying to track down. This is also a very happy coincidence, since if the temperature interrupts dies the hotend wil cause meltdown/fire so its a considerable safety risk. I have disabled interrupt grouping and set the uasart1 ( lcd) interrupt to have the lowest priority (15) and yet it keeps on firing, which indicates there is no interrupt preemption or starvation happening.
What I suspect happens is because of too many interrupts firing , ( TMC set in UART modes individually and not chained in a single line ) something in the stepper.cpp logic mini scheduler is falling out of its timing bounds. The problem is that when the issue happens is way before it becomes evident in the logic. so a longer trace is needed which is not supported by st-link.
To be continued.
I think the issue is not STM32 only as I have a LPC1768 and seing similar symptoms since I added a BTT TFT 35 with Marlin mode LCD to my MKS SGEN L board with TMC2209s in UART mode.
My setup is as follows:
uart5
on PI to uart3
on TFT)RS232
port on the TFT to AUX1
port on SGEN LEXP3-CR10
port on the TFT to EXP1
port on the SGEN LThe TFT proxies the commands from the PI to the board. I have slightly modified the BTT firmware to parse the M117 commands and display them in TFT mode.
What I have discovered so far in my tests (granted not yet exhaustive):
When the crash occurs, the serial port on the board becomes unresponsive but if I use the TFT's disconnect/reconnect option it resumes printing
That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts.
But the fact that even slower lcd's cause the issue, could also verify the hypothesis that the added logic on the loop, is what is pushing a timing sensitive task outside of the acceptable boundaries. Compiling out and CLD will significantly decrease the ui.update() loop's time.
Still cannot undestand why acceleration affects it. Are you using Linear advance? High acceleration will triggerthe la_isr() which is right in the heart of the stepper.cpp scheduler
That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts.
I think this is where the disconnect happens in BTT firmware https://github.com/guruathwal/BIGTREETECH-TouchScreenFirmware/blob/c189809e95c627d1680ef81533b32e8fac56514b/TFT/src/User/Menu/SettingsMenu.c#L53
Still cannot undestand why acceleration affects it. Are you using Linear advance? High acceleration will triggerthe la_isr() which is right in the heart of the stepper.cpp scheduler
LA yes, set to 0.1 (I thought from your tests it did not really matter), acceleration is set to 1000 and speed to 90mm/s.
Starting a second 3h print without LCD support.
I have a very similar setup: SKR mini E3 V2, BTT TFT35. I had frequent crashes with #define CR10_STOCKDISPLAY
enabled. From my testing it doesn't matter whether the stock Ender-3 or the TFT35 is connected.
I've also had crashes with #define CR10_STOCKDISPLAY
disabled, TFT35 connected (running in BTT mode) to the TFT header (SERIAL_PORT 2
), and octopi connected via USB (SERIAL_PROT_2 -1
).
However when using only the TFT35 in BTT mode, with no octoprint I didn't have a single crash no matter how aggressive I set LA, jerk, Accelerations. I've printed 20+ hours with this setup and no issues so far.
So I guess it is somewhat related to the number of serial connections aswell as the kind (stepper, display, or printhost)
I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that? Also I'm still waiting for my STLink to get into some proper debugging.
I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that?
No special configuration just use guruathwal fork as it fixes a number of issues with uart forwarding. The only issue that still stands is the fact that because the lines sent by Octoprint start with line numbers (Nxxx) the TFT does not parse the information from them so for instance M117
texts are not shown on the display, fan speed does not update etc. I am working on a patch for this (seems to be working fine so far so I will publish this week).
Second print without LCD support froze after 2h :( I was able to resume using the disconnect method.
@kind3r I'm not sure your issue is related to this one, as the board does not freeze (being an LPC176x the watchdog would catch that if it did so there would be no risk of the heaters being left on),
From your description you are not using the normal usb cdc serial port (which is a 10Mbit, lossless, flow controlled connection), but bridging standard uart through a tft display board, it almost seems like the gcode commands to Marlin are just stopping at some point until you reset the tfts serial connection?
do you have any way to check that there is data getting to LP176x uart while in the frozen state?
@p3p THe watchdog is what made this issue visible and this is why it had its own ticket, but inherently we are talking about the same Marlin core cause.
The key element to test for this issue is the following If your printer crashes ( watchdog reset/ freeze ) when printing with high acceleration, and having an LCD compiled.
M201 X9000 Y9000 Z500 E5000 ; sets maximum accelerations, mm/sec^2 M203 X500 Y500 Z12 E80 ; sets maximum feedrates, mm/sec M204 P1000 R1000 T1000 ; sets acceleration (P, T) and retract acceleration (R), mm/sec^2 M205 X10.00 Y10.00 Z0.20 E2.00 ; sets the jerk limits, mm/sec M205 S0 T0 ; sets the minimum extruding and travel feed rate, mm/sec
But does never freeze if there is display compiled, and you are printing from SD using M21, M24 then you are seeing the same issue.
As discussed above I think that the the planner stops discarding blocks, and the firmware freezes waiting for new free blocks. When that happens some interrupts can fire ( temperature, LCD ) while others don't.
It also hapened while printing from the PI connected via USB (in which case I unplugged the USB cable and reconnected it but it was still frozen). I still need to run more tests to make it repeatable, but it seems similar (something uart related in either case). I'm not sure that heaters are still on as I imediatly reacted every time, but next time it happens I will leave it frozen for longer to check if the heaters are really on or not.
@kind3r Can you please try to print using SD directly and no usb host? Pronterface has a nice UI element ( SD ) to help with that. If you have a freeze, please let me know.
@minosg Could you clarify a bit the conditions ?
If the issue is happening with the USB CDC connection and the hardware UART along with both STM32 and LPC then I guess I can rule out it being a low level issue (in my LPC Arduino framework) that's getting one of those into a bad state, they are very different peripherals,
@kind3r I'm not sure if your board manufacturer has messed with the default bootloader or not but LPC176x should never boot loop, so if the watchdog triggers the board will be stalled in a safe state until hardware reset, in theory that rules out an actual mcu crash at least.
@kind3r
@p3p For the STM case, I can confirm it happens regardless if you have a usb connected or not. By connecting the serial, or any other action which that delays the logic it causes it to happen way more often. With the ST-Link attached it happens in the first 10 minutes, while normally it takes a couple of prints to trigger.
Update on the investigation on this issue. I run a tracing session today, and the following race condition was observed.
Looking into the stepper.cpp logic in particular https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/stepper.cpp
When the freeze happens, the interrupts and the timers are enabled and running/firing at the appropriate rate.
The block_phase_isr() called the finalize state right before the printer froze
.....
/////
// If current block is finished, reset pointer and finalize state
if (step_events_completed >= step_event_count) {
discard_current_block();
}
discard_current_block() will set the pointer to null
FORCE_INLINE static void discard_current_block() {
current_block = nullptr;
axis_did_move = 0;
planner.release_current_block();
}
Then the pulse_base_isr will fire next time and exit
void Stepper::pulse_phase_isr() {
......
......
// If there is no current block, do nothing
if (!current_block) {
return;
}
The next time the block_phase_isr() is called it will attempt to retrieve a new block and exit
uint32_t Stepper::block_phase_isr() {
....
if (!current_block) {
// Anything in the buffer?
if ((current_block = planner.get_current_block())) {
}
return;
}
Planner uses this logic
block_t* Planner::get_current_block() {
// Get the number of moves in the planner queue so far
const uint8_t nr_moves = movesplanned();
// If there are any moves queued ...
if (nr_moves) {
}
return nullptr;
Moves planned will always return 0 since the head and tail of Planner:block_buffer are the same
FORCE_INLINE static uint8_t movesplanned() { return BLOCK_MOD(block_buffer_head - block_buffer_tail); }
/* Trace of the public values of block buffer */
block_buffer:16
block_buffer_head:15
block_buffer_nonbusy: 15
block_buffer_planner: 15
block_buffer_tail: 15
cleaning_buffer_counter:0
This deadlocks the stepper logic.
I would assume that step_events_completed >= step_event_count is a normal condition, and should happen in the end of every loop. The question is why it never recovers form it.
@minosg I ran the test like you said and it worked fine (from SD Card, all serials disconnected after starting the print).
I also ran another test from Octoprint via USB and it also worked fine. I think @p3p is right and it seems that my issue is related to the BTT TFT uart handling and not Marlin's since printing resumes after disconnecting and reconnecting from the TFT and I don't think Marlin restarts, it's just the TFT that stops communication.
A minor update
I have good news and bad news.
The good news. I have identified an actual bug in the planner, which is making this issue worse. The bad news, fixing this bug does not resolve the deadlock situation, just decreases it's frequency.
Planner is slicing moves as a series of micro-moves made to fit in a ring buffer ( Default size of 16 block ) .Ring buffers and interrupts are a nasty combination since a double increment when moving the head/tail can make an empty buffer appear full or the other way around.
From the code
/**
* The move buffer, calculated in stepper steps
*
* block_buffer is a ring buffer...
*
* head,tail : indexes for write,read
* head==tail : the buffer is empty
* head!=tail : blocks are in the buffer
* head==(tail-1)%size : the buffer is full
The good news is when this was designed a logic was included to make sure that things are delayed a bit when the buffer is working at its boundaries ( ie 15 out of 16 blocks full ). This is done by delay_before_delivering
In the beginning of each move this logic is triggered which is aimed at giving 100ms to the stepper driver to move the tail before the new line starts
// If this is the first added movement, reload the delay, otherwise, cancel it.
if (block_buffer_head == block_buffer_tail) {
// If it was the first queued block, restart the 1st block delivery delay, to
// give the planner an opportunity to queue more movements and plan them
// As there are no queued movements, the Stepper ISR will not touch this
// variable, so there is no risk setting this here (but it MUST be done
// before the following line!!)
delay_before_delivering = BLOCK_DELAY_FOR_1ST_MOVE;
}
The problem is that the stepper will call get_current_block which is implemented as such
block_t* Planner::get_current_block() {
// Get the number of moves in the planner queue so far
const uint8_t nr_moves = movesplanned();
// If there are any moves queued ...
if (nr_moves) {
// If there is still delay of delivery of blocks running, decrement it
if (delay_before_delivering) {
--delay_before_delivering;
// If the number of movements queued is less than 3, and there is still time
// to wait, do not deliver anything
if (nr_moves < 3 && delay_before_delivering) return nullptr;
delay_before_delivering = 0;
}
The latter is equivalent to
if (delay_before_delivering) {
if (nr_moves >= 3) {
delay_before_delivering = 0;
} else {
return nullptr;
}
}
Which is effectively cancelling the delay and wasting a cycle of stepper logic if it returns null pointer
I believe it was aimed to be written as
if (delay_before_delivering) {
--delay_before_delivering;
// If the number of movements queued is less than 3, and there is still time
// to wait, do not deliver anything
if (nr_moves < 3 && delay_before_delivering) {
delay_before_delivering = 0;
return nullptr;
}
}
It is advised that you apply this change to your code. It will not stop the printer freezing deadlock but will make it more smooth.
It is advised that you apply this change to your code. It will not stop the printer freezing deadlock but will make it more smooth.
We’re accepting PRs, so please feel free to contribute your changes.
This is a minor change, and hardly a critical one. Weather it gets contributed and attributed to me or anyone who reads that comments and adds it to their already pending PR is of little importance compared to fixing the core issue.
I currently am deep down the rabbit hole of investigating this issue which effectively does not allow you to use drivers in uart mode alongside a display or octorpint. When and if that is resolved, all side findings can be merged in a larger PR without blocking valuable maintainers` time.
@minosg I think you are misreading the code a bit, the ideea is to have a delay of BLOCK_DELAY_FOR_1ST_MOVE
cycles if there are less than 3 moves available, else, the delay is reset to 0. Your proposed change waits for only one cycle.
I could really use any help on understanding the planner code so thanks for commenting that.
I could not determine intent of this code to begin with. Why should the delay only apply for when there are less than 3 moves planned?
During a normal operation the HEAD > TAIL and moves planned is the SIZE_OFF_BUFF -1 >= 3 so it will default to zero right after first invocation effectively using no delay and proceed to deliver the block.
If the logic intended is:
Then the above code is not doing that What it does is:
From the comments in the code about this delay, it seems to indicate it was added to protect you from the ISR messing the head/tail logic.
// If it was the first queued block, restart the 1st block delivery delay, to
// give the planner an opportunity to queue more movements and plan them
// As there are no queued movements, the Stepper ISR will not touch this
// variable, so there is no risk setting this here (but it MUST be done
// before the following line
Is there a reason why the interrupt isn't disabled while non-interrupt code manipulates the ring buffer?
@thecynic I would assume this is happening for safety reasons. Disabling interrupts can be tricky when you are trying to keep some control over pwm for hotend. And not to mention not uniform across platforms ( legacy AVR controllers do it instantaneously, but in Arm you need to use memory and instruction barriers )
I think I may have a new hypothesis the root cause.
Theory 4: Queue Starvation
Currently this is the priority configuration of interrupts for the design
#define STEP_TIMER_IRQ_PRIO 1
#define TEMP_TIMER_IRQ_PRIO 2
The planner.cpp is writing to the ring buffer by accessing the head, while the stepper is consuming that data by accessing the tail. The buffer is full when head = tail -1
and Empty when head == tail
Both producer and consumer rely on get_next_free_block()
and get_current_block()
which inherently compares the head-tail relationship in order to determine the state of the queue.
But the planner which is the producer of the moves, is running in a lower priority than the stepper which is the consumer.
/**
* Planner::get_next_free_block
*
* - Get the next head indices (passed by reference)
* - Wait for the number of spaces to open up in the planner
* - Return the first head block
*/
FORCE_INLINE static block_t* get_next_free_block(uint8_t &next_buffer_head, const uint8_t count=1) {
// Wait until there are enough slots free
while (moves_free() < count) { idle(); }
Let's assume the following scenario.
Also there is no telling is that interferes with any of the other interrupts in the system, blocking the idle() of core and effectively killing the thread logic.
This explains more of the issues users has been experiencing as well all the steps to reproduce it.
@minosg I have a new SKR mini V2 with the latest bugfix, can't print more than 5 minutes without hitting the freeze (about 5 layers of the XYZ calibration cube). it's VERY consistent, so if you need me to do testing also I can as it's very quick to figure out if it's working or not...
@lexter12 Thanks for offering. I have been looking into it for over a month now, because this issue is pretty much guaranteed to occur with anyone running TMC steppers in UART. It does not matter if your board is brand new or older. I do not have a btt mini v2 but I doubted the new UART chain-loading design they introduced would fix a software issue.
I have a a new theory but before that could you elaborate on your setup? What LCD controller are you using? Are you printing using octoprint? And do you happen to own a st-link for debugging?
@minosg My st-link just arrived yesterday. Hit me up if you need some testing with the skr mini v2.
@minosg I just replaced the board on my Anet A3 with an SKR mini E3 V2, and a TFT32 V3, connected with CR10_STOCKDSPLAY and serial. I also have an SKR Turbo 1.4 with TMC2208 in UART mode on a different printer which causes no issues (With a TFT35 E3 V3 connected as REPRAP_DISCOUNT_FULL_GRAPHIC_SMART_CONTROLLER using EXP1 & 2) The only differences between these printers (Aside from the CPU) are the PSUs (the mini is connected to 12V while the Turbo to 24V) and the accelerations and speeds on the turbo are configured much lower. I don't have an st-link... Thanks.
@taragor can you set up your dev environment ready? You need to keep the code size under 256K in size or debug won't start. Disable progmem heavy features for this. One you are set up start a debug session and the print and copy the call stack.
If you see it frozen in a chain of get_next_block - > uart_isr I may have good news for you
@taragor To help you further with that all you need to do is add the following two lines on the platformio.ini
[env:STM32F103RC_btt_512K]
platform = ${common_stm32f1.platform}
extends = env:STM32F103RC_btt
board_upload.maximum_size=524288
build_flags = ${env:STM32F103RC_btt.build_flags} -DSTM32_FLASH_SIZE=512
monitor_speed = 115200
debug_tool = stlink
upload_protocol = stlink
or
[env:STM32F103RC_btt]
platform = ${common_stm32f1.platform}
extends = env:STM32F103RC
extra_scripts = buildroot/share/PlatformIO/scripts/STM32F103RC_SKR_MINI.py
build_flags = ${common_stm32f1.build_flags}
-DDEBUG_LEVEL=0 -DSS_TIMER=4
monitor_speed = 115200
debug_tool = stlink
upload_protocol = stlink
And then set it as default environment
[platformio]
src_dir = Marlin
boards_dir = buildroot/share/PlatformIO/boards
default_envs = STM32F103RC_btt_512K
# or
default_envs = STM32F103RC_btt
@minosg Thanks, I'll be home soon and will start testing ASAP
@minosg I've tried starting a debug session. However I get the following error message:
hla_swd
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
Info : tcl server disabled
Info : telnet server disabled
Info : clock speed 1000 kHz
Info : STLINK V2J17S4 (API v2) VID:PID 0483:3748
Info : Target voltage: 3.217289
Error: init mode failed (unable to connect to the target)
.pioinit:13: Error in sourced command file:
Remote communication error. Target disconnected.: Success.
This is a separate platformio bug which cannot handle the size of the binary. You need to keep the flash size bellow a certain threshold
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
RAM: [=== ] 25.8% (used 12704 bytes from 49152 bytes)
Flash: [===== ] 45.1% (used 236620 bytes from 524288 bytes)
In my example this is 236620. I believe everything bellow 240000 will work. Disabling m503, advanced pause and Marlin and TMC debug in your configuration files will help
I've disabled ARC support and am below you capacity
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect" RAM: [==== ] 36.0% (used 17688 bytes from 49152 bytes) Flash: [==== ] 43.7% (used 229332 bytes from 524288 bytes)
Can you verify that you can connect to the chip using the St link utility? You may need drivers or have a connection issue
The St link utility won't connect either, might be due to the fact that I'm using a 5m usb extension from my workstation to the printer. I'll set up a notebook and test again
After further testing with a laptop I still can't connect, neiter using PlatformIO nor ST-Link uility. I'm guessing that that my ST link is broken, since I've got the following behaviour: -The printer resets when I plug the STLink into USB -The blue led on the STLink turns off as soon as I try to connect via ST-Link Utility. -When I connect the STLink to the mainboard with USB connected (connection order: RST, SWCLK, GND, SWDIO) the board starts to boot as soon as I connect SWDIO
I'll just order another STLink except someone can confirm that this is normal behavior
@taragor I would not order a new device, I strongly believe there will be no need for that. Those things wont work when you want them to under pressure. It is most likely a wrong connection in the SWD connector order or no power.
Instead you can be the first to try what I believe is an active workaround for this issue and spend your time testing if that fixes your issue. I will explain the logic later today
Locate this file C:\Users\YOURUSERNAME.platformio\packages\framework-arduinoststm32-maple\STM32F1\system\libmaple\usart_private.h
And change the hardware uart callback to the following:
static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
/* Handling RXNEIE and TXEIE interrupts.
* RXNE signifies availability of a byte in DR.
*
* See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
* We enable RXNEIE. */
volatile uint8_t moufa;
if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
/* If the buffer is full and the user defines USART_SAFE_INSERT,
* ignore new bytes. */
rb_safe_insert(rb, (uint8)regs->DR);
#else
/* By default, push bytes around in the ring buffer. */
rb_push_insert(rb, (uint8)regs->DR);
#endif
}
/* TXE signifies readiness to send a byte to DR. */
if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
if (!rb_is_empty(wb))
regs->DR=rb_remove(wb);
else
regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
}
if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_ORE)) {
moufa = (uint8)regs->DR;
}
}
Effectively I want you to add a volatile variable, with any name that you like and add an if statement in the end which uses that random variable.
Please report if that fixed your issues or if you got even a single freeze
@minosg , I can confirm after countless xyz cube failures (can't get past layer 5-10) that changing nothing but connecting with repetier host and printing from SD (with screen still connected and no changes in configuration) finishes successfully. I will try the code change you shared above and try to print from the screen again (I have not yet had a single successful print before the test with repetier) if it succeeds, it definitely is the right direction. Will test and get back soon. Thanks.
I'll be sure to try that, but I can't see what you are achieving with that. The only thing I can think of is that the additional operation flushes a pipeline or might be changing something in a speculative execution unit. Do you think the issue is related to that? Yet I might be misinterpreting your Intention here, since I'm not too familiar with M3 architecture
Edit: after reading the explanation it makes sense
As it turns out, the issue appears to be a combination of Theories 1, 2 and 4.
It is triggered by nested interrupts, it is caused by incorrect handling of the USART IRQ register, and is expressing the failure by causing a planner queue starvation.
As discussed before the planner.cpp is populating the Queue of the movements for the printer in Thread mode, which is lower priority than the Handler mode used by the stepper, in order to consume the blocks. The issue however is caused by a critical section in USART handler in libmaple, which expects to be completed in a short deterministic time, without another interrupt firing in the meantime.
How it breaks
idle()
, and the LCD attempts to communicate with the board, or when the host-keepalive message is to be transmitted, the hardware UART in both of those interfaces freezes the thread mode code, and jumps to handler mode. Pointing out the critical section in the handler
Page 819 of STM document RM0008 rev 15 referenced by libmaple
ORE: Overrun error
This bit is set by hardware when the word currently being received in the shift register is
ready to be transferred into the RDR register while RXNE=1. An interrupt is generated if
RXNEIE=1 in the USART_CR1 register. It is cleared by a software sequence (an read to the
USART_SR register followed by a read to the USART_DR register).
0: No Overrun error
1: Overrun error is detected
Note: When this bit is set, the RDR register content will not be lost but the shift register will be
overwritten. An interrupt is generated on ORE flag in case of Multi Buffer
communication if the EIE bit is set.
The workaround The idea behind the workaround is primitive, and be warned it is not aimed to be a proper fix. We assume that when a UART overruns, that the BYTE is lost! and clear the interrupt flag. Using a random volatile variable, ensures that the compiler will not optimize the logic out.
Please refer to the previous post for the workaround snippet
A proper fix This is not directly a Marlin bug, and should be fixed by the Libmaple developers. Marlin's logic can be improved to make it more resilient to such issues.
All of the above are major tasks, and cannot be done overnight. In the meantime patching the library will allow users to use their brand new boards.
@minosg First successful print! Nothing changed except your fix in the libmaple which seems to have solved the problem! I will now continue printing longer prints to make sure it doesn't happen over long periods, but seems your workaround is confirmed - until libmaple is fixed, this workaround should let people use their new boards as you said. Thanks!
@Lexer I have been printing geckos since last night. I have enabled aggressive acceleration, LA, connected a malyanlcd which has the higher baud rate, and added code to spam the serial over usb for stressing. I'm currently at 12 successful prints and no failures.
What we need now is more users testing it with different configurations and someone using octoprint to verify that it's a proper workaround.
I've just finished the first Gecko with high acceleration. Worked flawlessly. I'll test more tomorrow
@minosg I have octoprint just haven't been using it since I just replaced motherboards to the SKR mini. I'll connect it and give it a try - anything special I need to be testing?
Printing with octoprint and host keep alive should cause a freeze. If not we are in the good path
Bug Description
This ticket is aimed to report a correlation on a common but hard to reproduce issue across a variety of unrelated users and configurations. I have been following older tickets and running experiments on a spare machine of mine using the BTT_SKR_MINI board 1.2 and Malyan_LCD
18117
18315
17161
Possibly related
18356
On STM32F1 platforms the watchdog is disabled which explains why it freezes in that mode, but I do no think the issue is limited to them
18226
My Configurations
configurations.zip
Steps to Reproduce
change the baud rate increase it to its maximum
Expected behavior:
I expect all the prints to complete
Actual behavior: At random occurrences the printer freezes with heaters on, and if watchdog is enabled it just resets
Additional Information
As discussed before #18177 , I have confirmed that most of the previous reported workarounds, while they make the issue less frequent, will not fix it. I have confirmed that it freezes when
The only common occurrence across all tickets submitted with related behavior in the last 6 months are:
I have not been able to locate a single report of it happening with steppers on standalone legacy mode.
What I believe is the actually cause of this issue: Theory 1 An interrupt kicks in when already in exception mode. Lr 0xfffffff9 indicates a EXC_RETURN state. When trying to recover the original LR it has been corrupted and the code lands in an undefined space
Theory 2: It could also be possible that the UartInterrupt flag for the screen uart is not properly cleared and the system locks in this loop of libmaple's usart_private.h
Consecutive triggering of the interrupt could cause stack corruption, resulting in a improperly formatted LR register
Last debug session call stack when the incident occurred
Registers state
Workaround/Patches
The issue has been correctly identified and can be mitigated by patching your local maple libary, usually residing inside the platform folder
In windows the location of the file which is needed to be edited is:
C:\Users\YOURUSERNAME.platformio\packages\framework-arduinoststm32-maple\STM32F1\system\libmaple\usart_private.h
In MacOS open a terminal and use the following command to find your platformio folder
open ~/.platformio
In Linux it should be in the same location
cd~/.platformio
The callback for usart in
usart_private.h
needs to be replaced by either patchPATCH 1
PATCH 2
How to choose: