MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0
15.97k stars 19.09k forks source link

[BUG] Random freezes during printing over serial connection (skr 1.4 turbo) #21010

Open m-kozlowski opened 3 years ago

m-kozlowski commented 3 years ago

Bug Description

I'm experiencing radom freezes when printing with Octoprint over serial connection. Using SKR 1.4 turbo, RPI4, directly connected to TFT serial interface (SERIAL_PORT 0). Running bugfix-2x 004bed8a7fc3ff9feb73a0ea9794635b50073c27 I am not able to identify any specific circumstances of the error - sometimes it happens on first layer, and sometimes after few hours of printing the same gcode. Print head stops, screen (ulti controler) stops refreshing, communication freezes (there is "timeout" message in octoprint terminal window).

I'm suspecting it's related to recent serial changes, because:

Configuration Files

Configuration.zip

rhapsodyv commented 3 years ago

Can you test using another cable and another computer?

Skr uses usb serial, that is a lot stable than normal hardware serial. In the past, some users reported something similar and the issue was in the host.

m-kozlowski commented 3 years ago

Sure, will try. But it will be another rpi4 as I don't have anything else with hardware serial at hand.

However I'm not convinced if it's host to blame. It has printed hundreds of hours in this setup. And right now it's finishing ~18 hour print running downgraded Marlin.

rhapsodyv commented 3 years ago

@X-Ryl669 can it be related with #20783 and #20932 ?

rhapsodyv commented 3 years ago

Which version of marlin is working fine for you?

X-Ryl669 commented 3 years ago

What happen when the serial is freezing ? Do you have emergency parser enabled (previously) ? If yes, can you send some emergency commands to see if it react ?

Can you post the last serial output log so we can see what's the last message that worked ?

X-Ryl669 commented 3 years ago

I've read your configuration file.

I'm trying to validate some hypothesis here:

  1. Checking if the printer is crashed when it happens
  2. Checking if the g-code parser is waiting for more data and is struck
  3. Checking if my changes have changed the behavior of the printer

For 1:

For 2:

For 3:

m-kozlowski commented 3 years ago

Applied changes as described above and started ~4 hours print with serial communication logging enabled. Will report back.

In the meantime: I ran few short prints overnight, and was unable to trigger issue. Both on new-rpi4 and old-rpi4. The only thing that changed is that i plugged USB cable in, while still printing on hardware serial. Not much to base any conclusions, but it might be some grounding issue (?). Gonna investigate further.

m-kozlowski commented 3 years ago

Today I haven't been able to repeat the problem on any of the two raspberry pi's and any version of Marlin. I'll leave MARLIN_DEV_MODE and EMERGENCY_PARSER enabled and reopen the ticket if I capture anything. Meanwhile, I am closing the ticket, recognizing that it could be a wiring problem.

X-Ryl669 commented 3 years ago

Or it's a bug that's disappearing when EMERGENCY_PARSER is enabled... Feel free to reopen if it re-appear once you disable emergency parser.

m-kozlowski commented 3 years ago

Happened again. Bad news is I was running 9d24ee8daf0773ca1870a5c156a253458cc2ee1b without "post mortem" PR. Better news is that EMERGENCY_PARSER and MARLIN_DEV_MODE was on.

Printer does not react to M108. Last lines of serial communication:

Send: N119024 M117 98% L=225/240*98
Recv: //action:notification 98% L=225/240
Recv: ok N119024 P0 B3
Send: N119025 G1 F3840.000*97
Recv: ok N119025 P1 B3
Send: N119026 G1 X109.294 Y110.515 E-0.19172*120
Recv: ok N119026 P0 B3
Send: N119027 G1 X109.518 Y110.953 E-0.18245*114
Recv: ok N119027 P0 B3
Send: N119028 G1 X109.663 Y111.429 E-0.18478*123
Recv: ok N119028 P0 B3
Send: N119029 G1 X109.706 Y111.711 E-0.10580*126
Recv: ok N119029 P0 B3
Send: N119030 G1 X109.722 Y112.000 E-0.10737*122
Recv: ok N119030 P0 B3
Send: N119031 G1 X109.706 Y112.289 E-0.10737*126
Recv: ok N119031 P0 B3
Send: N119032 G1 X109.663 Y112.571 E-0.10580*113
Recv: ok N119032 P0 B3
Send: N119033 G1 X109.518 Y113.047 E-0.18478*112
Recv: ok N119033 P0 B3
Send: N119034 G1 X109.294 Y113.485 E-0.18245*118
Recv: ok N119034 P0 B3
Send: N119035 G1 X109.129 Y113.721 E-0.10706*117
Recv: ok N119035 P0 B3
Send: N119036 G1 X108.938 Y113.941 E-0.10789*112
Recv: ok N119036 P0 B3
Send: N119037 G1 X108.727 Y114.138 E-0.10724*119
Recv: ok N119037 P0 B3
Send: N119038 G1 X108.489 Y114.318 E-0.11077*127
Recv: ok N119038 P0 B3
Send: N119039 G1 X108.248 Y114.463 E-0.10426*127
Recv: ok N119039 P0 B3
Send: N119040 G1 X107.790 Y114.659 E-0.18490*112
Recv: ok N119040 P0 B3
Send: N119041 G1 X107.507 Y114.735 E-0.10892*112
Recv: ok N119041 P0 B3
Send: N119042 G1 X107.218 Y114.781 E-0.10849*115
Recv: ok N119042 P0 B3
Send: N119043 G1 X106.925 Y114.796 E-0.10868*115
Recv: ok N119043 P0 B3
Send: N119044 G1 X106.633 Y114.781 E-0.10868*122
Recv: ok N119044 P0 B3
Send: N119045 G1 X106.348 Y114.736 E-0.10704*123
Recv: ok N119045 P0 B3
Send: N119046 G1 X105.875 Y114.591 E-0.18365*122
Recv: ok N119046 P0 B3
Send: N119047 G1 X105.613 Y114.469 E-0.10731*126
Recv: ok N119047 P0 B3
Send: N119048 G1 X105.365 Y114.320 E-0.10717*123
Recv: ok N119048 P0 B3
Send: N119049 G1 X104.980 Y114.008 E-0.18388*121
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
Send: N119050 M105*27
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
Send: N119051 M105*26
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
Send: N119052 M105*25
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
Send: N119053 M105*24
Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
Send: N119054 M105*31
No response from printer after 6 consecutive communication timeouts, considering it dead. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.
Changing monitoring state from "Printing" to "Offline (Error: Too many consecutive timeouts, printer still connected and alive?)"
Connection closed, closing down monitor

I have st-link v2 plugged in that I'm using to flash firmware, but I guess it can be used to do some debugging if someone guide me. I can leave the printer in current state for a while.

X-Ryl669 commented 3 years ago

Ok, so I hope you've openocd installed. In that case, you'll have to start the server:

$ ~/.platformio/packages/tool-openocd/bin/openocd -d2 -s ~/.platformio/packages/tool-openocd/scripts -f interface/stlink.cfg  -f target/stm32f1x.cfg

Replace stm32f1x.cfg by the configuration for your CPU

Then in another terminal, run gdb like this:

$ ~/.platformio/packages/toolchain-gccarmnoneeabi/bin/arm-none-eabi-gdb /path/to/your/marlin/repository/.pio/build/your_env/debug/firmware.elf

While in GDB, type:

(gdb) tar ext:3333
Remote debugging using :3333
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x0801b272 in ?? ()
(gdb) bt

And post the output of the commands above

If you are a bit lost, I've written a post about how to debug Marlin and there for setting up OpenOCD. Skip the part about unlocking the flash since you don't have a STM32.

m-kozlowski commented 3 years ago

(there's no debug subdir, so I'm using /usr/src/Marlin/.pio/build/LPC1769/firmware.elf)

(gdb) bt
#0  0x000019a8 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info frame
Stack level 0, frame at 0x10007fc0:
 pc = 0x19a8; saved pc = <not saved>
 Outermost frame: previous frame identical to this frame (corrupt stack?)
 Arglist at 0x10007fc0, args:
 Locals at 0x10007fc0, Previous frame's sp is 0x10007fc0

(gdb) info symbol 0x000019a8
No symbol matches 0x000019a8.
(gdb) disassemble /m 0x000019a8
No function contains specified address.
X-Ryl669 commented 3 years ago

So, it's useless... :cry: Your printer has crashed. You'll have to use #20492 to get a stack trace to print on the serial port (you can directly pull this, it was updated with bugfix recently). Also, please compile your firmware with platformio debug -e your_env instead of platformio run -e so the debug symbol are built in the elf file and will be useful for debugging.

X-Ryl669 commented 3 years ago

Just a note for developers: by default, the crash handlers in the framework don't remember the crash's stack trace. So until #20492 is merged (which does remember the stack trace and allow backtracing), we're blind when such crash happens.

Snargaloo commented 3 years ago

I am experiencing a similar if not exact issue. It took me a few weeks to isolate this. I know if I use 2.0.x-bugfix at commit c929fb5 everything prints perfectly.

I am unable to successfully compile again until commit c74f972. From this commit forward I experience issues with serial communication that ultimately cause some prints to fail. Not every type of print will fail. Models with straight lines and sharp corners will complete. Models with curves or circles will experience many pauses and exhibit blobs on the surface of the model.

I have used the Arc Welder OctoPrint plugin for a few months and have experienced pleasing results with it. If I enable Arc Welder compression, the print will most likely fail.

In order to find which commit was causing the issue, I had to find a situation that would duplicate the failure consistently.

This is the description of the hardware I am using:

I have a BTT SKR 1.4 Turbo connected to a BTT TFT-35 E3 V3.0 which is connected to an RPi4. I have the RPi4 Bluetooth device disabled and the first PL011 (UART0) is the primary UART. The RPi4 is connected to the TFT from internal GPIO pins on RPi4 to WIFI connector on TFT. The WIFI connector on the TFT automatically echos all data coming to and from the serial connection from the TFT to the SKR 1.4 Turbo.

For software, I do not use the OctoPi image on the RPi4. I use a current build of RaspberryPi OS with a pure Python3 environment. I connect to the printer using /dev/ttyAMA0. I use the current release of the Arc Welder plugin (1.1.0rc2). I have Meatpack enabled in Marlin, however, I have it disabled in OctoPrint since it won't allow OctoPrint to connect to the printer with my setup (See OctoPrint-Meatpack Issue #12).

These are the configuration files that are working: BTT SKR 1.4 Turbo: Marlin bugfix-2.0.x c929fb5 with these changes: changes_for_c929fb5.zip BTT TFT-35 E3 V3.0: BIGTREETECH-TouchScreenFirmware 923c5e0 with these changes: changes_for_923c5e0.zip

All the commits after the above for Marlin will not allow my printer to function properly. I am attaching the most recent configuration I tested that failed: BTT SKR 1.4 Turbo: Marlin bugfix-2.0.x cbf325a with these changes: changes_for_cbf325a.zip BTT TFT-35 E3 V3.0: BIGTREETECH-TouchScreenFirmware 923c5e0 with these changes: changes_for_923c5e0.zip

I will attach the model I was using for testing along with the sliced gcode and the arc-welded sliced gcode:

This is the model STL: Z bracket spacer V3.zip I use the current development branch of Cura that I compile from source. This is the sliced model g-code: Z_bracket_spacer_V3_eSunPETG(Translucent_Orange)_FR8939-.zip This is the Cura sliced model after it is processed through Arc-Welder: Z_bracket_spacer_V3_eSunPETG(Translucent_Orange)_FR8939-.aw.zip

To duplicate the failure, I simply load and print the Arc-Welded gcode in OctoPrint. The result is around line 1200, the serial communication gets out of sync and eventually results in serial timeout. I have attached a terminal log file showing the situation from the terminal view of OctoPrint. The error occurs at line 1212 this time (not always same line): Terminal.zip. The TFT will show a popup with the line mismatch and will beep continuously. I need to cycle power at the printer to recover.

I can provide any additional information or perform testing as requested.

X-Ryl669 commented 3 years ago

Right, from the log, it seems that Octoprint continue feeding lines to the printer while the printer asked to resend. At some point, it fails. Can you post the same terminal output with the previous version (c929fb5) ? The new code shouldn't have changed much from the serial parsing and processing.

The strange thing in the log are:

  1. The printer asks to resend N1213, (it told octoprint to wait when it was sent initially, but octoprint ignored the message)
  2. It then ACK a old line multiple time (N1181) with different G-code each time, so it's like it's not making any progress to the receiving buffer...

Meanwhile, if you can pull #20492 and build that with both "MARLIN_DEV_MODE" and "POSTMORTEM_DEBUGGING" enabled in Configuration_adv.h, when the crash will happen, it should dump the backtrace of the printer's crash so we'll have a better idea about where it failed.

Please notice that the OP had his printer crashed and not doing anything anymore, yours seems to run an emergency/error condition but it still run, so the above might seem less relevant.

X-Ryl669 commented 3 years ago

Ok, I think I have an idea. I'll send you a file to replace and test with it later today.

X-Ryl669 commented 3 years ago

Please apply the following patch on the current bugfix:

diff --git a/Marlin/src/core/language.h b/Marlin/src/core/language.h
index 923ad903cb..a813ec4ba1 100644
--- a/Marlin/src/core/language.h
+++ b/Marlin/src/core/language.h
@@ -131,6 +131,7 @@
 #define STR_WATCHDOG_FIRED                  "Watchdog timeout. Reset required."
 #define STR_ERR_KILLED                      "Printer halted. kill() called!"
 #define STR_ERR_STOPPED                     "Printer stopped due to errors. Fix the error and use M999 to restart. (Temperature is reset. Set it after restarting)"
+#define STR_ERR_SERIAL_MISMATCH             "Serial status mismatch"
 #define STR_BUSY_PROCESSING                 "busy: processing"
 #define STR_BUSY_PAUSED_FOR_USER            "busy: paused for user"
 #define STR_BUSY_PAUSED_FOR_INPUT           "busy: paused for input"
diff --git a/Marlin/src/gcode/queue.cpp b/Marlin/src/gcode/queue.cpp
index c49247912c..063b9b3dfb 100644
--- a/Marlin/src/gcode/queue.cpp
+++ b/Marlin/src/gcode/queue.cpp
@@ -324,29 +324,19 @@ void GCodeQueue::flush_and_request_resend() {
   ok_to_send();
 }

-inline bool serial_data_available() {
-  byte data_available = 0;
-  if (MYSERIAL0.available()) data_available++;
-  #ifdef SERIAL_PORT_2
-    const bool port2_open = TERN1(HAS_ETHERNET, ethernet.have_telnet_client);
-    if (port2_open && MYSERIAL1.available()) data_available++;
-  #endif
-  return data_available > 0;
-}
-
-inline int read_serial(const uint8_t index) {
-  switch (index) {
-    case 0: return MYSERIAL0.read();
-    case 1: {
-      #if HAS_MULTI_SERIAL
-        const bool port2_open = TERN1(HAS_ETHERNET, ethernet.have_telnet_client);
-        if (port2_open) return MYSERIAL1.read();
-      #endif
+// Multiserial already handle the dispatch to/from multiple port by itself
+inline bool serial_data_available(uint8_t index = SERIAL_ALL) {
+  if (index == SERIAL_ALL) {
+    for (index = 0; index < NUM_SERIAL; index++) {
+      if (SERIAL_IMPL.available(index)) return true;
     }
-    default: return -1;
+    return false;
   }
+  return SERIAL_IMPL.available(index);
 }

+inline int read_serial(const uint8_t index) { return SERIAL_IMPL.read(index); }
+
 void GCodeQueue::gcode_line_error(PGM_P const err, const serial_index_t serial_ind) {
   PORT_REDIRECT(SERIAL_PORTMASK(serial_ind)); // Reply to the serial port that sent the command
   SERIAL_ERROR_START();
@@ -468,14 +458,32 @@ void GCodeQueue::get_serial_commands() {
     }
   #endif

-  /**
-   * Loop while serial characters are incoming and the queue is not full
-   */
-  while (length < BUFSIZE && serial_data_available()) {
+  // Loop while serial characters are incoming and the queue is not full
+  bool hasData = true;
+  while (length < BUFSIZE && hasData) {
+    // Unless a serial port has data, this will exit on next iteration
+    hasData = false;
+
     LOOP_L_N(p, NUM_SERIAL) {
+      // Is there data available on this specific port ?
+      if (!serial_data_available(p)) continue;
+
+      hasData = true; // Let's make progress

       const int c = read_serial(p);
-      if (c < 0) continue;
+      if (c < 0) {
+        // This should never happen, let's log it
+        PORT_REDIRECT(SERIAL_PORTMASK(p));     // Reply to the serial port that sent the command
+        #if BOTH(MARLIN_DEV_MODE, POSTMORTEM_DEBUGGING)
+          // Crash here to get more information why it failed
+          BUG_ON("SP available but read -1");
+        #else
+          SERIAL_ERROR_START();
+          SERIAL_ECHOLNPGM(STR_ERR_SERIAL_MISMATCH);
+          SERIAL_FLUSH();
+          continue;
+        #endif
+      }

       #if ENABLED(MEATPACK)
         meatpack.handle_rx_char(uint8_t(c), p);

Test it and let me know if it fixes it. Please beware that this patch will not work if used with #20492 (it's missing some code I haven't submitted yet), so please test with the patch directly applied on bugfix, and if it does not work, checkout #20492, update your config_adv and reproduce the crash and post the crash trace.

Thanks!

Snargaloo commented 3 years ago

I ran the same Arc-Welded g-code again to firmware c929fb5 with the aforementioned changes. The print completed with no issues so I have attached a terminal log showing lines 1200 - 1300 as comparison: Terminal (working).zip.

I will now compile and run the pull #20492 and report results back here.

I received an ST-Link/V2 yesterday if this will be needed for further debugging.

Snargaloo commented 3 years ago

I just read your previous comments about the patch. I have applied this patch to bugfix 31a434b and noticed that there are some lines of code in the patch that read:
+ #if BOTH(MARLIN_DEV_MODE, POSTMORTEM_DEBUGGING) + // Crash here to get more information why it failed

This bugfix commit does not have POSTMORTEM_DEBUGGING option in Configuration_adv.h. Will this patch still perform as expected? Please advise if I should continue.

X-Ryl669 commented 3 years ago

If it does not apply feel free to remove the #if to #else part. It's not important right now, as I haven't sent the BUG_ON macro yet for inclusion.

By looking at the log, it seems that the new code is lagging behind the host a lot compared to the old code. I can't explain it yet. BTW, thank you for your report, it's very detailed.

Snargaloo commented 3 years ago

I ran the same Arc-welded g-code to commit 31a434b with your patch applied for the total changes of: changes_for_31a434b.zip

The print failed in the typical fashion. This is the terminal log starting at line 1200: Terminal (Failure with patched 31a434b).zip

For a more thorough analysis, I decided to enable serial logging in OctoPrint and ran the same g-code again. An anomaly occurred during g29 probing and BL-Touch reported an error near center of bed. The firmware threw an error to OctoPrint which which closed the terminal connection. However, since I have G29_Retry_and_Recover enabled, the G29 process started over again and completed before the firmware executed the Halt function. This is the serial log from this run: serial (BL-Touch Error patched 31a434b).zip

I restarted the printer and repeated the print. This is the serial log: serial (Failure with patched 31a434b).zip

For comparison, I flashed back to the firmware compiled from c929fb5 and ran the same g-code. The print completed with no issues. This is the serial log: serial (Success with c929fb5).zip

Comparing the serial logs, it is apparent that the 31a434b commit (with patch) starts asking to host to wait during idle even before the print is sent over. The serial log from commit c929fb5 does not have the word wait in it even once.

X-Ryl669 commented 3 years ago

In the good log, there is this:

2021-02-15 15:02:07,721 - Recv: e›&m¦üNot SD printing

and this:

2021-02-15 15:02:37,755 - Recv: Not SD printing
2021-02-15 15:02:37,757 - Communication timeout while printing, trying to trigger response from printer. Configure long running commands or increase communication timeout if that happens regularly on specific commands or long moves.

It seems that they are some issue with the communication reliability, and the new code does not succeed dealing with it. Also, while playing the "match the 7 difference" game between the good and bad version, the major difference is that, in the good logs, the printer ACK each line almost immediately, while in the bad logs it's ACK'ing the line with 20 lines of lag.

Can you send me your firmware.elf file ?

As far as I understand, wait message comes from this:

  #if NO_TIMEOUTS > 0
    static millis_t last_command_time = 0;
    const millis_t ms = millis();
    if (length == 0 && !serial_data_available() && ELAPSED(ms, last_command_time + NO_TIMEOUTS)) {
      SERIAL_ECHOLNPGM(STR_WAIT);
      last_command_time = ms;
    }
  #endif

So length is zero when the host has not send any comment yet. last_command_time is zero too (until it receives a command). So it all boils down to serial_data_available() returning 0 in the new code and not in the previous code. I need the firmware.elf of the previous and new code so I can compare the generated binary and figure out the difference.

X-Ryl669 commented 3 years ago

Another remark:

2021-02-15 12:53:56,362 - Send: N1 M115*39
[...]
2021-02-15 12:53:56,783 - Recv: ok N1 P63 B31
then
2021-02-15 12:53:56,803 - Send: M21
2021-02-15 12:54:00,820 - Recv: echo:No SD card
2021-02-15 12:54:00,826 - Recv: ok P63 B31
2021-02-15 12:54:00,828 - Recv: wait

If we look at the given timestamp, the first exchange is correct, the answer coming ~400ms after the query. In the second exchange, the answer comes 4 seconds after the query, so it's a lot larger than the timeout time (set to 1s in your config), so it's logical that the wait message is generated on next serial loop iteration. The real question comes down to why does processing M21 command took 4 seconds ?

EDIT: Seems like SD card initialization is slow, so it's expected.

X-Ryl669 commented 3 years ago

In the good log, the SD card detection also takes 4s to run:

2021-02-15 15:00:28,479 - Send: M21
2021-02-15 15:00:32,503 - Recv: echo:No SD card
2021-02-15 15:00:32,511 - Recv: ok P63 B31

2021-02-15 15:00:32,515 - Send: M155 S2
2021-02-15 15:00:32,575 - Recv:  T:29.37 /0.00 (3870.0000) B:35.16 /0.00 (3824.0000) @:0 B@:0
2021-02-15 15:00:32,578 - Recv: Not SD printing
2021-02-15 15:00:32,594 - Recv: ok P63 B31

Yet no wait is generated, and from the logic of the code, it should have been.

Snargaloo commented 3 years ago

Here is the firmware.elf file for 31a434b commit: firmware.elf (31a434b).zip

This is the firmware.elf file for c929fb5 commit: firmware.elf (c929fb5).zip

Snargaloo commented 3 years ago

The following snippet from the patched 31a434b serial log is what I find most interesting:

2021-02-15 13:24:17,260 - Recv: ok N1043 P0 B31 2021-02-15 13:24:17,263 - Send: N1068 G1 F1200 X110.796 Y113.624 E74.7258210 2021-02-15 13:24:17,266 - Recv: wait 2021-02-15 13:24:17,273 - Send: N1069 M73 P29 R3134 2021-02-15 13:24:17,273 - Recv: Not SD printing 2021-02-15 13:24:17,287 - Recv: ok N1044 P0 B31 2021-02-15 13:24:17,289 - Send: N1070 M117 29% L=4/137 Time Left=31m ETC=13:5572 2021-02-15 13:24:17,413 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 1044 2021-02-15 13:24:17,414 - Recv: Resend: 1045 2021-02-15 13:24:17,418 - Recv: ok N1013 P2 B32 2021-02-15 13:24:17,420 - Send: N1045 G1 F3000 E73.318359 2021-02-15 13:24:17,941 - Recv: Error:Line Number is not Last Line Number+1, Last Line: 1044 2021-02-15 13:24:17,944 - Recv: Resend: 1045

The following events seem to happen: -- The printer confirms that line N1043 is processed -- then the host sends line N1068 -- then the printer sends a wait request -- then the host sends line N1069 -- then the printer sends a Not SD Printing message -- then the printer confirms that line N1044 is processed -- then the host sends line N1070 -- then the printer sends an error that next line in buffer is not N1045 -- then the printer sends a request for re-send of line N1045 -- then the printer confirms line N1013 is processed -- then the host re-sends line N1045 -- then the printer sends an error that next line in buffer is not N1045

I don't understand the logic of why the printer would send this: 2021-02-15 13:24:17,418 - Recv: ok N1013 P2 B32

It still seems to know that it is on line N1044 and wants to receive line N1045. Where does N1013 come from?

X-Ryl669 commented 3 years ago

I'm not seeing anything wrong in the elf files, everything seems as expected.

X-Ryl669 commented 3 years ago

1) Please confirm that you're using the USB's serial output port for those logs. 2) N1013 is exactly 32 lines before N1045. (The same applied to the first logs sent) and your buffer size is also 32, so it's like if the circular buffer wrapped. 3) Anyway, there is no explainable reason that the printer lags behind so much. It does not lag in the old code. The queue code here was not modified by the commit you've identified as faulty.

I've checked the ELF disassembly for both version for serial_data_available and both query the actual device's available method.

X-Ryl669 commented 3 years ago

I don't think the wait message is important. In my understanding of the code (I might be wrong, so I've opened another bug #21104 to query other developers to confirm), the old code should have sent wait message as well, so it's either a bug in the old code, or some timing related stuff that are a bit different. Yet, the new code seems to loop faster than the old code on the serial link, so this does not explain why it's lagging. Worst: in the printer answers Nxxxx Pyy Bzz the zz tells how many buffers are free to be used (so when the printer is answering ok, it should never be greater than 31). You have B32 in the answer meaning that the printer is answering ok to a non existent command and everything is broken then.

Snargaloo commented 3 years ago

As stated in my first post, I am not using a USB port for serial communication. I am using the TX and RX lines of the TFT connection on SKR 1.4 Turbo.

X-Ryl669 commented 3 years ago

Let me double check this.

Snargaloo commented 3 years ago

During these test prints on the SKR 1.4 Turbo, the SD card slot is empty and nothing is plugged into USB ports. On the RPi4, there is a USB keyboard, a USB WIFI adapter, and an HDMI cable attached. Between the SKR 1.4 Turbo and RPi4 there is a BTT TFT-35.

X-Ryl669 commented 3 years ago

Can you try a test print with the USB serial port meanwhile ?

Snargaloo commented 3 years ago

I would have to reconfigure the firmware, remove the TFT, and install a different TFT. If I just attach a USB cable and initiate a print through it, the TFT will fight with the USB connection for control of the SKR 1.4 Turbo.

I am certain it would succeed just through USB. That is why I commented to this thread where the OP is not using a USB connection, but connecting from RPi4 to TFT connector of SKR 1.4 Turbo. Just as I am.

I could convert the machine back to USB, it would take a day or two.

X-Ryl669 commented 3 years ago

Ok, then forget about that. I have hard time figuring out what was MYSERIAL1 on your system (in old code). Can you apply this patch over c929fb5 and try to build. it'll fail and I need the errors the compiler is shouting:

diff --git a/Marlin/src/MarlinCore.cpp b/Marlin/src/MarlinCore.cpp
index b085e6145a..4603200f5b 100644
--- a/Marlin/src/MarlinCore.cpp
+++ b/Marlin/src/MarlinCore.cpp
@@ -868,6 +868,9 @@ inline void tmc_standby_setup() {
  *    • status LEDs
  *    • Max7219
  */
+template <typename T> struct Invalid { static void nothing() { const char invalid[sizeof(T) - 200000]; } };
+void * a = &Invalid<decltype(MYSERIAL1)>::nothing;
+
 void setup() {

   tmc_standby_setup();  // TMC Low Power Standby pins must be set early or they're not usable

It should say something like:

Marlin/src/MarlinCore.cpp: In instantiation of 'static void Invalid<T>::nothing() [with T = MarlinSerial<MarlinSerialCfg<0u> > >]':
Marlin/src/MarlinCore.cpp:872:43:   required from here
Marlin/src/MarlinCore.cpp:871:75: error: size of array is negative
Snargaloo commented 3 years ago

I like this approach much better, I'll get on it.

Snargaloo commented 3 years ago

I need Marlin_Dev_Mode or anything else enabled?

X-Ryl669 commented 3 years ago

No, that's not required. The compiler will say that it can't compile the code, and it'll list the exact type that's being used for MYSERIAL1. I'm not able to figure this out myself. Then I'll be able to check it's the same type that's being used in the current code.

Snargaloo commented 3 years ago
Marlin\src\MarlinCore.cpp:873:12: error: invalid conversion from 'void ()()' to 'void' [-fpermissive] 873 void * a = &Invalid<decltype(MYSERIAL1)>::nothing; ^~~~~~~~~~
void (*)()
Marlin\src\MarlinCore.cpp: In instantiation of 'static void Invalid::nothing() [with T = MarlinSerial]': Marlin\src\MarlinCore.cpp:873:43: required from here Marlin\src\MarlinCore.cpp:872:93: error: size '4294768624' of array exceeds maximum object size '2147483647' 872 template struct Invalid { static void nothing() { const char invalid[sizeof(T) - 200000]; } };

Marlin\src\MarlinCore.cpp:872:75: error: uninitialized const 'invalid' [-fpermissive]   
  872 | template <typename T> struct Invalid { static void nothing() { const char invalid[sizeof(T) - 200000]; } };
      |                                                                           ^~~~~~~
Marlin\src\MarlinCore.cpp:872:75: warning: unused variable 'invalid' [-Wunused-variable]*** [.pio\build\LPC1769\src\src\MarlinCore.cpp.o] Error 1
X-Ryl669 commented 3 years ago

Thanks! You can now drop the patch (git checkout -f)

X-Ryl669 commented 3 years ago

Can you apply this patch on the recent bugfix's version:

diff --git a/Marlin/src/core/serial_hook.h b/Marlin/src/core/serial_hook.h
index ad8ec12b6e..9952a087bd 100644
--- a/Marlin/src/core/serial_hook.h
+++ b/Marlin/src/core/serial_hook.h
@@ -35,14 +35,14 @@ struct BaseSerial : public SerialBase< BaseSerial<SerialT> >, public SerialT {

   void msgDone() {}

-  bool available(uint8_t index) { return index == 0 && SerialT::available(); }
-  int read(uint8_t index)       { return index == 0 ? SerialT::read() : -1; }
+  bool available(uint8_t index) { return SerialT::available(); }
+  int read(uint8_t index)       { return SerialT::read(); }
   bool connected()              { return CALL_IF_EXISTS(bool, static_cast<SerialT*>(this), connected);; }
   void flushTX()                { CALL_IF_EXISTS(void, static_cast<SerialT*>(this), flushTX); }

   // We have 2 implementation of the same method in both base class, let's say which one we want
-  using SerialT::available;
-  using SerialT::read;
+  NO_INLINE bool available()    { return SerialT::available(); }
+  NO_INLINE int  read()         { return SerialT::read(); }
   using SerialT::begin;
   using SerialT::end;

Test again a print and report if it helps ?

Snargaloo commented 3 years ago

Do I also apply the previous patch given here?

X-Ryl669 commented 3 years ago

I've updated the patch in my last comment. You don't need to apply the first patch, I'm just checking that the right code is compiled. Once you've applied the last's comment patch, and built it, please send me the firmware.elf file again so I can assert it's being compiled as expected. Thanks!

Snargaloo commented 3 years ago

I am getting an error during compiling: Terminal Error.log

X-Ryl669 commented 3 years ago

Seems like a cast is required:

diff --git a/Marlin/src/core/serial_hook.h b/Marlin/src/core/serial_hook.h
index ad8ec12b6e..e9aa5f11b2 100644
--- a/Marlin/src/core/serial_hook.h
+++ b/Marlin/src/core/serial_hook.h
@@ -35,14 +35,14 @@ struct BaseSerial : public SerialBase< BaseSerial<SerialT> >, public SerialT {

   void msgDone() {}

-  bool available(uint8_t index) { return index == 0 && SerialT::available(); }
-  int read(uint8_t index)       { return index == 0 ? SerialT::read() : -1; }
+  bool available(uint8_t index) { return (bool)SerialT::available(); }
+  int read(uint8_t index)       { return (int)SerialT::read(); }
   bool connected()              { return CALL_IF_EXISTS(bool, static_cast<SerialT*>(this), connected);; }
   void flushTX()                { CALL_IF_EXISTS(void, static_cast<SerialT*>(this), flushTX); }

   // We have 2 implementation of the same method in both base class, let's say which one we want
-  using SerialT::available;
-  using SerialT::read;
+  NO_INLINE bool available()    { return (bool)SerialT::available(); }
+  NO_INLINE int  read()         { return (int)SerialT::read(); }
   using SerialT::begin;
   using SerialT::end;

The difference being (bool) in front of SerialT::available (2 places), and (int) in front of SerialT::read (2 places).

Snargaloo commented 3 years ago

With new changes, still compiler error: Teminal Error (16-FEB-2021 11-04).log

X-Ryl669 commented 3 years ago

Another try:

diff --git a/Marlin/src/core/serial_hook.h b/Marlin/src/core/serial_hook.h
index ad8ec12b6e..e576aef1f7 100644
--- a/Marlin/src/core/serial_hook.h
+++ b/Marlin/src/core/serial_hook.h
@@ -35,14 +35,14 @@ struct BaseSerial : public SerialBase< BaseSerial<SerialT> >, public SerialT {

   void msgDone() {}

-  bool available(uint8_t index) { return index == 0 && SerialT::available(); }
-  int read(uint8_t index)       { return index == 0 ? SerialT::read() : -1; }
+  size_t available(uint8_t index)   { return SerialT::available(); }
+  int16_t read(uint8_t index)       { return SerialT::read(); }
   bool connected()              { return CALL_IF_EXISTS(bool, static_cast<SerialT*>(this), connected);; }
   void flushTX()                { CALL_IF_EXISTS(void, static_cast<SerialT*>(this), flushTX); }

   // We have 2 implementation of the same method in both base class, let's say which one we want
-  using SerialT::available;
-  using SerialT::read;
+  NO_INLINE size_t available()     { return SerialT::available(); }
+  NO_INLINE int16_t read()         { return SerialT::read(); }
   using SerialT::begin;
   using SerialT::end;
X-Ryl669 commented 3 years ago

(BTW, it seem LPC176x is not respecting Arduino's Serial API interface here).

X-Ryl669 commented 3 years ago

Ok, I've found the reason why we get ok N1013 P2 B32. I don't know if it's related, it still does not explain why the printer is lagging behind.