MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0
16.03k stars 19.13k forks source link

[2.0.x] A lot of resend errors #10927

Closed GMagician closed 6 years ago

GMagician commented 6 years ago

With last 2.0, using repetier, I see so many USB errors:

mega2560 and ramps 1.4 + XON/XOFF

print seems to be working due to resend action

GMagician commented 6 years ago

Done some test, issue appeared after #10911 (serial files changes) I'll try to figure out why

Note: my cnf has XON/XOFF, RX = 1024 and TX = 0

I think it can be related to missing 'writeNoHandshake' function by I'll have to check (No xoff char sent)

ejtagle commented 6 years ago

@GMagician ; The XOFF char should be sent inside the store_rxd_char() function.

if (rx_count >= (RX_BUFFER_SIZE) / 8) {

Maybe i missed something...

ejtagle commented 6 years ago

the writeNoHandshake() was removed because, after the cleanup, it ended calling the write() function directly...

ejtagle commented 6 years ago

The logic behind XON/XOFF is quite simple: Read buffer fills only in store_rxd_char(), so that is the only place that can cause an XOFF char to be sent. And read buffer only empties in read(), so that is the only place where XON should be sent.

cdedwards commented 6 years ago

Slic3r sees that line number error all the time on connect

ejtagle commented 6 years ago

I tried XON/XOFF and the codes are being sent as expected... Maybe something has changed in the timing?

GMagician commented 6 years ago

@ejtagle I'm investigating and I have to ask you why have you added

              // clear the TXC bit -- "can be cleared by writing a one to its bit
              // location". This makes sure flush() won't return until the bytes
              // actually got written
              SBI(M_UCSRxA, M_TXCx);

when Tx interrupt is disabled? since interrupts are not used this should be useless (and remove it save some cycles)

GMagician commented 6 years ago

well, there is nothing wrong with this code..just a question. when cleaning receive interrupt bit, if some char will arrive, interrupt will be pending and generated after bit is re-set?

GMagician commented 6 years ago

@ejtagle I think about a race condition. What about MarlinSerial::write after data register empty test if rx interrupt occurs with buffer full? inside interrupt will be sent XOFF but immediately after returning from it is written a new byte

ejtagle commented 6 years ago

@GMagician : On your first question... Probably, if not using TX interrupts the SBI(M_UCSRxA, M_TXCx); can be removed, as it is unneeded 👍 On your 2nd question, yes: The interrupt reception bit is just a mask. It uses the RXCn bit of the UCSRnA register to determine if an interrupt should be generated or not. The RXCn is readonly, the only way to clear it is by reading the received data from the serial port

Race conditions is what could be happening... If the RX buffer is full and you get an RX interrupt, then the XOFF char was already sent by store_rxd_char() when the buffer was becoming full

   // for high speed transfers, we can use XON/XOFF protocol to do
      // software handshake and avoid overruns.
      if ((xon_xoff_state & XON_XOFF_CHAR_MASK) == XON_CHAR) {

        // calculate count of bytes stored into the RX buffer
        ring_buffer_pos_t rx_count = (ring_buffer_pos_t)(rx_buffer.head - rx_buffer.tail) & (ring_buffer_pos_t)(RX_BUFFER_SIZE - 1);

        // if we are above 12.5% of RX buffer capacity, send XOFF before
        // we run out of RX buffer space .. We need 325 bytes @ 250kbits/s to
        // let the host react and stop sending bytes. This translates to 13mS
        // propagation time.
        if (rx_count >= (RX_BUFFER_SIZE) / 8) {

The first 'if' clause makes sure once an XOFF is sent, it will be never sent again unless an XON was previously sent.

The conditions to send an XOFF is if the count of unread chars is rx_count >= (RX_BUFFER_SIZE) / 8 , in your case, 128 bytes. And the buffer has margin for 896 extra bytes

The only way to send an XON is if the count of queued chars is at the read() function

if (rx_count < (RX_BUFFER_SIZE) / 10) {

You could try 2 things: Changing the limits (for example, use an XOFF limit of RX_BUFFER_SIZE) / 16 instead of RX_BUFFER_SIZE) / 8

And use an XON limit of 0, instead of if (rx_count < (RX_BUFFER_SIZE) / 10) {

Also, could try to decrease speed ... The idea is to find out the culprit... I don´t know if there is a way to log the communication from the PC side. On Windows, i have successfully used Realterm with XON/XOFF control to send files,..

Getting a raw log would be useful, as you would get the XON/XOFF chars (coded 0x11 and 0x13) to analyze...

Also, the Marlin code already has counters for dropped chars, with that you can ensure the problem is dropped chars (=the host either not receiving the XON/XOFF chars or sometinh else that causes delay in the processing of those chars...)

GMagician commented 6 years ago

I've done some test and restoring some "CRITICAL_SECTION" calls will solve...continuing

GMagician commented 6 years ago

@ejtagle what i'm supposing about race is this: when 'write' function is sending something, it test for data register free, if after this test and char send an rx interrupt is received and rx buffer is full, rx interrupt will send XOFF but as soon as interrupt exits, 'write' will overwrite xoff with 'c' and xoff will be lost (and maybe also 'c'). This was already present in previous version, but since this may be race issue, different execution times may raise new pending bugs

GMagician commented 6 years ago

it seems that CBI/SBI(M_UCSRxB, M_RXCIEx); are not doing what expected. Replaced them with CRITICALSECTION?!? will solve issue (in 'available' and 'read' functions is enough to see a big difference)

ejtagle commented 6 years ago

Humm... That is very strange... But maybe you are giving the answer... ;)

ejtagle commented 6 years ago

I have to admit that it does not make sense, unless the interrupt is somehow already flagged at the interrupt controller... I really would like to avoid critical sections, because they add latency to the Stepper ISR...

ejtagle commented 6 years ago

I disassembled the code itself, and there is an interesting race condition here...

 ring_buffer_pos_t MarlinSerial::available(void) {
    #if RX_BUFFER_SIZE > 256
      const bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx);
      CBI(M_UCSRxB, M_RXCIEx);
    #endif

      const ring_buffer_pos_t h = rx_buffer.head, t = rx_buffer.tail;

    #if RX_BUFFER_SIZE > 256
      if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx);
    #endif

    return (ring_buffer_pos_t)(RX_BUFFER_SIZE + h - t) & (RX_BUFFER_SIZE - 1);
  }

This is assembled to the following:

 ring_buffer_pos_t MarlinSerial::available(void) {
    #if RX_BUFFER_SIZE > 256
      const bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx);
   1a938:   20 91 c1 00     lds r18, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
      CBI(M_UCSRxB, M_RXCIEx);
   1a93c:   80 91 c1 00     lds r24, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
   1a940:   8f 77           andi    r24, 0x7F   ; 127
   1a942:   80 93 c1 00     sts 0x00C1, r24 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
    #endif

      const ring_buffer_pos_t h = rx_buffer.head, t = rx_buffer.tail;
   1a946:   80 91 44 11     lds r24, 0x1144 ; 0x801144 <rx_buffer+0x400>
   1a94a:   90 91 45 11     lds r25, 0x1145 ; 0x801145 <rx_buffer+0x401>
   1a94e:   40 91 46 11     lds r20, 0x1146 ; 0x801146 <rx_buffer+0x402>
   1a952:   50 91 47 11     lds r21, 0x1147 ; 0x801147 <rx_buffer+0x403>

    #if RX_BUFFER_SIZE > 256
      if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx);
   1a956:   27 ff           sbrs    r18, 7
   1a958:   05 c0           rjmp    .+10        ; 0x1a964 <_Z22get_available_commandsv+0x66>
   1a95a:   20 91 c1 00     lds r18, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
   1a95e:   20 68           ori r18, 0x80   ; 128
   1a960:   20 93 c1 00     sts 0x00C1, r18 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
    #endif

    return (ring_buffer_pos_t)(RX_BUFFER_SIZE + h - t) & (RX_BUFFER_SIZE - 1);
   1a964:   84 1b           sub r24, r20
   1a966:   95 0b           sbc r25, r21
   1a968:   93 70           andi    r25, 0x03   ; 3
   1a96a:   89 2b           or  r24, r25
   1a96c:   09 f4           brne    .+2         ; 0x1a970 <_Z22get_available_commandsv+0x72>
   1a96e:   21 c1           rjmp    .+578       ; 0x1abb2 <_Z22get_available_commandsv+0x2b4>
    false);
}

You will notice that the instruction CBI(M_UCSRxB, M_RXCIEx); is implemented as a

-Read register -Clear bit7 -Write register.

The same register also holds the TX interrupt, so, if for some reason a TX interrupt happens between the read and the write, and that was the last TX interrupt, and it is disabled in the ISR, then TX interrupts will be reenabled and the TX isr will be called when no char is available. That is a big problem.... So, there is a problem here...

ejtagle commented 6 years ago

@GMagician : After careful analysis: You are absolutely right!! ... There is a hidden race condition in the code... Unfortunately CBI/SBI(M_UCSRxB, M_RXCIEx); are not atomic operations on AVR :( That means that in MarlinSerial::available() / MarlinSerial::read(), there is a chance the RX isr will take an incorrect state, because the M_UCSRxB is in the middle of being altered. If that happens, then the XON/XOFF could perhaps end not being sent.. This is a very unfortunate event. I am rethinking if there something that could be done...

Artem-B commented 6 years ago

AVR does have ops to set/clear individual bits in I/O ports atomically and judging by the macro names marlin code did assume that we'll end up with those instructions. I wonder why compiler didn't generate them. https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_SBI.html https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_CBI.html

GMagician commented 6 years ago

@Artem-B that can be done only with I/O registera 0-1f....this one is c1 (if correctly understood chip DS) @ejtagle how did you get disassembler? this could be useful instead of trying code changes.... @ejtagle also peek has the same issue....

ejtagle commented 6 years ago

@Artem-B Because, unfortunately, AVR has no instructions to set the variables or the UCSRnA register bits atomically, and that is what we need. Don´t worry: After a lot of thought, the race conditions can be safely handled (i think!)

Attached is my attempt to handle and fix them. If anyone @GMagician or you want to test them, it would be great! (you will see i am trying to avoid critical sections, to improve pulse jitter of the stepper drivers) MarlinSerial.zip

ejtagle commented 6 years ago

@GMagician : If you have the Marlin.ino..elf file, you can use avr-objdump -S -d Marlin.ino.elf and you will get the disassembled version with source code intermixed. It´s very handy sometimes.

GMagician commented 6 years ago

@ejtagle I'll give a try this evening but in my configuration TX interrupt should not be enabled (TX buffer is 0).

ejtagle commented 6 years ago

It should work also for TX buffer = 0. Unfortunately, the race condition is still there. The idea is to learn to live with it. The only doubt i have right now is if there is any latency between disabling the RX interrupt and no RX isr being called...

ejtagle commented 6 years ago

(if there is any latency, then this file will show the same problems as before...) Right now, the idea is never to touch the RX ISR enabling at the ISR itself. And handle disabling of TX isrs with an implicit retry if it fails

GMagician commented 6 years ago

The idea is to learn to live with it

can't agree, sorry. When host will get 'wrong line error' will resend line and that's ok but when marlin says 'echo:unknown command' reporting some parts of a real G1 line maybe this will lead to something wrong... If I have to choice about jitter on stepper (some distortion on printed part) or missing printed parts (missed commands) I prefer first, what if missed command is T1?

ejtagle commented 6 years ago

Let me correct myself: The idea is not to learn to live with the serial errors. My idea is to implement the serial class in such a way that, even if there is a race condition, it perfectly works and sends the XON/XOFF chars reliably and does not lose any chars. I also think that is unacceptable to have unreliable communication, and if it can´t be solved without the critical sections, then be absolutely sure we will use them! - I repeat, unreliability of the Xon/Xoff algorithm is not acceptable at all. I am just trying to figure out if there is a way to avoid the critical sections, if possible, without comprimising integrity of the communication 👍 - Sorry for not being as clear as i should. English is not my native language... ;)

GMagician commented 6 years ago

@ejtagle

Sorry for not being as clear as i should. English is not my native language...

we have the same problem, so forgive me if I ever will be rude with my sentences

GMagician commented 6 years ago

I still have a doubt about this race condition, already present before:

void MarlinSerial::write(const uint8_t c) {
  while (!TEST(M_UCSRxA, M_UDREx)) sw_barrier();
  --> rx interrupt here that will send xoff
  M_UDRx = c;
}

even if less probable, if h/w can't shift out 'XOFF' byte new char 'c' will corrupt sent data

ejtagle commented 6 years ago

@GMagician: You are absolutely right on that race condition... Sending Xoff from the RX isr is a big problem...I think i have a solution for that... Excellent analysis !!

GMagician commented 6 years ago

@ejtagle I tested attached zip but it doesn't solve, I expected that since TX_BUFFER = 0 means no tx interrupt.

ejtagle commented 6 years ago

@GMagician : I am working on a fix for the race condition. I will stress test it to avoid wasting your precious time... 👍

GMagician commented 6 years ago

@ejtagle

use avr-objdump -S -d Marlin.ino.elf

I'm using platformio where do I find avr?

GMagician commented 6 years ago

@ejtagle

I will stress test it to avoid wasting your precious time...

don't worry for my time, since I'm the only one to have this issue i'm glad to test it but I'll not be available for next 2 days.

ejtagle commented 6 years ago

@GMagician

MarlinSerial.zip This is my latest version. I wrote an sketch to test it, and XON/XOFF are being sent as neccesary. I even tried with a 512 rx buffer with no lost bytes when using XON/XOFF

GMagician commented 6 years ago

I've tested code not solved, still get checksum errors

Note that _written and flushtx are not used when TX_BUFFER = 0 so you can restore that parts

GMagician commented 6 years ago

I even tried with a 512 rx buffer with no lost bytes when using XON/XOFF

no xon/xoff with 512 bytes...look below

#if RX_BUFFER_SIZE >= 1024
  // Enable to have the controller send XON/XOFF control characters to
  // the host to signal the RX buffer is becoming full.
  #define SERIAL_XON_XOFF
#endif
GMagician commented 6 years ago

@ejtagle where did you get avr.exe? which app install it?

ejtagle commented 6 years ago

Let´s go in order... _written and flushTX are not used in the class itself, but there is a macro referencing them.. After close inspection, seems the macro (at core.h) is only defined when a TX buffer is used... In that case, you are right, and we can remove them -I tried by also changing that conditional #if RX_BUFFER_SIZE >= 1024..., just to stress a little bit more the library...

I use Arduino to compile it, and it already installs the AVR compiler. But, you can get the avr-objdump by installing http://www.microchip.com/mplab/avr-support/avr-and-arm-toolchains-(c-compilers) or http://blog.zakkemble.co.uk/avr-gcc-builds/

ejtagle commented 6 years ago

Attached an standalone Arduino sketch that i used to test the MarlinSerial hal.. It just counts characters received ... sketch_jun05a.zip

So, i can just test the serial class with no Marlin or other interrupts running...

GMagician commented 6 years ago

I can see errors only when printer is printing...With no marlin, repetier is not able to proceed

ejtagle commented 6 years ago

I know, i know. Right now the problem is how to proceed to determine the cause of the problems... Is it Marlin (not sending the XON/XOFF chars) ? ... Is it repetier, not recognizing the characters on time ?

GMagician commented 6 years ago

I also tried with buffer = 512 (so no xon/xoff) and I got again crc error. So is not an issue related to xon/xoff

thinkyhead commented 6 years ago

A solution for this bug would be most welcome, since it's one of the critical issues holding back the 1.1.9 release. Reverting to CRITICAL_SECTION_START is also acceptable, because serial handshaking is not a major bottleneck in achieving reasonable print speeds.

ejtagle commented 6 years ago

But all the race conditions you pointed out were right. Does repetier use XON/XOFF even if the printer does not support it ? Maybe they are adding to the CRC the XOFF/XON chars...

ejtagle commented 6 years ago

@thinkyhead : My main concern here is not if reverting or not will fix it. It is not understanding the cause of the problem. Critical sections could be masking a more subtle problem.. All the tests i have done, transferring HUGE amounts of information on purpose, just to make it fail, did not fail at all. That´s something to worry about...

thinkyhead commented 6 years ago

The thing is, we have an obligation to the public and we cannot wait forever for a solution, allowing one issue to restrain all the other fixes and improvements that we need to get out to users. If the previous code works better in the real world, we will simply have to accept it and go back to the old code, whether we understand why it works or not.

The same will have to go for the endstop code now leading to inexplicable errors in dual-axis homing. If we can't figure out why the new and "better" code is broken for these users, then we must go back to the old and "worse" code that has the virtue of working.

AnHardt commented 6 years ago

Maybe they are adding to the CRC the XOFF/XON chars...

Very god point. What makes you think RepetierHost would work in XON/XOFF mode at all? XON/XOFF is relative new to Marlin. The "ok"-protocol always worked without XON/XOFF. Regardless of using ping-pong or the advanced buffersize (default) method. (Important is to tell RepetierHost about Marlins RX buffer size and it seems to help is you tell it only about 3/4 of Marlins buffer size (not tested in RH versions above 2.0).)

AnHardt commented 6 years ago

@repetier Can RH handle XON/XOFF?

ejtagle commented 6 years ago

I redid all my tests. Using a Windows Serial terminal called YAT (https://sourceforge.net/projects/y-a-terminal/), i sent a full gcode file to the SD card (with line numbers and checksums). 2megabytes. Not a single error or retransmission or anything. If disabling XON/XOFF i get lots of overruns. So, i suspect @AnHardt could be right here...

repetier commented 6 years ago

No, we do not support XON/XOFF handling. We only rely on our own traffic counting a sthis is compatible with all versions.