MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0
16.19k stars 19.21k forks source link

[2.0.x] A lot of resend errors #10927

Closed GMagician closed 6 years ago

GMagician commented 6 years ago

With last 2.0, using repetier, I see so many USB errors:

mega2560 and ramps 1.4 + XON/XOFF

print seems to be working due to resend action

GMagician commented 6 years ago

@ejtagle

But all the race conditions you pointed out were right.

yes, they were really presents

@repetier

No, we do not support XON/XOFF handling

that's bad, XON/XOFF gives safer communication (but it's also ok since, maybe, only Marlin support it)

@thinkyhead

we have an obligation to the public and we cannot wait forever for a solution

The same will have to go for the endstop code now leading to inexplicable errors in dual-axis homing

I agree for 1.1.8 stable branch, but stopping this kind of work for "debug" branches will fossilize Marlin

@thinkyhead, @AnHardt, @ejtagle I suspect that, since I still have crc error when XON/XOFF is disabled, that my mega may have reached its limits (I get these error only when printing). ejtagle fix has reduced such errors (now I saw only crc, before more and different ones) When we enable critical sections we stop every interrupt this, I suspect, will give to serial communication a little bit of breath, enough to "not miss" chars. I'll continue to investigate to see if I'm able to identify cause.

repetier commented 6 years ago

Why should XON/XOFF be safer? Safety is only by checksums to protect against communication errors. XON/XOFF is maybe even more error prone as a com error hitting the control char would cause com errors. All I see is that communication might be faster, if you get no errors. On the other side it uses special chars and therefor it is not implementable in firmwares using binary transfers (repetier-firmware).

GMagician commented 6 years ago

@repetier is safer because it can detect when receiver has its own buffer full and stop sending data.

EDIT XON/XOFF may be also used in binary protocols (using escape codes). Maybe safer is not correct "word" but I home concept is passed

repetier commented 6 years ago

We have no buffer overrun problem as we count the bytes send and bytes acknowledged with "ok". That is what disbaling ping pong is for. It is essentially the same speed gain without need of XON/XOFF chars just by doing the data counting on our side.

AnHardt commented 6 years ago

@GMagician

is safer because ...

No. The used protocol ensures not to overfill the buffer.

PingPong

As long the host never sends a line longer than the rx-buffer all is well and can't go wrong.
This has large latencies between the commands.

To improve the latencies you can tell the host about the rx-buffer size.

Against that I invented the "Emergence Command Parser". A few important commands are parser directly in the RX-interrupt bypassing the RX-buffer. So now a "stop" can overtake the queue in the buffer. To make that work the host has to support that. He has to violate the above protocols. He has to know that he can send, this few commands, even if not have received an 'ok'. (As far as i know, up to now, no host implemented this. But it's relatively easy to implement - compared to)

XON/XOFF is usually implemented in the serial driver (on the host side). Application programmers usually use the ready made drivers for a device. Violating the XON/XOFF protocol by purpose, (for the host developer) would mean to make his own device drivers. That will never happen.

GMagician commented 6 years ago

@AnHardt thanks, detailed explanation

GMagician commented 6 years ago

Disabled xon/xoff and increased buffer in repetier to 768 (3/4 of 1024) but same errors: 'checksum' and 'last line number+1'

GMagician commented 6 years ago

I'm trying to analyze traffic but I can't find a valid "free" sniffer for win10

ejtagle commented 6 years ago

@GMagician: For some reason, you are still losing chars. May i ask if your Ramps board has a CH340 USB to serial converter, or if it has an Atmega8u2 acting as a USB to serial converter ? I seem to recall that there was a problem with older firmwares used for the USB to serial conversion, and there were reports of lost characters... (https://www.arduino.cc/en/Hacking/DFUProgramming8U2)...

You could try the following sketch: test_serial_sketch.zip

This sketch implements a basic Gcode parser, that only verifies checksums (and will also print the received line if an error is detected)

The following file is just a GCode file with proper line numbers and checksums gcode with checksums.zip

You must use a Serial terminal that supports the XON/XOFF protocol to send it: I use YAT https://sourceforge.net/projects/y-a-terminal/

The idea is the following: -Write the sketch to the Mega2560 board (powering it from USB is fine, you don´t need or want to power motors, as ports are not being initialized) -Open YAT, select port, set port speed to 250000, Protocol to Xon/Xoff software flow control and in "text settings" you must select as EOL sequence -In YAT, select the given Gcode file (decompress it to a suitable location) and send it -You should get tons of "ok" but no errors.

Alternatively, you can use Marlin to test this: In that case you will need an SD card plugged in the SD slot (if you don´t have any, then you cannot use this method, i really doubt you would actually want to execute the Gcode) -Just flash marlin, but configured for XON/XOFF support -Open YAT, select port, set port speed to 250000, Protocol to Xon/Xoff software flow control and in "text settings" you must select as EOL sequence -Send the next command to start saving all the received commands to the SD card in a file called gtest.gco: M28 gtest,gco -Finally send the supplied file. You should get a bunch of "ok" but no errors.

The 3rd alternative that you can try is a dry run: -Just flash marlin, but configured for XON/XOFF support -Open YAT, select port, set port speed to 250000, Protocol to Xon/Xoff software flow control and in "text settings" you must select as EOL sequence -Send the next command to disable extruder M111 S255 -Finally send the supplied file. You should get a bunch of "ok" but no errors. - The printer will actually move, but no heating of the bed or extruder and no extrusion. You should get no errors.

In all cases, YAT allows to LOG all the sent and received data (Enable it in the LOG menu), and that can help to find out in what case or what kind of corruption is happening...

GMagician commented 6 years ago

Thanks @ejtagle, on my mega I don't have for sure ch340 chip but I don't know what is used.

Sending file I get: error.txt I also get errors...

GMagician commented 6 years ago

I don't think it's a USB bridge bug. Replacing serial code with CRITICAL_SECTION it works perfectly

ejtagle commented 6 years ago
N32149 G1 X96.172 Y104.346 E67
Error:checksum m
ismatch: SENT:32 COMPUTED:9
RXD line: 897 G1 X1*099760 7 3  5Y1 X4Y3 13Y1E1303163 G1 X11G1500 E679.Y202
D:77 Q:1023
9.41064*95
N32150 G1 X96.43 Y104.4 E679.41503*101

This specific piece is telling that there were dropped characters (D:77 means 77 bytes were dropped), and Q:1023 means all the RX buffer was used at that time.

The main question here is why...

This means either the XON/XOFF was not sent, was not received, or the terminal is ignoring it, or something is filtering it out...

In YAT: New terminal-> yat-1 Then Terminal->Settings->Text Settings-> yat-2

Seems there is something in the middle that is buffering the data... The Atmega2u8 that is used in the Arduino Mega2560 has a 128 byte buffer for both TX and RX, and a polling interval of 1mS, so, going up to 250Kbits per second, it could introduce a latency of up to 25 * 2 = 50bytes. Having a 1024 bytes buffer should be more than enough to handle them

The other thing would be to know where exactly a critical section "solves" it -- Maybe it is a workaround for something more serious...

GMagician commented 6 years ago

Image 1 is what I have..when terminal open it says XON. I have to set EOL to LF otherwise I see ok < LF >

but really I never see XON/XOFF during file sending and this is really misterious

GMagician commented 6 years ago

But what about same errors when XON/XOFF protocol is off? repetier has its own logic to limit buffer....

GMagician commented 6 years ago

I have to check one thing.... I hope my mega don't have FTDI chip...I recently read something weird about company behaviors https://hackaday.com/tag/ftdi-gate/

second attack is maybe what I could facing....

ejtagle commented 6 years ago

Yes, Repetier has its own protocols, i agree... It should not happen, but something is happening, i have no idea what. On the YAT log i do not have XON/XOFF chars. The XON/XOFF chars are probably handled by the windows serial driver, that is why i dont see them. But if I disable Xon/Xoff on YAT, and keep exactly the same sketch running, then the transfer goes way faster and gives tons of error and invalid checksums. YAT seems not to be displaying those XON/XOFF chars. But, if you enable raw logging, yat-3 And open the log with a hex editor, you can look for the 0x11 and 0x13 chars and you should see them: In my case, if disabling Xon/Xoff in YAT, i get: yat-5

You can also switch YAT to Manual Software XON/XOFF. The difference is that instead of trusting the windows driver, YAT implements its own XON/XOFF parsing. And in that case you WILL see the XON/XOFF chars. This is an screenshot on how it looks here:

yat-6

This could also be a driver issue...

GMagician commented 6 years ago

Wow....I don't understand.... my yat setup: i1 i2

log got: YAT-Log-20180609-082505.zip

it seems that XOFF doesn't stop anything

GMagician commented 6 years ago

With "manual xon/xoff" it works. Then from xon/xoff point of view it seems that my arduino drivers don't handle xon/xoff. With this we can close xon/xoff issue part. But Why disabling xon/xoff (define commented) I still get errors using repetier but with critital section not? (Yesterdeay I had a 2 hours print with critical section active with no error at all)

Sineos commented 6 years ago

Kind of sounds like the same discussion: https://github.com/grbl/grbl/issues/50#issuecomment-4143431

GMagician commented 6 years ago

@Sineos in my case is not only an xon/xoff issue. I also programmed printer with no XON/XOFF and my errors are still there. If I restore CRITICAL_SECTION in serial.c everything works (Yesterday with current 2.0.x branch and such "fix" I made a 2 hours print with no issues)

GMagician commented 6 years ago

Now another test. @ejtagle I used your sketch modified to let me run repetier. I modified your gcode file removing crc and I printed it... I don't use XON/XOFF in your skecth so I use repetier buffer handling (I set cache to 256)

This is what I get: a.txt

GMagician commented 6 years ago

Wait Wait Wait...maybe I got it I modified your sketch to echo acknowledged line number, this is what I get when idle

10:21:15.386 : N50 M105*18 10:21:15.402 : echo:50 10:21:15.402 : ok 10:21:15.402 : echo:0 10:21:15.402 : ok

double 'ok' ack, this can destroy repetier buffer logic handling

EDIT: Ok, your parser don't handle 0x0D in line feed. adjusting parser to handle it and everything works... Now I'll check Marlin Parser (just to be sure)

EDIT2: Marlin parser is immune to that

GMagician commented 6 years ago

Ok, let's resume current status: 1) XON/XOFF is working correctly with these fix 2) XON/XOFF is not handled by repetier not by my drivers 3) checksum error/line error are not seen using "test" software, even with repetier 4) checksum error are seen using "marlin" only when RX interrupt are disabled, when CRITICAL_SECTION is used everything works as expected

ejtagle commented 6 years ago

-The fact that XON/XOFF only works in manual mode, means your (windows/macos/linux) drivers are non compliant and do not handle XON/XOFF (linux is known to have that issue, Windows should work unless the driver itself is not handling them (incomplete driver)!) -Critical section disables not only RX interrupts, all interrupts, including stepper and temperature isrs... -The test software (sketch) is using exactly the same code as Marlin ...

The conclusion to this could be that perhaps the problem is caused by something else inside Marlin... Either an ISR altering something, or memory corruption, or....

You can try writing to the SD card using Marlin and YAT.

Alternatively, you can use Marlin to test this: In that case you will need an SD card plugged in the SD slot (if you don´t have any, then you cannot use this method, i really doubt you would actually want to execute the Gcode)
-Just flash marlin, but configured for XON/XOFF support
-Open YAT, select port, set port speed to 250000, Protocol to Manual Software Xon/Xoff software flow control and in "text settings" you must select as EOL sequence
-Send the next command to start saving all the received commands to the SD card in a file called gtest.gco:
M28 gtest,gco
-Finally send the supplied file. You should get a bunch of "ok" but no errors.

But i guess it will show the same problems.

Next steps would be to disable the stepper ISR and repeat the above. Or disable the temperature ISR, disable watchdog and repeat the above. We must find out the cause, that is the main issue here..

We could also add the supplied sketch inside Marlin, trigger its execution with M111 and see if it works. The idea is to find out if there is something else causing these problems....

gloomyandy commented 6 years ago

Hi, a couple more data points for you.... I've been testing the bugfix-2.0.x branch from a couple of days ago (the last commit I have is eb1026d). So this version will not have the changes in @ejtagle private build. My printer uses an mega2560 based MKS base 1.3/1.4 clone control board. The PC side windows drivers are the standard Arduino USB serial port setup.

I've been using the latest release of Repetier-Host (version V2.1.2) and a sample gcode file of approx 30000 lines (takes about 2 hours to print using dry run mode). With the default settings of baud rate 250000, TX_BUFFER 0, RX_BUFFER not defined and XON turned off. Then I can run the print with zero errors. However if I change the RX_BUFFER setting to be 1024 (still with XON turned off) then I get 8 errors while running the same print. I've only performed this test once with each setting so far, so perhaps not conclusive in any way, but I thought the result may be of interest as I noticed that the code seems to follow a different path depending upon the buffer size setting.

I'm happy to run more tests if you think this will help.

GMagician commented 6 years ago

@ejtagle I'm using arduino windows drivers to communicate and I confirm that they don't manage correctly XON/XOFF.

RX_BUFFER_SIZE > 256 check shouldn't be RX_BUFFER_SIZE >= 256?

ejtagle commented 6 years ago

@GMagician : Basically, the condition RX_BUFFER_SIZE should be > 256 The reason is that we are using a circular FIFO, so an 8bit index is able to represent up to 256 different values.

My Arduino Mega is a chinese clone that is using the CH340 chip as a USB to serial bridge. And the windows driver works perfectly with XON/XOFF handshake

GMagician commented 6 years ago

@ejtagle but since you test and array size, 256 items in array are 2 bytes

GMagician commented 6 years ago

@ejtagle also mine mega is chinese but I'm sure it doesn't use ch340 but something else. Looking at COM dirvers windows says "copyright arduino.ccc"

ejtagle commented 6 years ago

@GMagician : You will always find (if you find something different, its a bug)

value & (ring_buffer_pos_t)(RX_BUFFER_SIZE - 1);

This always translates to a number < 256 if RX_BUFFER_SIZE <= 256, always fitting in 8bits

I think that the problem we are facing here is that for some reason, RX interrupts are not being serviced as fast as needed under some "undetermined" situation.

If it was related to RX_BUFFER_SIZE, defining RX_BUFFER_SIZE to 128 bytes should solve all problems... Specially with the "ok" polling protocol used by Repetier...

gloomyandy commented 6 years ago

@ejtagle perhaps you missed my comments above. I see no errors with a default buffer size (which I think ends up being 128). Perhaps this is just a coincidence but maybe worth taking a closer look at? @GMagician have you tried testing with a RX_BUFFER_SIZE undefined (which is the default setting I think).

ejtagle commented 6 years ago

@gloomyandy : Quite to the contrary: Your report could reinforce a small theory i have about this problem: If the RX_BUFFER is 256 bytes or less, then there is no need for the critical sections at all. This could mean either that the "critical sections" = disabling the RX interrupts is not working, or that there is some kind of latency between disabling the RX interrupt and the actual disabling of the interrupt. I´ve been looking for any kind of documentation on this kind of latency, but didn´t find anything yet... But, if there is some latency, it could mean reading or writing a half inconsistent value of the rx buffer index head or tail, and that would explain a lot of things

GMagician commented 6 years ago

@ejtagle I also added a "nop" after CBI to try to enforce your theory...but none. Now I'm testing SD transfer. I get error when normal transter. I created now this:

#define DISABLE_TEST bool tisr = TEMPERATURE_ISR_ENABLED(); \
                       DISABLE_TEMPERATURE_INTERRUPT(); \
                       bool sisr = STEPPER_ISR_ENABLED(); \
                       DISABLE_STEPPER_DRIVER_INTERRUPT(); \
                       bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx); \
                       CBI(M_UCSRxB, M_RXCIEx)
  #define ENABLE_TEST  if (tisr) ENABLE_TEMPERATURE_INTERRUPT(); \
                       if (sisr) ENABLE_STEPPER_DRIVER_INTERRUPT(); \
                       if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx)

and i'm going to test it

gloomyandy commented 6 years ago

I was surprised at the relatively low error count I got with a buffer of 1K. It could just be that getting no errors with the default setting is a coincidence? I'm happy to test further if it will help. If you would rather I test with your modified build I can do that, or with a different gcode file or Marlin configuration if you think that would make a difference.

ejtagle commented 6 years ago

Basically, i´ve had success even with a 512byte buffer and XON/XOFF enabled...

GMagician commented 6 years ago

@ejtagle SD tests done....when XON/XOFF is used and with YAT I send gcode file, everything is ok, even with interrupts enabled (your original file is working). Before these test I noticed some errors, don't know if normal. I had power off (only 5V via USB) and since I have bltouch that is not powered (external 5V on servo) printer send an error. Now I have line number errors! a.txt

ejtagle commented 6 years ago

Either we are losing chars, or noise, or something... There is an small modification that can tell the cause: Reading the UCSR0A.DORn . If there is a buffer overrun, then this means something is blocking the RX interrupts...

ejtagle commented 6 years ago

I´ll add support for such debug feature... :+1:

ejtagle commented 6 years ago

Well @GMagician ... Me again, abusing the poor M111 command to print serial statistics...

serial-debug.zip

You need to add to your configuration_adv.h file the following:

#define SERIAL_STATS_RX_BUFFER_OVERRUNS 1
#define SERIAL_STATS_RX_FRAMING_ERRORS 1
#define SERIAL_STATS_MAX_RX_QUEUED
#define SERIAL_STATS_DROPPED_RX

Probably the latter 2 are already enabled.

Then, by issuing M111 you will be able to know if there was a buffer overrun (that means the RX interrupt was disabled for too much time an a RX char was lost. You also will get the Framing Error counter (noise can cause RX problems), and also RX queue maximum size and if some chars were lost due to RX being full at the time of adding an extra char to it...

Hope this allows to know the cause of the CRC errors/line errors...

thinkyhead commented 6 years ago

The serial changes are now isolated in the PRs #10982 and #10983.

ejtagle commented 6 years ago

Yes, the previously attached files were exactly the ones you separated into those PRs. We are still trying to figure out the problem. The problem seems only to happen when printing, but not when transferring files to the SD card. There is no such huge difference between scenarios, except that the stepper ISR rate, when not printing, is reduced to 1000 interrupts per second, as compared with printing, that probably forces an increased rate... I have written an independant sketch that uses the Marlin serial classes and also creates a timer running at 12khz with a delay inside that emulates the same load. And i am unable to make it fail... That is why i added those debug options. We need to know if it is the RX FIFO becoming full, or an internal UART buffer overrun (that will indicate the RX isr is not being run fast enough...)

gloomyandy commented 6 years ago

More testing using @ejtagle updated code with debug stats. Same gcode as above (approx 2 hour print), RX_BUFFER_SIZE 1024, XON/XOFF not enabled. M111 reports the following: 12:02:20.315 : echo:DEBUG:offBuffer Overruns: 2 12:02:20.315 : Framing Errors: 0 12:02:20.315 : Dropped bytes: 2 12:02:20.315 : Max RX Queue Size: 201

I had two checksum errors reported by Repetier. Going to try the same test again with the default buffer size.

GMagician commented 6 years ago

@ejtagle I think you got it.... 15:27:57.381 : echo:DEBUG:offBuffer Overruns: 3 15:27:57.381 : Framing Errors: 0 15:27:57.381 : Dropped bytes: 9 15:27:57.381 : Max RX Queue Size: 1023

EDIT: RX queus size = 1023 but repetier has 768 as cache...how this is possible?

gloomyandy commented 6 years ago

Completed a run with the default RX_BUFFER_SIZE (and obviously no XON/OFF). No errors reported by Repetier. M111 gives: 17:23:00.744 : echo:DEBUG:offBuffer Overruns: 0 17:23:00.748 : Framing Errors: 0 17:23:00.748 : Dropped bytes: 0 17:23:00.749 : Max RX Queue Size: 126

Interesting that the max queue size is lower in this mode as well as no errors.

So I assume this means that the RX interrupts are off for too long in some situation? Any idea what that situation may be? Is it just a combination of all of the other interrupts happening and making the off period exceptionally long? Or do you think it is just one of the other interrupts is taking an especially long time? As I said before I'm using the main bugfix-2.0.x from a few days ago. I have S curve enabled and also junction deviation. Is there any way that some combination of events could be leaving RX ints turned off after the call has completed?

ejtagle commented 6 years ago

@gloomyandy : In your case, on the first try, you got 2 overrun errors - Those are counted both as Overrun errors and as Dropped bytes - So, the problem is just that some RX isrs were not serviced in time, and the received character was lost - Going at 250000 bits per second means a 25Khz serial ISR rate... Pretty high...

On your 2nd try, there are not errors at all. No buffer overruns (means all RX interrupts were serviced in time), No framing errors (no noise in the communication), No dropped bytes (Means that the FIFO queue never became full)... As you mention, there is something strange: Your 2nd try, even using an smaller FIFO, is using less entries ... 126 bytes maximum, and you get no overruns at all..

@GMagician : Your case is even more strange! ... There were 3 overruns (RX isrs not being serviced fast enough, and 9 -3 = 6 dropped bytes just because the FIFO became full and there was no space to store the received bytes...

What is the difference between things ? ... If the RX FIFO size is less than or equal to 256, no disabling of the RX interrupt is done... Maybe that helps a bit with the RX isr ?

I have 2 different theories here: One of them is that disabling the RX interrupt and reenabling it sometimes loses RX interrupts... It should not happen, but the documentation on this is missing.. And Dropped chars could be explained by that - And that would also explain why a critical section solves those issues...

The other is that something is just taking too long to complete with RX interrupts disabled.

@GMagician Can you try setting your RX fifo size to 256 or less and see if it solves the problem ?

We all agree that "ok" protocol, as used by Repetier should prevent complete buffer overruns. And it should be bomb-proof .. The only thing that could fool that protocol is if the serial (windows) driver is either repeating received lines, or, if it delays the answer... and Repetier fails to drain it, so essentially, Repetier could be taking a previous "ok" and associating it to an incorrect line...

GMagician commented 6 years ago

RX buffer = 256 1hr print and no issues...I have to wait 1 more to finish but never happened without critical sections

GMagician commented 6 years ago

@ejtagle

The other is that something is just taking too long to complete with RX interrupts disabled.

I will try to disable also temperature and isr interrupt (code posted some post above) and test. Don't know how many other interrupts are enabled but this test should check if your 2nd theory is right.

At the moment it really seems that disabling RX interrupt sometime miss pending interrupts (answer given by 256 buffer size)

ejtagle commented 6 years ago

Believe it or not, there IS a way to perform atomic reads of 16 and/or 32bit values without disabling ISRs. The implementation consists on reading until 2 consecutive reads return the same value... I'll do the implementation for the case where RX_BUFFER > 256, just for completness sake and XON/XOFF handling... ;)

gloomyandy commented 6 years ago

Yep I was thinking this might be a way forward. The code is not pretty (and you need to be careful that the compiler does not optimize it in some unexpected way!). I'm currently testing disabling the RX int with a buffer size of 256 just to see if I still get errors. This will hopefully confirm that it is not simply a case of the 256 byte buffer code being more efficient.

ejtagle commented 6 years ago

The 256 byte buffer is marginally mor efficient, but has the added benefit that under AVR, setting an 8bit memory variable, or reading an 8bit variable is an atomic operation. That is why there is no need to disable the RX isr when reading or writing an 8bit index, The 'volatile' keyword and some clever usage of memory barriers can solve and ensure the "atomic" read of 16bit variables by reading more than once until you get the same value does not get optimized out...