Closed GMagician closed 6 years ago
Done some test, issue appeared after #10911 (serial files changes) I'll try to figure out why
Note: my cnf has XON/XOFF, RX = 1024 and TX = 0
I think it can be related to missing 'writeNoHandshake' function by I'll have to check (No xoff char sent)
@GMagician ; The XOFF char should be sent inside the store_rxd_char() function.
if (rx_count >= (RX_BUFFER_SIZE) / 8) {
Maybe i missed something...
the writeNoHandshake() was removed because, after the cleanup, it ended calling the write() function directly...
The logic behind XON/XOFF is quite simple: Read buffer fills only in store_rxd_char(), so that is the only place that can cause an XOFF char to be sent. And read buffer only empties in read(), so that is the only place where XON should be sent.
Slic3r sees that line number error all the time on connect
I tried XON/XOFF and the codes are being sent as expected... Maybe something has changed in the timing?
@ejtagle I'm investigating and I have to ask you why have you added
// clear the TXC bit -- "can be cleared by writing a one to its bit
// location". This makes sure flush() won't return until the bytes
// actually got written
SBI(M_UCSRxA, M_TXCx);
when Tx interrupt is disabled? since interrupts are not used this should be useless (and remove it save some cycles)
well, there is nothing wrong with this code..just a question. when cleaning receive interrupt bit, if some char will arrive, interrupt will be pending and generated after bit is re-set?
@ejtagle I think about a race condition. What about MarlinSerial::write
after data register empty test if rx interrupt occurs with buffer full? inside interrupt will be sent XOFF but immediately after returning from it is written a new byte
@GMagician : On your first question... Probably, if not using TX interrupts the SBI(M_UCSRxA, M_TXCx);
can be removed, as it is unneeded 👍
On your 2nd question, yes: The interrupt reception bit is just a mask. It uses the RXCn bit of the UCSRnA register to determine if an interrupt should be generated or not. The RXCn is readonly, the only way to clear it is by reading the received data from the serial port
Race conditions is what could be happening... If the RX buffer is full and you get an RX interrupt, then the XOFF char was already sent by store_rxd_char()
when the buffer was becoming full
// for high speed transfers, we can use XON/XOFF protocol to do
// software handshake and avoid overruns.
if ((xon_xoff_state & XON_XOFF_CHAR_MASK) == XON_CHAR) {
// calculate count of bytes stored into the RX buffer
ring_buffer_pos_t rx_count = (ring_buffer_pos_t)(rx_buffer.head - rx_buffer.tail) & (ring_buffer_pos_t)(RX_BUFFER_SIZE - 1);
// if we are above 12.5% of RX buffer capacity, send XOFF before
// we run out of RX buffer space .. We need 325 bytes @ 250kbits/s to
// let the host react and stop sending bytes. This translates to 13mS
// propagation time.
if (rx_count >= (RX_BUFFER_SIZE) / 8) {
The first 'if' clause makes sure once an XOFF is sent, it will be never sent again unless an XON was previously sent.
The conditions to send an XOFF is if the count of unread chars is rx_count >= (RX_BUFFER_SIZE) / 8 , in your case, 128 bytes. And the buffer has margin for 896 extra bytes
The only way to send an XON is if the count of queued chars is at the read() function
if (rx_count < (RX_BUFFER_SIZE) / 10) {
You could try 2 things: Changing the limits (for example, use an XOFF limit of RX_BUFFER_SIZE) / 16
instead of RX_BUFFER_SIZE) / 8
And use an XON limit of 0, instead of if (rx_count < (RX_BUFFER_SIZE) / 10) {
Also, could try to decrease speed ... The idea is to find out the culprit... I don´t know if there is a way to log the communication from the PC side. On Windows, i have successfully used Realterm with XON/XOFF control to send files,..
Getting a raw log would be useful, as you would get the XON/XOFF chars (coded 0x11 and 0x13) to analyze...
Also, the Marlin code already has counters for dropped chars, with that you can ensure the problem is dropped chars (=the host either not receiving the XON/XOFF chars or sometinh else that causes delay in the processing of those chars...)
I've done some test and restoring some "CRITICAL_SECTION" calls will solve...continuing
@ejtagle what i'm supposing about race is this: when 'write' function is sending something, it test for data register free, if after this test and char send an rx interrupt is received and rx buffer is full, rx interrupt will send XOFF but as soon as interrupt exits, 'write' will overwrite xoff with 'c' and xoff will be lost (and maybe also 'c'). This was already present in previous version, but since this may be race issue, different execution times may raise new pending bugs
it seems that CBI/SBI(M_UCSRxB, M_RXCIEx);
are not doing what expected. Replaced them with CRITICALSECTION?!? will solve issue (in 'available' and 'read' functions is enough to see a big difference)
Humm... That is very strange... But maybe you are giving the answer... ;)
I have to admit that it does not make sense, unless the interrupt is somehow already flagged at the interrupt controller... I really would like to avoid critical sections, because they add latency to the Stepper ISR...
I disassembled the code itself, and there is an interesting race condition here...
ring_buffer_pos_t MarlinSerial::available(void) {
#if RX_BUFFER_SIZE > 256
const bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx);
CBI(M_UCSRxB, M_RXCIEx);
#endif
const ring_buffer_pos_t h = rx_buffer.head, t = rx_buffer.tail;
#if RX_BUFFER_SIZE > 256
if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx);
#endif
return (ring_buffer_pos_t)(RX_BUFFER_SIZE + h - t) & (RX_BUFFER_SIZE - 1);
}
This is assembled to the following:
ring_buffer_pos_t MarlinSerial::available(void) {
#if RX_BUFFER_SIZE > 256
const bool isr_enabled = TEST(M_UCSRxB, M_RXCIEx);
1a938: 20 91 c1 00 lds r18, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
CBI(M_UCSRxB, M_RXCIEx);
1a93c: 80 91 c1 00 lds r24, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
1a940: 8f 77 andi r24, 0x7F ; 127
1a942: 80 93 c1 00 sts 0x00C1, r24 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
#endif
const ring_buffer_pos_t h = rx_buffer.head, t = rx_buffer.tail;
1a946: 80 91 44 11 lds r24, 0x1144 ; 0x801144 <rx_buffer+0x400>
1a94a: 90 91 45 11 lds r25, 0x1145 ; 0x801145 <rx_buffer+0x401>
1a94e: 40 91 46 11 lds r20, 0x1146 ; 0x801146 <rx_buffer+0x402>
1a952: 50 91 47 11 lds r21, 0x1147 ; 0x801147 <rx_buffer+0x403>
#if RX_BUFFER_SIZE > 256
if (isr_enabled) SBI(M_UCSRxB, M_RXCIEx);
1a956: 27 ff sbrs r18, 7
1a958: 05 c0 rjmp .+10 ; 0x1a964 <_Z22get_available_commandsv+0x66>
1a95a: 20 91 c1 00 lds r18, 0x00C1 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
1a95e: 20 68 ori r18, 0x80 ; 128
1a960: 20 93 c1 00 sts 0x00C1, r18 ; 0x8000c1 <__TEXT_REGION_LENGTH__+0x7000c1>
#endif
return (ring_buffer_pos_t)(RX_BUFFER_SIZE + h - t) & (RX_BUFFER_SIZE - 1);
1a964: 84 1b sub r24, r20
1a966: 95 0b sbc r25, r21
1a968: 93 70 andi r25, 0x03 ; 3
1a96a: 89 2b or r24, r25
1a96c: 09 f4 brne .+2 ; 0x1a970 <_Z22get_available_commandsv+0x72>
1a96e: 21 c1 rjmp .+578 ; 0x1abb2 <_Z22get_available_commandsv+0x2b4>
false);
}
You will notice that the instruction CBI(M_UCSRxB, M_RXCIEx);
is implemented as a
-Read register -Clear bit7 -Write register.
The same register also holds the TX interrupt, so, if for some reason a TX interrupt happens between the read and the write, and that was the last TX interrupt, and it is disabled in the ISR, then TX interrupts will be reenabled and the TX isr will be called when no char is available. That is a big problem.... So, there is a problem here...
@GMagician : After careful analysis: You are absolutely right!! ... There is a hidden race condition in the code... Unfortunately CBI/SBI(M_UCSRxB, M_RXCIEx);
are not atomic operations on AVR :(
That means that in MarlinSerial::available() / MarlinSerial::read(), there is a chance the RX isr will take an incorrect state, because the M_UCSRxB is in the middle of being altered.
If that happens, then the XON/XOFF could perhaps end not being sent..
This is a very unfortunate event. I am rethinking if there something that could be done...
AVR does have ops to set/clear individual bits in I/O ports atomically and judging by the macro names marlin code did assume that we'll end up with those instructions. I wonder why compiler didn't generate them. https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_SBI.html https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_CBI.html
@Artem-B that can be done only with I/O registera 0-1f....this one is c1 (if correctly understood chip DS) @ejtagle how did you get disassembler? this could be useful instead of trying code changes.... @ejtagle also peek has the same issue....
@Artem-B Because, unfortunately, AVR has no instructions to set the variables or the UCSRnA register bits atomically, and that is what we need. Don´t worry: After a lot of thought, the race conditions can be safely handled (i think!)
Attached is my attempt to handle and fix them. If anyone @GMagician or you want to test them, it would be great! (you will see i am trying to avoid critical sections, to improve pulse jitter of the stepper drivers) MarlinSerial.zip
@GMagician : If you have the Marlin.ino..elf file, you can use avr-objdump -S -d Marlin.ino.elf and you will get the disassembled version with source code intermixed. It´s very handy sometimes.
@ejtagle I'll give a try this evening but in my configuration TX interrupt should not be enabled (TX buffer is 0).
It should work also for TX buffer = 0. Unfortunately, the race condition is still there. The idea is to learn to live with it. The only doubt i have right now is if there is any latency between disabling the RX interrupt and no RX isr being called...
(if there is any latency, then this file will show the same problems as before...) Right now, the idea is never to touch the RX ISR enabling at the ISR itself. And handle disabling of TX isrs with an implicit retry if it fails
The idea is to learn to live with it
can't agree, sorry. When host will get 'wrong line error' will resend line and that's ok but when marlin says 'echo:unknown command' reporting some parts of a real G1 line maybe this will lead to something wrong... If I have to choice about jitter on stepper (some distortion on printed part) or missing printed parts (missed commands) I prefer first, what if missed command is T1?
Let me correct myself: The idea is not to learn to live with the serial errors. My idea is to implement the serial class in such a way that, even if there is a race condition, it perfectly works and sends the XON/XOFF chars reliably and does not lose any chars. I also think that is unacceptable to have unreliable communication, and if it can´t be solved without the critical sections, then be absolutely sure we will use them! - I repeat, unreliability of the Xon/Xoff algorithm is not acceptable at all. I am just trying to figure out if there is a way to avoid the critical sections, if possible, without comprimising integrity of the communication 👍 - Sorry for not being as clear as i should. English is not my native language... ;)
@ejtagle
Sorry for not being as clear as i should. English is not my native language...
we have the same problem, so forgive me if I ever will be rude with my sentences
I still have a doubt about this race condition, already present before:
void MarlinSerial::write(const uint8_t c) {
while (!TEST(M_UCSRxA, M_UDREx)) sw_barrier();
--> rx interrupt here that will send xoff
M_UDRx = c;
}
even if less probable, if h/w can't shift out 'XOFF' byte new char 'c' will corrupt sent data
@GMagician: You are absolutely right on that race condition... Sending Xoff from the RX isr is a big problem...I think i have a solution for that... Excellent analysis !!
@ejtagle I tested attached zip but it doesn't solve, I expected that since TX_BUFFER = 0 means no tx interrupt.
@GMagician : I am working on a fix for the race condition. I will stress test it to avoid wasting your precious time... 👍
@ejtagle
use avr-objdump -S -d Marlin.ino.elf
I'm using platformio where do I find avr?
@ejtagle
I will stress test it to avoid wasting your precious time...
don't worry for my time, since I'm the only one to have this issue i'm glad to test it but I'll not be available for next 2 days.
@GMagician
MarlinSerial.zip This is my latest version. I wrote an sketch to test it, and XON/XOFF are being sent as neccesary. I even tried with a 512 rx buffer with no lost bytes when using XON/XOFF
I've tested code not solved, still get checksum errors
Note that _written and flushtx are not used when TX_BUFFER = 0 so you can restore that parts
I even tried with a 512 rx buffer with no lost bytes when using XON/XOFF
no xon/xoff with 512 bytes...look below
#if RX_BUFFER_SIZE >= 1024
// Enable to have the controller send XON/XOFF control characters to
// the host to signal the RX buffer is becoming full.
#define SERIAL_XON_XOFF
#endif
@ejtagle where did you get avr.exe? which app install it?
Let´s go in order...
_written and flushTX are not used in the class itself, but there is a macro referencing them.. After close inspection, seems the macro (at core.h) is only defined when a TX buffer is used... In that case, you are right, and we can remove them
-I tried by also changing that conditional #if RX_BUFFER_SIZE >= 1024...
, just to stress a little bit more the library...
I use Arduino to compile it, and it already installs the AVR compiler. But, you can get the avr-objdump by installing http://www.microchip.com/mplab/avr-support/avr-and-arm-toolchains-(c-compilers) or http://blog.zakkemble.co.uk/avr-gcc-builds/
Attached an standalone Arduino sketch that i used to test the MarlinSerial hal.. It just counts characters received ... sketch_jun05a.zip
So, i can just test the serial class with no Marlin or other interrupts running...
I can see errors only when printer is printing...With no marlin, repetier is not able to proceed
I know, i know. Right now the problem is how to proceed to determine the cause of the problems... Is it Marlin (not sending the XON/XOFF chars) ? ... Is it repetier, not recognizing the characters on time ?
I also tried with buffer = 512 (so no xon/xoff) and I got again crc error. So is not an issue related to xon/xoff
A solution for this bug would be most welcome, since it's one of the critical issues holding back the 1.1.9 release. Reverting to CRITICAL_SECTION_START
is also acceptable, because serial handshaking is not a major bottleneck in achieving reasonable print speeds.
But all the race conditions you pointed out were right. Does repetier use XON/XOFF even if the printer does not support it ? Maybe they are adding to the CRC the XOFF/XON chars...
@thinkyhead : My main concern here is not if reverting or not will fix it. It is not understanding the cause of the problem. Critical sections could be masking a more subtle problem.. All the tests i have done, transferring HUGE amounts of information on purpose, just to make it fail, did not fail at all. That´s something to worry about...
The thing is, we have an obligation to the public and we cannot wait forever for a solution, allowing one issue to restrain all the other fixes and improvements that we need to get out to users. If the previous code works better in the real world, we will simply have to accept it and go back to the old code, whether we understand why it works or not.
The same will have to go for the endstop code now leading to inexplicable errors in dual-axis homing. If we can't figure out why the new and "better" code is broken for these users, then we must go back to the old and "worse" code that has the virtue of working.
Maybe they are adding to the CRC the XOFF/XON chars...
Very god point. What makes you think RepetierHost would work in XON/XOFF mode at all? XON/XOFF is relative new to Marlin. The "ok"-protocol always worked without XON/XOFF. Regardless of using ping-pong or the advanced buffersize (default) method. (Important is to tell RepetierHost about Marlins RX buffer size and it seems to help is you tell it only about 3/4 of Marlins buffer size (not tested in RH versions above 2.0).)
@repetier Can RH handle XON/XOFF?
I redid all my tests. Using a Windows Serial terminal called YAT (https://sourceforge.net/projects/y-a-terminal/), i sent a full gcode file to the SD card (with line numbers and checksums). 2megabytes. Not a single error or retransmission or anything. If disabling XON/XOFF i get lots of overruns. So, i suspect @AnHardt could be right here...
No, we do not support XON/XOFF handling. We only rely on our own traffic counting a sthis is compatible with all versions.
With last 2.0, using repetier, I see so many USB errors:
mega2560 and ramps 1.4 + XON/XOFF
print seems to be working due to resend action