Closed DouglasPearless closed 7 years ago
ESP8266 and Telnet communication are not yet fully tested. They are in a beta stadium and will be officially released in a future version.
What you describe sounds like a communication stabillity problem. The gcode file is buffered in lw.comm-server and sent line by line to the machine. The server first sends a few lines to fill the planner queue of the firmware and then waits for "ok" to send the next line (and so on). If not all ok's are received, the server gets out of sync and stops working when the planner queue gets empty. When that happens, you should be able to pause and resume to restart the command streaming. Can you check that?
Hi Claudio,
Yes, sometimes the pause-resume works, but most often it does not :-(
Is there anything I can do to help debug this?
Cheers Douglas
On 6/10/2017, at 10:41 PM, Claudio Prezzi notifications@github.com wrote:
ESP8266 and Telnet communication are not yet fully tested. They are in a beta stadium and will be officially released in a future version.
What you describe sounds like a communication stabillity problem. The gcode file is buffered in lw.comm-server and sent line by line to the machine. The server first sends a few lines to fill the planner queue of the firmware and then waits for "ok" to send the next line (and so on). If not all ok's are received, the server gets out of sync and stops working when the planner queue gets empty. When that happens, you should be able to pause and resume to restart the command streaming. Can you check that?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/LaserWeb/LaserWeb4/issues/422#issuecomment-334708547, or mute the thread https://github.com/notifications/unsubscribe-auth/ADkh-0EDHqbUpY3EiF66xiqnCLi-FwbLks5spfXTgaJpZM4Pv4jG.
It probably would help if you could send me a log-file of the communication.
Hi Claudio,
In reading the lw-comms.server source code I noticed what appeared to be various timeouts so I went:
cd /Applications/LaserWeb.app/Contents/MacOS
export LOG_LEVEL=3
export SMOOTHIE_WAIT_TIME=5
./LaserWeb
and I was able to run a 5 minute job for the first time end-to-end. :-)
I then tried a very large 50mb file and it ran for quite a while then stopped, I could not pause-resume to make it continue and I did notice this in the logfile.txt:
2017-10-06 10:53:39 Sent: X208.00 S0.0967 Q: 3120068
2017-10-06 10:53:39 Telnet: X208.00 S0.0967
2017-10-06 10:53:39 Telnet:
2017-10-06 10:53:39 Telnet: ok
2017-10-06 10:53:39 Sent: X208.10 S0.0810 Q: 3120068
2017-10-06 10:53:39 Telnet: X208.10 S0.0810
2017-10-06 10:53:39 Telnet:
2017-10-06 10:53:39 Telnet: get statuos
2017-10-06 10:53:39 Telnet: k
2017-10-06 10:53:39 Telnet: <Run,MPos:207.2342,56.2275,10.0000,WPos:207.2342,56.2275,10.0000>
2017-10-06 10:53:39 Telnet:
2017-10-06 10:53:40 Telnet: get status
2017-10-06 10:53:40 Telnet:
2017-10-06 10:53:40 Telnet: <Idle,MPos:208.1000,56.2500,10.0000,WPos:208.1000,56.2500,10.0000>
2017-10-06 10:53:40 Telnet:
2017-10-06 10:53:40 Telnet: get status
2017-10-06 10:53:40 Telnet:
2017-10-06 10:53:40 Telnet: <Idle,MPos:208.1000,56.2500,10.0000,WPos:208.1000,56.2500,10.0000>
of particular interest is:
2017-10-06 10:53:39 Telnet: get statuos
2017-10-06 10:53:39 Telnet: k
Thoughts?
I can send the whole logfile.txt if you want :-)
I just re-ran the large job and the same thing:
2017-10-06 11:10:05 Telnet: ok
2017-10-06 11:10:05 Sent: X288.30 S0.5647 Q: 3120068
2017-10-06 11:10:05 Telnet: X288.30 S0.5647
2017-10-06 11:10:05 Telnet:
2017-10-06 11:10:05 Telnet: get statutsk
2017-10-06 11:10:05 Telnet:
2017-10-06 11:10:05 Telnet: <Run,MPos:287.8671,50.0313,10.0000,WPos:287.8671,50.0313,10.0000>
2017-10-06 11:10:05 Telnet:
2017-10-06 11:10:05 Telnet: get status
2017-10-06 11:10:05 Telnet:
2017-10-06 11:10:05 Telnet: <Idle,MPos:288.3000,50.0500,10.0000,WPos:288.3000,50.0500,10.0000>
2017-10-06 11:10:05 Telnet:
2017-10-06 11:10:05 Telnet: get status
If looks like (perhaps) there needs to be some of retry X times if it does not get an 'ok' ?
The first log looks like the o was received before the last s of "status". In the second case the LF after status and the o was completely missed. This shows that there is a general communication problem that could also affect the sent gcode. I'm not even sure if telnet is a good choice for such a file streaming.
Did you try the normal ESP8266 communication via websockets?
The ok is absolutely needed with serial (RX/TX over ESP8266) as this is the only flow control.
USB communication has additional hardware flow control that avoids buffer overflow.
OK, same problem occurs over WebSockets, could a relatively simple solution (without knowing the source code :-) ) be something like this pseudo code:
while running
if (return_code_from_smoothie <> "ok") //note it is likely that the command sent was executed but the reply lost
{
if ((status == "<Idle") ||(status == "<Run")) //note this is just the first few characters of the string returned
{
return_code_from_smoothie = "ok"; //clean up the lack of flow control
}
}
Nope, this will not work. You will get out of sync and just overflow the firmware buffer. The status report (run/idle/position) is queried by a timer independant of how fast the gcodes can be executed. So you could get multiple status reports while waiting for one gcode to finish. If you are sending new gcode lines on each idle or run status, you will overflow the firmware buffer.
It doesn't make sense to tinker. You need to solve the communication problem (scrambled / lost digits). I belief this is caused by the ESP8266 code, because it does no flow controll at all.
Do you gather received characters until you get a \n and then send the hole line in one, or just path through every received character?
By the way: You can connect a Raspberry Pi to the USB port of the machine, execute the lw.comm-server on the Pi and run the frontend on any network pc. This works stable.
It's a ESP8266 shortage issue. Closing.
I would like to re-open this issue as I have found a workable solution.
In summary, if for what ever reason, the communication is corrupted or lost back from the Smoothie over WiFi for either Websockets OR telnet, server.js has no mechanism to recover from this situation, it always assumes that it will receive an 'ok' and stops transmitting if it does not. Sometimes a pause-resume would fix the situation, but often it did not.
After working on this for some three weeks on and off I have a solution that allows server.js to recover from incomplete or lost transmission replies back from Smoothie.
This has been tested with a >50MB file over telnet and WiFi which took more than 5 hours to run and previously never got more than 5% of the job without a communications failure.
I propose to submit a pull request with this new code for review and hopefully including into LaserWeb.
Broadly speaking, here is the key logic changes to server.js which are the last two if' statements below:
var data = '';
while (responseArray.length > 0) {
data = responseArray.shift();
writeLog('Telnet: ' + data, 3);
if (data.indexOf('ok') === 0) { // Got an OK so we are clear to send
if (firmware === 'grbl') {
grblBufferSize.shift();
}
if (firmware === 'repetier' || firmware === 'marlinkimbra') {
reprapBufferSize++;
}
blocked = false;
send1Q();
} else if (data.indexOf('<') === 0) { // Got statusReport (Grbl & Smoothieware)
if (blocked === true && data.indexOf("<Run") === 0) {
//if we are still blocked and we get "<Run" then we lost the 'ok' back from Smoothie and need to recover
// as while Smoothie has stuff in it's buffer, we have stopped sending and have more to send
console.log("Fixing <Run =>"+data);
blocked = false;
send1Q();
}
if (blocked === true && data.indexOf("<Idle") === 0) {
//if we are still blocked and we get "<Idle" then we lost the 'ok' back from Smoothie and need to recover
// as Smoothie's internal buffer is now empty and we have more to send
console.log("Fixing <Idle =>"+data);
blocked = false;
send1Q();
}
Cheers Douglas
//if we are still blocked and we get "<Run" then we lost the 'ok' back from Smoothie and need to recover
// as while Smoothie has stuff in it's buffer, we have stopped sending and have more to send
This is definitely not true. Please read up on sync vs. async messages.
ok, I have done a quick search of the wiki and cannot find that information, where is that documented? :-)
Quick search doesn't work well with smoothie docs unfortunately. Smoothie's protocol is similar to grbl's "simple" streaming protocol with differences primarily in the async commands. grbl's doc: https://github.com/gnea/grbl/wiki/Grbl-v1.1-Interface
Ah, you are referring to the Smoothie documentation; I work with, and develop with Smoothie firmware and am usually deep in the code with a JTAG adding new features :-)
If I understand what you are saying, you are referring to the kernel.cpp program, specifically:
// return a GRBL-like query string for serial ?
std::string Kernel::get_query_string()
{
std::string str;
bool homing;
bool ok = PublicData::get_value(endstops_checksum, get_homing_status_checksum, 0, &homing);
if(!ok) homing = false;
bool running = false;
str.append("<");
if(halted) {
str.append("Alarm");
} else if(homing) {
running = true;
str.append("Home");
} else if(feed_hold) {
str.append("Hold");
} else if(this->conveyor->is_idle()) {
str.append("Idle");
} else {
running = true;
str.append("Run");
}
Is that correct?
@tbfleming Can you possibly be a bit less cryptic and offer Douglas some direct information rather than telling him to go away and rtfm? He is trying to help improve Laserweb and i'm sure you should appreciate any developer willing to do that?
@tbfleming I am rather familiar with a lot of the Smoothie source code and can see where and when the "Run" and 'Idle" responses are generated and why (due to a query request a '?' in GRBL parlance, or a "get status" request.
If Smoothie's conveyor (its internal processing of "blocks") has run out of things to process it returns "Idle" and "Run" if it is still running.
I was using this as the basis of my analysis of server.js and developed the solution above :-)
If I am missing something here (quite possible as only have a single coffee today), then let me know so I can re-work the solution :-)
Cheers Douglas
Here's how grbl has behaved since before smoothie forked from it. Did smoothie change this?
When this happens, the above patch will resume sending, overflowing buffers.
@tbfleming Can you possibly be a bit less cryptic and offer Douglas some direct information rather than telling him to go away and rtfm? He is trying to help improve Laserweb and i'm sure you should appreciate any developer willing to do that
I can't get any less cryptic than saying that won't work. Maybe I shouldn't say RTFM with smoothie since there's not much M to R, other than for end users.
Repeating what @cprezzi said, which still stands:
Nope, this will not work. You will get out of sync and just overflow the firmware buffer. The status report (run/idle/position) is queried by a timer independant of how fast the gcodes can be executed. So you could get multiple status reports while waiting for one gcode to finish. If you are sending new gcode lines on each idle or run status, you will overflow the firmware buffer.
@tbfleming now I better understand what you are telling me, especially around the buffer overflow and from here I will go back and review how Smoothie is handling its flow control and then report back here, and as required, update my proposed solution.
Here's a specific example where it will go wrong, even if Smoothie's ? handling is not nearly as async as grbl's is:
send: ?G1X100
recv: <Run...>
recv: OK
The first recv will throw things off with that patch.
Thanks for the example, it helps clarify my understanding.
In terms of all the use-case that I can find in reading the source code for Smoothie, below are the ones where an 'ok' is sent back:
Reasons to reply “ok”, in response to: (0) Initial connection to Smoothie (1) Empty string received (2) M110 (Set Current Line Number) (3) M999 when Smoothie is HALTED (4) G53 code (both a valid code and an invalid code) (5) M28 code (Start an upload) (6) M29 (both if the upload completed successfully or writing the file failed) (7) M112 (Emergency stop) (8) M117 (send arbitrary text to the UI Panel (9) M500 (save volatile settings to config-override) (10) M502 (deletes config-override so everything defaults to what is in config) (11) M504 (save to specific config override file) (12) M1000 (pass through lower case commands) (13) A comment (line starts with a ‘;’) (14) A valid G or M code (15) I am not 100% clear on the GRBL responses; still researching that.
I am currently reviewing how the buffer full situation is handled and will post that shortly so I can unsure I am catering for all (known) use-cases.
Cheers Douglas
Don't forget to account for ESP8266 buffer overflow. It's probably what's setting off this whole mess.
I agree; the Arduino websocket sketch has an internal buffer of 1460 characters; the esp-link code i am using for telnet has a 1460 receive and a 1460 transmit buffer and in Smoothie, it has a ring-buffer appears to allow up to 255 characters per item in the ring buffer (unless I have mis-read the source code). The fundamental issues are: (1) What to do if a transmission is corrupted or lost (2) How to reliably handle when the buffer is full is handled as well.
Related to (1): how to reliably detect if a transmission is corrupted or lost and which direction it was lost in. Consider what happens when a G0 or G1 is lost while in G91 mode.
Hmm, a good point and without proper handshaking, guaranteeing the delivery and confirming receipt it is rather difficult. :-)
The (probably) best way is to is add a sequence number and a CRC to each transmission and track the "lines" sent, that would require changes to Smoothie and server.js which at the moment I do not have time to properly implement at the moment; I think the interim solution I have proposed is (IMHO) a step in the right direction as it seems to work well as rather than simply lose a job, it tries to keep processing; thoughts?
I don't believe it's a step in the right direction since it overruns buffers.
Just to remind: If we would have a "handshaking" problem, we sould see that also with USB serial connection, which we don't!
The problem you see (missing/scrambled chars) is probably caused by the ESP8266 code and has to be solved there. Can you provide the code you are using on the ESP8266?
@DouglasPearless Adding CRC and/or sequence numbers would degrade the data to overhead ratio, which results in lower reachable feeds (escpecially on raster engraving). Not very welcome!
@cprezzi actually the USB protocol has quite extensive error control and recovery built in :-)
Considerable error checking and error handling features have been built in to the USB to ensure that it is a reliable method of connecting peripherals to a PC. Data integrity should be comparable to that of an internal expansion bus.
Immunity from data corruption by noise and spikes has been provided by the use of differential logic drivers and shielded cabling. When errors do occur, cyclic redundancy checks (CRCs) performed separately on both the control and data fields of packets will enable 100 per cent recovery of both single and double bit errors. Unrecoverable errors can be detected with a high degree of confidence.
A self-recovery mechanism is built into the messaging protocol, with time-outs for lost and invalid packets. Some error recovery is built into the hardware. The host controller will retry a failed transaction three times before reporting an error to the client software. How a reported error is dealt with is the responsibility of the client software.
Interrupt and bulk data transfers conclude with a handshake packet to provide confirmation that the data was received, or request that it be re-sent if it was not. Delivery of this data is therefore guaranteed, even if the time taken to deliver it is not.
There is nothing like that for what is effectively a UART serial connection, and I believe is an issue as any WiFi dropout will effectively corrupt the transmission, whether a buffer issue (e.g. ESP8266) or an Access Point issue, WiFi channel congestion, etc etc.
So we have a very nice reliable USB connection available, but the serial LW connection, irrespective of the protocol (telnet, web socket, etc) is not providing the same level of robustness our users would expect; that is the issue I want to solve.
More than happy to explore further and totally agree about not want to degrade the throughput so need a smart solution for this (long term).
I guess that beings me full circle in the logic of my response, specifically that I believe need to plan and introduce a (perhaps staged) approach to making LW as reliable over its serial (WiFi) connection as it is over the USB connection;
Thoughts?
Cheers Douglas
@tbfleming Do you mean the ESP-Link buffers over-running?
If so, my thoughts are (and correct me if I have got this wrong :-) ):
(1) LW (server.js) uses "ping-pong" communication where and 'ok' is received back after each transmission
(2) The assumption that I am using is that there is only ever a very small number of 'conversations' active, typically a single one due to (1), i.e. server.js is waiting for the response from the request it made
(3) Therefore, as we know we sent a command, and without an underlying guaranteed delivery / acknowledgement mechanism, it order to try and continue processing (as apposed to simply failing), if we do not get an 'ok', I am examine the response and try to recover the situation. I have been through the Smoothie source code and believe the least-worst way is as discussed above:
(a) If "<Run" and we are blocked, then I have interpreted that as the Smoothie still has blocks in its buffer (up to 32 of them) to process and the 'ok' was lost and to send the next command, BUT let the rest of the server.js logic process the "<Run" as usual as it appears to use that to update the GUI?
(b) If "<Idle" the Smoothie has an empty buffer and process as per (a). This at least allows the job to continue and my testing to date indicates that this works, BUT I have not tested it on an actual device (Emblaser 2) yet.
I recognise that the G91 relative move is an issue, but that issue exists today if the original transmissions corrupt in a way that delivers a valid, but wrong message to the laser cutter (for example we sent go to X123 and X132 was received), and I note that the probability of this is not necessarily that high :-)
@cprezzi The code I am using is from here https://github.com/jeelabs/esp-link :-)
I agree we should only support robust communication paths. USB works. LW to comm server works over both Ethernet and wifi. ESP8266 Is a failed experiment; time to remove support.
@tbfleming I suggest that rather than removing ESP support we fix it (which is what I am working on) :-)
TCP-over-wifi is reliable. That means the problem lives entirely on the ESP side, so the fix must be entirely on the ESP side and shouldn't need any changes to lw-commserver. I wish I noticed this earlier; it would have cut this conversation a lot shorter.
2 likely causes:
@tbfleming thanks for the pointers
2: The ESP's data sheets and errata if they ever documented that chip? Ask the manufacturer?
I found a long discussion about problems with the esp-link transparent TCP-UART communication at https://github.com/jeelabs/esp-link/issues/46.
I belief that ESP-Link is just too big and complex for this simple task and therefore has not enough ressources left for the buffers. I would suggest to try simpler code like our old LW3 ESP8266 Serial Bridge (https://github.com/LaserWeb/LaserWeb3-ESP8266), but i'm not sure if this will be any better.
@cprezzi thanks for the pointer, very interesting reading; I will digest the thread and think about the approach.
@cprezzi I have spent a large amount of time trying to resolve this and have found the problem was deep in the Smoothie code, and was in part the way it interfaced with the underlying MBED libraries to the UART and interrupts and a whole lot more. Currently testing indicates that I have solved the issue (famous last words). The problem was not the ESP8266 nor LaserWeb, but I have moved from WebSockets to Telnet as I found WebSockets just too slow and have been able to run 5Mb and 10Mb jobs through to SmoothieV1 with no issues.
🎉 🎉 A-W-E-S-O-M-E 🎉 🎉
@DouglasPearless Could this imply a fix for the controversial raster issue (LW/USB bottleneck)?
It might make it better, I have not tried pushing the system to see how fast it goes, currently I am running the serial port at 250k with no problems on a direct USB<-serial->Smoothie UART and will wire back in the ESP8266 and see how fast I can get that to run, then I can try upping the speed, if memory serves me correctly, I think the ESP8266 can handle 921600 baud and the UART on Smoothie up to 3Mb/s
The EPS8266 might go faster than 921600 I would need to check
Gooood Work!
Thanks, Still a lot more lets-try-and-break-it testing to do, but it seems stable.
Scenario: (1) LW latest version running on OSX or Linux (Ubuntu) (2) Emblaser 2 using either ESP8266 websocket, or telnet (same situation for both connection types) - Note I have implemented telnet for ESP8266. (3) Connects correctly using either ESP8266 or telnet (4) Job starts and runs for a random length of time before stopping, sometimes runs to a successful completion (5) Reviewing the serial communications over the ESP to LW I observe that Smoothie goes from <Run to <Idle which according to Smoothie's firmware, means it has emptied its buffer and has not received any more data from LW
Thoughts?