LaserWeb / LaserWeb4

Collaborative effort on the next version of LaserWeb / CNCWeb
GNU Affero General Public License v3.0
706 stars 191 forks source link

lw.comm-server stops send data over a websocket / telnet session part way through a job #422

Closed DouglasPearless closed 6 years ago

DouglasPearless commented 7 years ago

Scenario: (1) LW latest version running on OSX or Linux (Ubuntu) (2) Emblaser 2 using either ESP8266 websocket, or telnet (same situation for both connection types) - Note I have implemented telnet for ESP8266. (3) Connects correctly using either ESP8266 or telnet (4) Job starts and runs for a random length of time before stopping, sometimes runs to a successful completion (5) Reviewing the serial communications over the ESP to LW I observe that Smoothie goes from <Run to <Idle which according to Smoothie's firmware, means it has emptied its buffer and has not received any more data from LW

Thoughts?

cprezzi commented 7 years ago

ESP8266 and Telnet communication are not yet fully tested. They are in a beta stadium and will be officially released in a future version.

What you describe sounds like a communication stabillity problem. The gcode file is buffered in lw.comm-server and sent line by line to the machine. The server first sends a few lines to fill the planner queue of the firmware and then waits for "ok" to send the next line (and so on). If not all ok's are received, the server gets out of sync and stops working when the planner queue gets empty. When that happens, you should be able to pause and resume to restart the command streaming. Can you check that?

DouglasPearless commented 7 years ago

Hi Claudio,

Yes, sometimes the pause-resume works, but most often it does not :-(

Is there anything I can do to help debug this?

Cheers Douglas

On 6/10/2017, at 10:41 PM, Claudio Prezzi notifications@github.com wrote:

ESP8266 and Telnet communication are not yet fully tested. They are in a beta stadium and will be officially released in a future version.

What you describe sounds like a communication stabillity problem. The gcode file is buffered in lw.comm-server and sent line by line to the machine. The server first sends a few lines to fill the planner queue of the firmware and then waits for "ok" to send the next line (and so on). If not all ok's are received, the server gets out of sync and stops working when the planner queue gets empty. When that happens, you should be able to pause and resume to restart the command streaming. Can you check that?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/LaserWeb/LaserWeb4/issues/422#issuecomment-334708547, or mute the thread https://github.com/notifications/unsubscribe-auth/ADkh-0EDHqbUpY3EiF66xiqnCLi-FwbLks5spfXTgaJpZM4Pv4jG.

cprezzi commented 7 years ago

It probably would help if you could send me a log-file of the communication.

DouglasPearless commented 7 years ago

Hi Claudio,

In reading the lw-comms.server source code I noticed what appeared to be various timeouts so I went:

cd /Applications/LaserWeb.app/Contents/MacOS
export LOG_LEVEL=3
export SMOOTHIE_WAIT_TIME=5
./LaserWeb

and I was able to run a 5 minute job for the first time end-to-end. :-)

I then tried a very large 50mb file and it ran for quite a while then stopped, I could not pause-resume to make it continue and I did notice this in the logfile.txt:

2017-10-06 10:53:39 Sent: X208.00 S0.0967 Q: 3120068
2017-10-06 10:53:39 Telnet: X208.00 S0.0967
2017-10-06 10:53:39 Telnet: 
2017-10-06 10:53:39 Telnet: ok
2017-10-06 10:53:39 Sent: X208.10 S0.0810 Q: 3120068
2017-10-06 10:53:39 Telnet: X208.10 S0.0810
2017-10-06 10:53:39 Telnet: 
2017-10-06 10:53:39 Telnet: get statuos
2017-10-06 10:53:39 Telnet: k
2017-10-06 10:53:39 Telnet: <Run,MPos:207.2342,56.2275,10.0000,WPos:207.2342,56.2275,10.0000>
2017-10-06 10:53:39 Telnet: 
2017-10-06 10:53:40 Telnet: get status
2017-10-06 10:53:40 Telnet: 
2017-10-06 10:53:40 Telnet: <Idle,MPos:208.1000,56.2500,10.0000,WPos:208.1000,56.2500,10.0000>
2017-10-06 10:53:40 Telnet: 
2017-10-06 10:53:40 Telnet: get status
2017-10-06 10:53:40 Telnet: 
2017-10-06 10:53:40 Telnet: <Idle,MPos:208.1000,56.2500,10.0000,WPos:208.1000,56.2500,10.0000>

of particular interest is:

2017-10-06 10:53:39 Telnet: get statuos
2017-10-06 10:53:39 Telnet: k

Thoughts?

I can send the whole logfile.txt if you want :-)

DouglasPearless commented 7 years ago

I just re-ran the large job and the same thing:

2017-10-06 11:10:05 Telnet: ok
2017-10-06 11:10:05 Sent: X288.30 S0.5647 Q: 3120068
2017-10-06 11:10:05 Telnet: X288.30 S0.5647
2017-10-06 11:10:05 Telnet: 
2017-10-06 11:10:05 Telnet: get statutsk
2017-10-06 11:10:05 Telnet: 
2017-10-06 11:10:05 Telnet: <Run,MPos:287.8671,50.0313,10.0000,WPos:287.8671,50.0313,10.0000>
2017-10-06 11:10:05 Telnet: 
2017-10-06 11:10:05 Telnet: get status
2017-10-06 11:10:05 Telnet: 
2017-10-06 11:10:05 Telnet: <Idle,MPos:288.3000,50.0500,10.0000,WPos:288.3000,50.0500,10.0000>
2017-10-06 11:10:05 Telnet: 
2017-10-06 11:10:05 Telnet: get status

If looks like (perhaps) there needs to be some of retry X times if it does not get an 'ok' ?

cprezzi commented 7 years ago

The first log looks like the o was received before the last s of "status". In the second case the LF after status and the o was completely missed. This shows that there is a general communication problem that could also affect the sent gcode. I'm not even sure if telnet is a good choice for such a file streaming.

cprezzi commented 7 years ago

Did you try the normal ESP8266 communication via websockets?

cprezzi commented 7 years ago

The ok is absolutely needed with serial (RX/TX over ESP8266) as this is the only flow control.

cprezzi commented 7 years ago

USB communication has additional hardware flow control that avoids buffer overflow.

DouglasPearless commented 7 years ago

OK, same problem occurs over WebSockets, could a relatively simple solution (without knowing the source code :-) ) be something like this pseudo code:

while running
  if (return_code_from_smoothie <> "ok")  //note it is likely that the command sent was executed but the reply lost
    {
        if ((status == "<Idle") ||(status == "<Run"))  //note this is just the first few characters of the string returned
             {
                    return_code_from_smoothie = "ok";   //clean up the lack of flow control
              }
    }
cprezzi commented 7 years ago

Nope, this will not work. You will get out of sync and just overflow the firmware buffer. The status report (run/idle/position) is queried by a timer independant of how fast the gcodes can be executed. So you could get multiple status reports while waiting for one gcode to finish. If you are sending new gcode lines on each idle or run status, you will overflow the firmware buffer.

It doesn't make sense to tinker. You need to solve the communication problem (scrambled / lost digits). I belief this is caused by the ESP8266 code, because it does no flow controll at all.

Do you gather received characters until you get a \n and then send the hole line in one, or just path through every received character?

cprezzi commented 7 years ago

By the way: You can connect a Raspberry Pi to the USB port of the machine, execute the lw.comm-server on the Pi and run the frontend on any network pc. This works stable.

jorgerobles commented 6 years ago

It's a ESP8266 shortage issue. Closing.

DouglasPearless commented 6 years ago

I would like to re-open this issue as I have found a workable solution.

In summary, if for what ever reason, the communication is corrupted or lost back from the Smoothie over WiFi for either Websockets OR telnet, server.js has no mechanism to recover from this situation, it always assumes that it will receive an 'ok' and stops transmitting if it does not. Sometimes a pause-resume would fix the situation, but often it did not.

After working on this for some three weeks on and off I have a solution that allows server.js to recover from incomplete or lost transmission replies back from Smoothie.

This has been tested with a >50MB file over telnet and WiFi which took more than 5 hours to run and previously never got more than 5% of the job without a communications failure.

I propose to submit a pull request with this new code for review and hopefully including into LaserWeb.

Broadly speaking, here is the key logic changes to server.js which are the last two if' statements below:


                    var data = '';
                    while (responseArray.length > 0) {
                        data = responseArray.shift();
                        writeLog('Telnet: ' + data, 3);
                        if (data.indexOf('ok') === 0) { // Got an OK so we are clear to send
                            if (firmware === 'grbl') {
                                grblBufferSize.shift();
                            }
                            if (firmware === 'repetier' || firmware === 'marlinkimbra') {
                                reprapBufferSize++;
                            }
                            blocked = false;
                            send1Q();
                        } else if (data.indexOf('<') === 0) { // Got statusReport (Grbl & Smoothieware)
                            if (blocked === true && data.indexOf("<Run") === 0) { 
                            //if we are still blocked and we get "<Run" then we lost the 'ok' back from Smoothie and need to recover
                            //  as while Smoothie has stuff in it's buffer, we have stopped sending and have more to send
                                console.log("Fixing <Run =>"+data);
                                blocked = false;
                                send1Q();
                            }
                            if (blocked === true && data.indexOf("<Idle") === 0) {
                             //if we are still blocked and we get "<Idle" then we lost the 'ok' back from Smoothie and need to recover
                             // as Smoothie's internal buffer is now empty and we have more to send
                                console.log("Fixing <Idle =>"+data);
                                blocked = false;
                                send1Q();
                            }

Cheers Douglas

tbfleming commented 6 years ago
//if we are still blocked and we get "<Run" then we lost the 'ok' back from Smoothie and need to recover
//  as while Smoothie has stuff in it's buffer, we have stopped sending and have more to send

This is definitely not true. Please read up on sync vs. async messages.

DouglasPearless commented 6 years ago

ok, I have done a quick search of the wiki and cannot find that information, where is that documented? :-)

tbfleming commented 6 years ago

Quick search doesn't work well with smoothie docs unfortunately. Smoothie's protocol is similar to grbl's "simple" streaming protocol with differences primarily in the async commands. grbl's doc: https://github.com/gnea/grbl/wiki/Grbl-v1.1-Interface

DouglasPearless commented 6 years ago

Ah, you are referring to the Smoothie documentation; I work with, and develop with Smoothie firmware and am usually deep in the code with a JTAG adding new features :-)

If I understand what you are saying, you are referring to the kernel.cpp program, specifically:

// return a GRBL-like query string for serial ?
std::string Kernel::get_query_string()
{
    std::string str;
    bool homing;
    bool ok = PublicData::get_value(endstops_checksum, get_homing_status_checksum, 0, &homing);
    if(!ok) homing = false;
    bool running = false;

    str.append("<");
    if(halted) {
        str.append("Alarm");
    } else if(homing) {
        running = true;
        str.append("Home");
    } else if(feed_hold) {
        str.append("Hold");
    } else if(this->conveyor->is_idle()) {
        str.append("Idle");
    } else {
        running = true;
        str.append("Run");
    }

Is that correct?

ghost commented 6 years ago

@tbfleming Can you possibly be a bit less cryptic and offer Douglas some direct information rather than telling him to go away and rtfm? He is trying to help improve Laserweb and i'm sure you should appreciate any developer willing to do that?

DouglasPearless commented 6 years ago

@tbfleming I am rather familiar with a lot of the Smoothie source code and can see where and when the "Run" and 'Idle" responses are generated and why (due to a query request a '?' in GRBL parlance, or a "get status" request.

If Smoothie's conveyor (its internal processing of "blocks") has run out of things to process it returns "Idle" and "Run" if it is still running.

I was using this as the basis of my analysis of server.js and developed the solution above :-)

If I am missing something here (quite possible as only have a single coffee today), then let me know so I can re-work the solution :-)

Cheers Douglas

tbfleming commented 6 years ago

Here's how grbl has behaved since before smoothie forked from it. Did smoothie change this?

When this happens, the above patch will resume sending, overflowing buffers.

@tbfleming Can you possibly be a bit less cryptic and offer Douglas some direct information rather than telling him to go away and rtfm? He is trying to help improve Laserweb and i'm sure you should appreciate any developer willing to do that

I can't get any less cryptic than saying that won't work. Maybe I shouldn't say RTFM with smoothie since there's not much M to R, other than for end users.

tbfleming commented 6 years ago

Repeating what @cprezzi said, which still stands:

Nope, this will not work. You will get out of sync and just overflow the firmware buffer. The status report (run/idle/position) is queried by a timer independant of how fast the gcodes can be executed. So you could get multiple status reports while waiting for one gcode to finish. If you are sending new gcode lines on each idle or run status, you will overflow the firmware buffer.

DouglasPearless commented 6 years ago

@tbfleming now I better understand what you are telling me, especially around the buffer overflow and from here I will go back and review how Smoothie is handling its flow control and then report back here, and as required, update my proposed solution.

tbfleming commented 6 years ago

Here's a specific example where it will go wrong, even if Smoothie's ? handling is not nearly as async as grbl's is:

send: ?G1X100
recv: <Run...>
recv: OK

The first recv will throw things off with that patch.

DouglasPearless commented 6 years ago

Thanks for the example, it helps clarify my understanding.

In terms of all the use-case that I can find in reading the source code for Smoothie, below are the ones where an 'ok' is sent back:

Reasons to reply “ok”, in response to: (0) Initial connection to Smoothie (1) Empty string received (2) M110 (Set Current Line Number) (3) M999 when Smoothie is HALTED (4) G53 code (both a valid code and an invalid code) (5) M28 code (Start an upload) (6) M29 (both if the upload completed successfully or writing the file failed) (7) M112 (Emergency stop) (8) M117 (send arbitrary text to the UI Panel (9) M500 (save volatile settings to config-override) (10) M502 (deletes config-override so everything defaults to what is in config) (11) M504 (save to specific config override file) (12) M1000 (pass through lower case commands) (13) A comment (line starts with a ‘;’) (14) A valid G or M code (15) I am not 100% clear on the GRBL responses; still researching that.

I am currently reviewing how the buffer full situation is handled and will post that shortly so I can unsure I am catering for all (known) use-cases.

Cheers Douglas

tbfleming commented 6 years ago

Don't forget to account for ESP8266 buffer overflow. It's probably what's setting off this whole mess.

DouglasPearless commented 6 years ago

I agree; the Arduino websocket sketch has an internal buffer of 1460 characters; the esp-link code i am using for telnet has a 1460 receive and a 1460 transmit buffer and in Smoothie, it has a ring-buffer appears to allow up to 255 characters per item in the ring buffer (unless I have mis-read the source code). The fundamental issues are: (1) What to do if a transmission is corrupted or lost (2) How to reliably handle when the buffer is full is handled as well.

tbfleming commented 6 years ago

Related to (1): how to reliably detect if a transmission is corrupted or lost and which direction it was lost in. Consider what happens when a G0 or G1 is lost while in G91 mode.

DouglasPearless commented 6 years ago

Hmm, a good point and without proper handshaking, guaranteeing the delivery and confirming receipt it is rather difficult. :-)

The (probably) best way is to is add a sequence number and a CRC to each transmission and track the "lines" sent, that would require changes to Smoothie and server.js which at the moment I do not have time to properly implement at the moment; I think the interim solution I have proposed is (IMHO) a step in the right direction as it seems to work well as rather than simply lose a job, it tries to keep processing; thoughts?

tbfleming commented 6 years ago

I don't believe it's a step in the right direction since it overruns buffers.

cprezzi commented 6 years ago

Just to remind: If we would have a "handshaking" problem, we sould see that also with USB serial connection, which we don't!

The problem you see (missing/scrambled chars) is probably caused by the ESP8266 code and has to be solved there. Can you provide the code you are using on the ESP8266?

cprezzi commented 6 years ago

@DouglasPearless Adding CRC and/or sequence numbers would degrade the data to overhead ratio, which results in lower reachable feeds (escpecially on raster engraving). Not very welcome!

DouglasPearless commented 6 years ago

@cprezzi actually the USB protocol has quite extensive error control and recovery built in :-)

Considerable error checking and error handling features have been built in to the USB to ensure that it is a reliable method of connecting peripherals to a PC. Data integrity should be comparable to that of an internal expansion bus.

Immunity from data corruption by noise and spikes has been provided by the use of differential logic drivers and shielded cabling. When errors do occur, cyclic redundancy checks (CRCs) performed separately on both the control and data fields of packets will enable 100 per cent recovery of both single and double bit errors. Unrecoverable errors can be detected with a high degree of confidence.

A self-recovery mechanism is built into the messaging protocol, with time-outs for lost and invalid packets. Some error recovery is built into the hardware. The host controller will retry a failed transaction three times before reporting an error to the client software. How a reported error is dealt with is the responsibility of the client software.

Interrupt and bulk data transfers conclude with a handshake packet to provide confirmation that the data was received, or request that it be re-sent if it was not. Delivery of this data is therefore guaranteed, even if the time taken to deliver it is not.

There is nothing like that for what is effectively a UART serial connection, and I believe is an issue as any WiFi dropout will effectively corrupt the transmission, whether a buffer issue (e.g. ESP8266) or an Access Point issue, WiFi channel congestion, etc etc.

So we have a very nice reliable USB connection available, but the serial LW connection, irrespective of the protocol (telnet, web socket, etc) is not providing the same level of robustness our users would expect; that is the issue I want to solve.

More than happy to explore further and totally agree about not want to degrade the throughput so need a smart solution for this (long term).

I guess that beings me full circle in the logic of my response, specifically that I believe need to plan and introduce a (perhaps staged) approach to making LW as reliable over its serial (WiFi) connection as it is over the USB connection;

Thoughts?

Cheers Douglas

DouglasPearless commented 6 years ago

@tbfleming Do you mean the ESP-Link buffers over-running?

If so, my thoughts are (and correct me if I have got this wrong :-) ):

(1) LW (server.js) uses "ping-pong" communication where and 'ok' is received back after each transmission

(2) The assumption that I am using is that there is only ever a very small number of 'conversations' active, typically a single one due to (1), i.e. server.js is waiting for the response from the request it made

(3) Therefore, as we know we sent a command, and without an underlying guaranteed delivery / acknowledgement mechanism, it order to try and continue processing (as apposed to simply failing), if we do not get an 'ok', I am examine the response and try to recover the situation. I have been through the Smoothie source code and believe the least-worst way is as discussed above:

(a) If "<Run" and we are blocked, then I have interpreted that as the Smoothie still has blocks in its buffer (up to 32 of them) to process and the 'ok' was lost and to send the next command, BUT let the rest of the server.js logic process the "<Run" as usual as it appears to use that to update the GUI?

(b) If "<Idle" the Smoothie has an empty buffer and process as per (a). This at least allows the job to continue and my testing to date indicates that this works, BUT I have not tested it on an actual device (Emblaser 2) yet.

I recognise that the G91 relative move is an issue, but that issue exists today if the original transmissions corrupt in a way that delivers a valid, but wrong message to the laser cutter (for example we sent go to X123 and X132 was received), and I note that the probability of this is not necessarily that high :-)

DouglasPearless commented 6 years ago

@cprezzi The code I am using is from here https://github.com/jeelabs/esp-link :-)

tbfleming commented 6 years ago

I agree we should only support robust communication paths. USB works. LW to comm server works over both Ethernet and wifi. ESP8266 Is a failed experiment; time to remove support.

DouglasPearless commented 6 years ago

@tbfleming I suggest that rather than removing ESP support we fix it (which is what I am working on) :-)

tbfleming commented 6 years ago

TCP-over-wifi is reliable. That means the problem lives entirely on the ESP side, so the fix must be entirely on the ESP side and shouldn't need any changes to lw-commserver. I wish I noticed this earlier; it would have cut this conversation a lot shorter.

tbfleming commented 6 years ago

2 likely causes:

  1. The code running on the ESP is buggy. Look hard at it because it's your best option.
  2. Its wifi silicon has a major bug. e.g. of a major bug: not including wifi's ECC coding to save transistors. There's no way we should complicate lw-commserver to accommodate that shortcoming, if it's present.
DouglasPearless commented 6 years ago

@tbfleming thanks for the pointers

  1. I will look into the esp-link code :-)
  2. Where can I find more details of this? :-)
tbfleming commented 6 years ago

2: The ESP's data sheets and errata if they ever documented that chip? Ask the manufacturer?

cprezzi commented 6 years ago

I found a long discussion about problems with the esp-link transparent TCP-UART communication at https://github.com/jeelabs/esp-link/issues/46.

I belief that ESP-Link is just too big and complex for this simple task and therefore has not enough ressources left for the buffers. I would suggest to try simpler code like our old LW3 ESP8266 Serial Bridge (https://github.com/LaserWeb/LaserWeb3-ESP8266), but i'm not sure if this will be any better.

DouglasPearless commented 6 years ago

@cprezzi thanks for the pointer, very interesting reading; I will digest the thread and think about the approach.

DouglasPearless commented 6 years ago

@cprezzi I have spent a large amount of time trying to resolve this and have found the problem was deep in the Smoothie code, and was in part the way it interfaced with the underlying MBED libraries to the UART and interrupts and a whole lot more. Currently testing indicates that I have solved the issue (famous last words). The problem was not the ESP8266 nor LaserWeb, but I have moved from WebSockets to Telnet as I found WebSockets just too slow and have been able to run 5Mb and 10Mb jobs through to SmoothieV1 with no issues.

jorgerobles commented 6 years ago

🎉 🎉 A-W-E-S-O-M-E 🎉 🎉

jorgerobles commented 6 years ago

@DouglasPearless Could this imply a fix for the controversial raster issue (LW/USB bottleneck)?

DouglasPearless commented 6 years ago

It might make it better, I have not tried pushing the system to see how fast it goes, currently I am running the serial port at 250k with no problems on a direct USB<-serial->Smoothie UART and will wire back in the ESP8266 and see how fast I can get that to run, then I can try upping the speed, if memory serves me correctly, I think the ESP8266 can handle 921600 baud and the UART on Smoothie up to 3Mb/s

DouglasPearless commented 6 years ago

The EPS8266 might go faster than 921600 I would need to check

jorgerobles commented 6 years ago

Gooood Work!

DouglasPearless commented 6 years ago

Thanks, Still a lot more lets-try-and-break-it testing to do, but it seems stable.