ebb_motion.QueryPRGButton(self.serialPort) can fail and cause plot to fail

ShelMi commented 8 years ago

This is not related to recent changes regarding skip-pen-up-down, because it also occurs in v. 2.7.4 of eggbot.py. While attempting to debug the issue, I downloaded v. 2.7.5 so further debugging was/is with this version.
May well be timing related, but it IS reproducible ( perfectly, without fail in several dozens of trials) here...at the 2950th node count of layer 2 of the attached .svg file. Accessed thru the Layers tab of Eggbot Control.

A debugging statement inserted just before the python error issues at the last line in this snippet shows that strButton is in fact of zero length, so Python is correct in its error statement: "IndexError: string index out of range"

   strButton = ebb_motion.QueryPRGButton(self.serialPort)  #Query if button pressed
   if len(strButton) == 0:
       inkex.errormsg( 'NULL strButton' )

   if strButton[0] == '1': #button pressed [ERROR HERE WHEN strButton IS NULL]

cottage06 before stretch after anti-bulge+stretch.zip

oskay commented 8 years ago

This sounds like a possible new manifestation of a thing that we ran into earlier: http://forum.evilmadscientist.com/discussion/comment/1745

And... which I thought we had fixed as of rev 273b2 (Jan 31). If this is indeed a timing issue, then you should be able to alter it by changing the plotting speed.

It looks as though just checking to make sure that the array is nonzero length before getting element 0 would fix it?

ShelMi commented 8 years ago

I normally run with speeds when pen down = 300, when pen up = 400. I've just now tried doubling the speeds when pen is up and pen is down - no errors. Tried halving the speeds, also no errors. Back to original speeds - yes, the same old error. I've put the following code in place for my own protection, and it seems to work - i.e. it notifies me of the problem (which resolves with only a single retry required), but does not abort the plot:

[Dang, I can't seem to get the "insert code" function here to work right - I must be tired.]

` strButton = "" nLoopCnt = 0 while ( strButton == "" ) and ( nLoopCnt < 100 ): # Arbitrary number of retries nLoopCnt = nLoopCnt + 1 strButton = ebb_motion.QueryPRGButton(self.serialPort) # Query if button pressed

        if nLoopCnt > 1:
            if strButton == "":
                # Can't get a response from EBB, so
                # attempt to shut down in a way which allows user to continue.
                inkex.errormsg( 'Plot paused by communication error after node number ' + str( self.nodeCount ) + '.' )
                strButton = '1'         # simulate pushed button for pause
            else:
                # let the user/debugger at least know there was a problem,
                # though it appears we worked around it
                inkex.errormsg( 'FYI, no action required: loop count = %i, node count = %i' % ( nLoopCnt, self.nodeCount ))

        if strButton[0] == '1': #button pressed
            self.svgNodeCount = self.nodeCount;
            if nLoopCnt <= 1:
                inkex.errormsg( 'Plot paused by button press after node number ' + str( self.nodeCount ) + '.' )

            inkex.errormsg( 'Use the "resume" feature to continue.' )
            #self.penUp()  # Should be redundant...
            self.engraverOff()
            self.bStopped = True
            return

`

oskay commented 8 years ago

Unfortunately, I can't seem to reproduce it here. I've tried plotting from the same version, same file, same layer, same speeds... and no trouble found.

I'm not sure what else you might have different in your settings that could affect this. Possibly pen up/down positions and/or speeds, curve smoothing value, other settings? I'd suggest simplifying your file until you can isolate exactly the part that is causing the error (probably can be reduced to one or two paths only) for faster testing, and seeing which other variables have an effect.

oskay commented 8 years ago

I've tried some additional tests-- 2000 lines of different lengths and 2000 different spacings between them... still not yet able to actually cause a communication error there. Obviously you've found a very special case. :)

ShelMi commented 8 years ago

I've beat on this all day...no joy! 8^)

In summary, I find:

There are four paths each of which must be present for the failure to occur.
The only speed value which affects the failure is speed when pen is up, and it fails over a suspiciously specific range: 201 through 500 steps/sec inclusive, exactly and consistently.

Imbedded in the attached minimum-required-to-fail .svg are various comments I inserted over the course of my tests. I do believe this may be destined to be a mystery of the universe.

Osmosis as applied to the operations of a million monkeys being what it is, I think it may be best to guard against the dire results of this error - even though it's unrecreatable in California. The re-read I tried always succeeded, with just a single additional read.

Incidentally, my OS is Win 7 64.

test based on cottage06 before stretch after anti-bulge+stretch.zip

ShelMi commented 8 years ago

Maybe it doesn't take a million monkeys - same thing happened in an additional two different layers in that same file. I have not yet tested the repeatability of these two additional occurrences.

oskay commented 8 years ago

Hmmm.... Can you please tell me the additional parameters that you're using (pen up/down positions, speeds, delays, options etc) that might help me to reproduce this more exactly?

It sounds like this is a genuine communications issue, perhaps due to USB buffering during moves of specific lengths. Maybe try playing with the sleep time values, the 50 in the statements below

if (td > 50):
    time.sleep(float(td - 50)/1000.0)  #pause before issuing next command

Perhaps, try changing this to something like:

if (td > 15):
    if self.options.tab != '"manual"':
        time.sleep(float(td - 10)/1000.0)  #pause before issuing next command

(This is adapted from the AxiDraw software-- you will need to tweak the spacing, and so forth. Addition of the "manual" statement makes it more responsive during manual moves.)

ShelMi commented 8 years ago

Sorry for the delay in response, but this whole thing is getting more mysterious on at least a couple of levels. All I can do is tell you what I've done and the results of same, but there are "mysterious mysteries". 8^(

(1) Downloaded from github fresh versions of eggbot.py, eggbot.inx, ebb_motion.py, ebb_serial.py, plot_utils.py

(2) Tried your change, here is a copy of the code as I typed it in (sorry for the lack of spacing/tabbing - I haven't yet got the hang of inserting code here):

### if (td > 50): ### time.sleep(float(td - 50)/1000.0) #pause before issuing next command if (td > 15): time.sleep(float(td - 10)/1000.0) #pause before issuing next command Result: no change, still receive the villainous null string.

(3) Went extreme on the timing, tried this code:

### if (td > 50): ### time.sleep(float(td - 50)/1000.0) #pause before issuing next command time.sleep(float(td + 100)/1000.0) # looong pause before issuing next command

Result: no change, still receive the villainous null string.

(4) Placed inkex.errormsg statement in ebb_serial.py to print the value of the cmd string when the null gets returned. Python errored out, as there was no "import inkex" line in ebb_serial.py. Also placed an if statement to skip over null string error condition back in eggbot.py so as to not abort when the null comes in.

strButton = ebb_motion.QueryPRGButton(self.serialPort) #Query if button pressed if strButton != "": ### if strButton[0] == '1': #button pressed

(5) Add "import inkex" to head of ebb_serial.py. Now the errormsg gets logged ( command was 'QB'), BUT NOW a strange thing happens (see next item).

(6) Suddenly there are a bunch of new inkex.errormsg lines besides the command log I inserted. What's going on I says to meself! So I remove my command logging line, but leave in the 'import inkex' statement. What the hey??!! We now get a bunch of error outputs, which we had never seen before. Why didn't Python complain about the lack of 'import inkex' before?? I dunno. Nonetheless, we now get the following kind of stuff in our output log:

After command SM,3,0,-1 , Received bad response from EBB: 0 . After command SM,5,1,-1 , Received bad response from EBB: 0 . After command SM,7,2,-1 , Received bad response from EBB: 0 . After command SM,7,2,1 , Received bad response from EBB: 0 . After command SM,5,1,1 , Received bad response from EBB: 0 . After command SM,3,0,1 , Received bad response from EBB: 0 . After command SM,171,0,57 , Received bad response from EBB: 0 . After command SP,1,200 , Received bad response from EBB: 0 . After command SM,10,0,0 , Received bad response from EBB: 0 .

Where were these errors before? The only changes I've made are 1) add 'import inkex' to ebb_serial.py, and 2) skip the python error back in eggbot.py

(7) In addition, we note that these "bad response" errors occur only after the null response. We know this, because if we re-enable the python error in eggbot.py and therefore cease execution at that point, we get none of these "bad response" complaints. It's somehow as though a state machine has gotten derailed or something.

(8) I fully appreciate that you may be dubious about the quality of my reporting, as I've been in a similar situation to yours - with an unknown quality of reporter on the other end of the line. I've generally been an accurate observer in my professional life, but this particular programming environment is new to me. So...by all means take what I say with a grain of salt, but know that I do have experience and am doing my best to be an accurate observer.

(9) Oh, the info you requested about timing, etc. is in the attached .zip png If there's more info you want, lemme know. 8^) Eggbot timing tab.zip

oskay commented 8 years ago

I can't actually run the latest version of the file that you sent -- it gives me errors from your comments: lxml.etree.XMLSyntaxError: Double hyphen within comment: <!-- Speed when pen is up fails in the range 201-500 i, line 63, column 3

(Yes, it may be a good idea to add the "inkex" call there directly-- although FWIW, the "import inkex" line is actually redundant in eggbot.py.)

It looks as though something that you have done (or, something that my code has done, really!) has severely fouled up communication, and that initial "button press" error that you get is the symptom of it. It is likely a matter of overfilling the USB buffer, which is only one command deep. The sleep command is meant to introduce a delay in order to reduce the likelihood of buffer overflows, during known-long motions.

I have still not been able to reproduce the error here. The timing tab values are a clue, but there are other factors as well, including the pen up/down positions and curve smoothing that potentially affect these kinds of things. You could try playing with those values as well, to see if there is any kind of change.

ShelMi commented 8 years ago

I can't actually run the latest version of the file that you sent

Drat! I should have known better that to attempt to even insert a comment, without trying it first! 8^(

I have reduced the svg file to the barest minimum that still causes the error here. It consists of only four paths, comprising 14 nodes. Attached in zip (and certified with the usual money-back guarantee). 8^)

Two log files are also in the zip:

(1) "upspd400 dnspd300.txt" is a log of all commands and queries sent from ebb_serial.py. Speed when pen up = 400, when pen down=300. I have highlighted the many communication errors, most of which were not caught by the extant software.

(2) "upspd800 dnspd300.txt" is a log with the only change being that speed when pen up = 800. Not a misprint, I really do mean twice the pen-up speed of the first error-ridden log. Note there were no communication failures.

I would have guessed that if this were a matter of buffer overflow that the demands of the higher speed might have made the problem worse, rather than fixing it. Stranger things have happened, but I have to wonder...

Oh, I might also mention that the problem occurs both on direct connection to a USB 3 port, and connection to a USB 2 extender.

Anyway, if you tell me you consider there is no need to further investigate, I will cease worrying about it. Probably. 8^) svg and log files.zip

oskay commented 8 years ago

Would you please try again without the USB extender?

ShelMi commented 8 years ago

Repeated the non-extender USB 2 configuration. Results were exactly the same. 8^( Also tried doubling the pause-after-pen-up, and pause-after-pen-down. No joy. Tried increasing pen up position as high as 90. No joy.

BUT, I have happy-ish news too:

I expanded my error logging to make it easier to navigate, and finally noticed that the problems were all happening after a somewhat lengthy move - at line 65 in attached logs.

The first move after the lengthy move was responded to with a null (i.e. non-"OK") response. That's despite the fact that two intervening commands were responded to properly after the lengthy move(!) lines 68-73:

query:QB len(resp): 3, resp: 0 len(unused): 4, resp: OK

cmd: SP,0,400 len(resp): 4, resp: OK

After that the errors keep piling up. See line 76 in the zipped "sleep as usual.txt". Note that after this bogus response all subsequent QB queries have slipped, so the "OK" comes first, followed by the "0".

The attached log files show results from multiplying the sleep times in eggbot.py by different factors. (The factors are decipherable from the filenames.) Note I did not increase the expected travel time sent to the EBB.

The code which generated these different log files looks like this (the triple hash marks are just my mechanism for finding lines I've changed):

self.svgTotalDeltaX += xd self.svgTotalDeltaY += yd ebb_motion.doXYMove( self.serialPort, xd2, yd2, td ) ### td = td * 2 ### ### td = td + td / 8 ### ### td = td + td / 64 ### ### td = td + td / 32 ### ### td = td + td / 16 ### td = td + td / 4 ### if (td > 50): time.sleep(float(td - 50)/1000.0) #pause before issuing next command ` nDeltaX -= xd nDeltaY -= yd nTime -= td`

Conclusions:

On my machine, sleep time should be increased by at least 1/8.
While debugging eggbot.py, additional sanity checks on EBB responses may be useful.
ebb_serial.py might benefit by having an "import inkex" statement in it; otherwise any program attempts at flagging errors simply go into hyperspace.

Perhaps my machine - and by extension other machines - has a sleep mechanism which is inaccurate - though I have no special reason for suggesting this. My rig: Win 7, 64 bit, speedy 4-core, 12G RAM. Let me know what you think? 8^) sleep x various factors.zip

oskay commented 8 years ago

Okay, interesting! We have had various reports over the years from people with the "false button press" issue, and most of them have been traced ultimately to using overly long USB cables.

It sounds like the issue was indeed one of buffering-- as though it can handle one line of communication while managing a move but not two. (The purpose of the sleep is to avoid piling up commands while it is doing a long move.) Is is really the case that the sleep time should be increased by a fraction, rather than an additional constant offset?

ShelMi commented 8 years ago

Is is really the case that the sleep time should be increased by a fraction, rather than an additional constant offset?

I need to take more data to inform my judgment. I hope to have a better answer tomorrow.

ShelMi commented 8 years ago

[EDIT: A really quick check (a single try, as I'm headed out the door for a while), shows errors disappear after addressing issue (1) below. Good, though preliminary, news!! 8^) ]

In the process of taking data, I ran up against two anomalies regarding the delay time calculation. The anomalies are independent of each other, and one of them is data-dependent.

I don't yet understand how this could be causing the error we're trying to track down. I would have thought that even in the case of error, the EBB and eggbot.py would have the same error - and thus this would not cause a problem (aside from violating user's steps/sec. instructions). You would be in a much better position to judge.

If you agree the time calculation needs to be fixed, I would like to wait 'til it's fixed before I continue taking time data.

Here are the anomalies:

There is a line in eggbot.py in plotLineAndTime() as follows:

nTime = int( math.ceil( 1000 / self.fSpeed * plot_utils.distance( nDeltaX, nDeltaY ) ) ) My experiments in inkscape/python show that in this expression the term "1000 / self.fSpeed" is treated as an integer division, leading in general to times which are too short. In my case, 400-up-speed and 300-down-speed, this gives time-shortening errors on the order of 25-30%. This could likely be the cause of the "suspiciously specific range" causing failure that was mentioned above.
The same line contains the assumption of Euclidean distance as the measure to be used in determining the time. If I understand EBB correctly then the distance measure should actually be simply the max( abs(dx), abs(dy) ). In other words, the time to get from point 0,0 to 100,100 is actually only 100 / rate, rather than the 141 / rate that Euclid calculates. This would lead to data-dependent increases in time calculation of up to about 40%. [EDIT: Oh, now I see, I think...The intent is to keep the physical pen speed constant, rather than the number of steps per second constant, right?]

Does this make sense?? 8^)

ShelMi commented 8 years ago

Thoughts after a further day of testing:

(1) The divide mentioned in (1) above needs to be changed to a float. When it is so changed, raw error rate drops by about 80%. This does increase time required to complete a plot, but only because the previous method artificially and arbitrarily shortened the times.

(2) The divide fix does not, however, remove all communication errors. Nor does adding a reasonable constant to the sleep time. Which leads me to (3).

(3) The good news is that an ad hoc fix works really well: At each query and command where a response is received from EBB in ebb_serial.py, any time a null response comes in just throw it on the floor. I've tried it with some files that had a horrendous amount of errors (visible only when import inkex in ebb_serial.py, which is perhaps why they haven't raised their ugly snouts before.). In all tests, including even the case where the divide problem was still in place, the error rate dropped to 0! Every time a retry was required, the next read got the missing data just fine. And the horrible cascade of errors following these instances was completely gone.

(4) The distance calculation previously mentioned in my previous post as (2), as expected, does not affect the error rate. Thus the distance measure can be chosen to either have a specified pen travel rate (the currently used method), or a constant stepper motor rate - with no impact seen on error rate.

oskay commented 8 years ago

Can you please clarify a little bit what your ad hoc fix is?

ShelMi commented 8 years ago

It's only what I said in the next sentence. Throw away any null responses as though they never happened and just perform the read again. Doesn't take care of whatever the underlying problem is, but sure makes a big difference in the error rate.

In detail, in ebb_serial.py, in the command procedure, replace response = comPort.readline()

with

                response = comPort.readline()
                nRetryCount = 0
                while ( len(response) == 0 ) and ( nRetryCount < 100 ):
                    # get new response to replace null response if necessary
                    response = comPort.readline()
                    nRetryCount += 1

and similarly in the query procedure. I've never seen nRetryCount go above 1.

[I hate that I can't preserve indentation - could you share the "trick" with me? 8^) ]

oskay commented 8 years ago

All considered... that does sound like a reasonable approach. I don't have time to implement this right this moment, but I've added the "bug" label and assigned it to myself. Let me circle back on this and try it out-- hopefully we can get a permanent fix in the codebase.

For the spacing issue, use the "code fencing" method: three backticks on a line before and after your code. (More here, at Examples> Code)

ShelMi commented 8 years ago

Good. And, just so it doesn't get lost in all the responses here, here's exactly where it's needed to get rid of that unwanted and unexpected integer conversion:

In eggbot.py, the existing line is:

nTime = int( math.ceil( 1000 / self.fSpeed * plot_utils.distance( nDeltaX, nDeltaY ) ) )

and it wants to change to:

nTime = int( math.ceil( 1000.0 / self.fSpeed * plot_utils.distance( nDeltaX, nDeltaY ) ) )

And, thanks for the code fencing tip! 8^)

oskay commented 8 years ago

This seems to work well. It does not cause any problems that I yet see, and does improve communications on a computer here that was having unexplained trouble. (So, thank you!)

I've added the fix in the codebase, although it is not yet reflected in the published releases. (Anyone else reading: you can download the latest versions from the eggbot and plotink repositories and copy them into your extensions folder, if you need this fix.)

I'm going to go ahead and close this issue for the moment, with a mental categorization of "triaged." If you happen upon a more elegant solution, let's open this up again.

ShelMi commented 8 years ago

I just took a look at ebb_serial.py in the repository, and this is what I see. Wouldn't we want to put a null-response guard at the first read, as well as at the second?

def query( comPort, cmd ):
    if (comPort is not None) and (cmd is not None):
        response = ''
        try:
            comPort.write( cmd )
            response = comPort.readline() **Shouldn't this get the nRetryCount treatment also?**
            unused_response = comPort.readline() #read in extra blank/OK line
            nRetryCount = 0
            while ( len(unused_response) == 0 ) and ( nRetryCount < 100 ):
                # get new response to replace null response if necessary
                unused_response = comPort.readline()
                nRetryCount += 1
        except:
            inkex.errormsg( gettext.gettext( "Error reading serial data." ) )
        return response
    else:
        return None

oskay commented 8 years ago

Possibly so... Hmm...

oskay commented 8 years ago

Better? :)

ShelMi commented 8 years ago

:smile:

oskay commented 8 years ago

This new query version is apparently causing the manual "Check EBB Version" command to fail.

oskay commented 8 years ago

~~Adding a slight pause seems to fix it. https://github.com/evil-mad/plotink/commit/5a0dc0abb5ac603f2622ce37475966ec3c3d0119~~

Edit: No, that was a total fluke. I regret this commit.

EmbeddedMan commented 8 years ago

Is this a change that needs to be made to the EBB firmware?

On Tue, Jun 28, 2016 at 1:51 PM, Windell Oskay notifications@github.com wrote:

This new query version is apparently causing the manual "Check EBB Version" command to fail.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/evil-mad/EggBot/issues/44#issuecomment-229146175, or mute the thread https://github.com/notifications/unsubscribe/AAbeCKJ7DAETvH-ha2iuRJEwPCAIOca2ks5qQW0zgaJpZM4Ib0CB .

oskay commented 8 years ago

No; firmware is A-OK here. :)

oskay commented 8 years ago

Although, @EmbeddedMan , if you could go back in time and make each and every query end with the same <CR>OK<CR>, that would be mighty convenient. Latest version here includes a dictionary of which queries don't follow the standard form. (["v","i","a", "mr","pi","qm"] — did I miss any?)

EmbeddedMan commented 8 years ago

I will check for you. It would be very simple to make sure that all commands end in OK for the 'final' release. Is that a change you'd like to see, or would it be more trouble for you at this point than it's worth?

*Brian

On Tue, Jun 28, 2016 at 9:27 PM, Windell Oskay notifications@github.com wrote:

Although, @EmbeddedMan https://github.com/EmbeddedMan , if you could go back in time and make each and every query end with the same OK, that would be mighty convenient. Latest version here includes a dictionary of which queries don't follow the standard form. (["v","i","a", "mr","pi","qm"] — did I miss any?)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/evil-mad/EggBot/issues/44#issuecomment-229239842, or mute the thread https://github.com/notifications/unsubscribe/AAbeCJ6ZjeXahBHUYPw68WSeArhAsRb6ks5qQdf3gaJpZM4Ib0CB .

oskay commented 8 years ago

I think that the dictionary method is probably sufficient. However, it would be more elegant to harmonize the commands.

It's a tricky question, because requiring that could "break" some of our software for machines out there with a lower firmware version. (OTOH, maybe we should try and build that easy installer for firmware upgrades...)

EmbeddedMan commented 8 years ago

We absolutely should try and build an easy updater.

In the meantime, it sounds like you don't want me to change the commands in your dictionary to have OK at the ends. Let me know if you do want this change.

As far as the commands that do not send back responses that end in

OK, I just went through the code and the only one you missed was "BL". *Brian On Wed, Jun 29, 2016 at 12:49 PM, Windell Oskay notifications@github.com wrote: > I think that the dictionary method is probably sufficient. However, it > would be more elegant. > > It's a tricky question, because requiring that could "break" some of our > software for machines out there with a lower firmware version. (OTOH, maybe > we should try and build that easy installer for firmware upgrades...) > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > https://github.com/evil-mad/EggBot/issues/44#issuecomment-229434409, or mute > the thread > https://github.com/notifications/unsubscribe/AAbeCFYKRJkBJnBh-DTHqFbxD5DyV9Ksks5qQrA0gaJpZM4Ib0CB > .

evil-mad / EggBot

ebb_motion.QueryPRGButton(self.serialPort) can fail and cause plot to fail #44