SOMETHING is rotten... in RoboPaint!

techninja commented 9 years ago

It's always on stage with demos that the real demons in the system come out. And it's always in demos that you can't figure out WHY these things are breaking. Here's a quick run down of experiences on stage using a fast brand new MacBook Pro:

Drawing simple drawing, node-webkit-helper running hot, at ~120-150% cpu usage.
Starts/Stops. Pauses for too long at water, before painting, or during a stroke.
- Might be the crazy polling from scratch? (Likely not)
- Fairly repeatable. Drawing either big or small, all along the progression of the job.
Drawing with the scratch API occasionally ends up with HUGE wait times, then incorrect path drawing. Total mess even. This likely has to do with incorrect initialization after doing a regular drawing.
- A restart of the program helps.
- We never seem to test the scratch API AFTER doing a painting in RP, this needs to be one of the tests.

These issues are likely a mix between CNCServer and RP. I'll do more testing to try and work out exactly WHAT they encompass, but really they're totally unacceptable and shouldn't be issues.

oskay commented 9 years ago

I've had some similar experiences in terms of awkward pauses and jerkiness, but always attributed it to my not having the buffer smoothing values not set well enough.

I have not seen the incorrect path drawing issue. That certainly merits the "critical" label!

The next thing on my plate here (still clearing book of plate...) is working through the scratch tests one more time prior to publicizing. Great to have caught this now.

techninja commented 9 years ago

We did end up giving a very good demo, it just failed miserably before I restart RP. This tied a regular drawing first with a followup Scratch drawing, and this discovered the issue. Very good to find it, but it's good to re-emphasize that we're in need of more and better testing.

techninja commented 9 years ago

Part 1 of the issue: "NW-helper running hot" appears to be a common and known issue:

Even leading the incredibly popular light-table project to move over to Atom! Ouch!! Apparently it's partly caused by simply having the application run on a system for a number of hours, which was exactly the situation we had on stage, getting everything ready hours before the presentation, then leaving it there and on.

developer_tools_-_file____applications_robopaint_app_contents_resources_app_nw_resources_main_html

CPU Profiles show the app while idle.. really is totally idle and it's not our code that's at fault (probably), it's theirs. If we can prove this happens reliably on a fresh program start, then I suspect we have real issues, till then, we're going to need to wait for the NW community to come up with a better answer, and it's likely we need to simply ensure our Mac users start with a fresh application launch before prints :disappointed:.

Looking into the rest now.

techninja commented 9 years ago

I just ran a slew of tests with the exact same scratch, RoboPaint, computer and hardware setup. RP drawing, then scratch drawing, cancelled, resumed, back and forth. No issues whatsoever. :angry:

The scratch "api" really is just a wrapper for the rest of the known good working systems, and very little can go wrong beyond the turtle reset position. I suspect the positional issues I encountered might have been an issue of the actual scratch blocks themselves. The fact that I can't seem to recreate the issue somewhat tilts me in that direction anyways.

oskay commented 9 years ago

Is there any plausible way to detect when it begins "running hot?"

techninja commented 9 years ago

A good question. I'm sure the answer is "probably some way to do it", of course the simplest method would be to do test to find out how different systems react over time, and possibly alert the user with a stupid message about suggesting a program restart. The weird thing is, according to the creator of nw.js on those linked issues above, the cause is actually somewhat unknown. It may very well be because of some code of mine (possibly around the queue checking loop) that's causing it. Sounds like lots more testing and adjustment.

Some general good news: I've refactored how path checking works, so for fast print speeds, we'll no longer run into buffer under-runs (sudden stops as it generates the rest of the tool path immediately ahead of itself), as the entire buffer should be built before printing starts. It turns out we end up making it take 10-20x longer by allowing work to happen between loop iterations. If we just let the program hang with a nice message like "Processing drawing, please wait...", 90% of the time that will only hang for possibly 5-7 seconds and all the tool path processing will be completely done, and all that's left is to print it.

oskay commented 9 years ago

Looping in https://github.com/techninja/cncserver/issues/60 as a related "blocking" issue before we declare the new APIs "good to go."

oskay commented 9 years ago

Have you had a chance to look into this at all?

techninja commented 9 years ago

I have, and have not. I've come up with some wide reaching solutions, but both it's been out of the abilities of my helpers, and I've been up to my eyeballs both catching up with life and trying to find a job, helping with Sylvia's second book, and helping with Tina's pneumonia she's had for over a month, only just now recovering.

I met with @rogerfachini & @docprofsky about a week ago and set their way forward for helping 0.9.6, and also discussed my general plan forward. Big improvements are coming to cncserver through its new strict branch and automated testing to clear out the bats in the belfry for scratch support. While touring through Australia we ran the scratch demo with very few of the same issues (usually with a fresh open of RoboPaint seems to help), but did encounter a few, um... unpredicted artistic additions:

Even though many students preferred the strange offsets it seemed to give itself, it'd be much better if these were actually programmed to do so. Looking at this it should give me a subtle clue as to what is actually going on. Notice the offsets are mathematically precise and aligned to the drawing, this is with the standard unmodified watercolorblocks Sierpinski gasket example.

oskay commented 9 years ago

Yes, that is beautiful. And yes, it's awful too. :(

Anything that I can do to help?

techninja commented 9 years ago

I'm going to continue forward with my plan for the wide/general fixes, but in the mean time unless I know exactly what the issue was, I won't have proof that it's actually gone.

If we can somehow piece together some kind of simplified "tests" that can nail this down from random known occurrence (where it is with me now), to something repeatable, dissecting this down to component parts to make it break in the same way given the same circumstances, I can then not only likely confirm the exact cause of the issue, but use these "tests" to confirm that it's been defeated & fully fixed.

We likely need to first rule out whether these issues occur solely after a RP standard auto-print, or if they can occur with a fresh launch of RP being used only to draw from scratch. Most of my simple testing has shown that a fresh launch has far fewer "state confusion" issues, but I also have this feeling that the example above was done from a fresh open of RP, but with little rhyme or reason as to why it output in the way that it did and not other times the same program was run :/

oskay commented 9 years ago

Can you say anything at this point about what known circumstances cause the problem? Is it, for example only with the scratch API, or does it happen at any time, no matter what? If I can reproduce it once, then I may be able to help narrow down the causes.

techninja commented 9 years ago

I believe the issue resides within the scratch API executor, or at least its assumed state/position about the "turtle". Though it is not a simple program, I'd recommend testing with the Sierpinski Gasket.sb2 example file with a pen, as it has been clearly proven to exhibit the issue, and can be modified to simplify it and hopefully help pinpoint the issue. I would test with the 0.9.5 RP, with both fresh run, and having "painted" something else through the interface (though as I've said, I suspect the issue to exist regardless of RP's "freshness").

oskay commented 9 years ago

Tried to set up to test this, but currently unable to get RoboPaint to run whatsoever...

oskay commented 9 years ago

Added a note on that last issue (#219).

Robopaint seems to indicate the cursor position while being "remotely controlled" -- it would be helpful both as a user and for debugging if it would "paint" the positions that it goes through there. Many CNC control programs do something like this, to show a non-saveable record of what the tool has done.

oskay commented 9 years ago

I have been able to reproduce the problem with the Sierpinski Gasket.sb2 example, and I've run a dozen or so related tests.

There seems to be no loss of CNCserver position information, just loss of turtle position.
The overall behavior feels as though certain commands are being "dropped" -- not received by CNCserver.
Question: Are the dropped commands not being received at all, or overflowing at the end of a buffer, or being received malformed?
These "dropped" commands include move commands as well as occasional pen up/down commands.
There is some pausing while drawing -- presumably due to buffering issues -- but it does not seem clearly correlated to the complexity of the drawing, nor to the dropping of commands.
I have been able to get it into a mode where the XY motion still works, but the brush does not respond to manual commands. Trying to raise or lower the brush gives a "Connection Problem :(" status. Disconnecting and reconnecting the WCB, or running the scratch program seems to reset it.

techninja commented 9 years ago

I can corroborate the suspicion of "dropped" commands. A missing move command is far more common than a pen up/down command, likely only because there are more of them for certain tests. Not to mention that would at least partially explain the visual results. I've also seen the up/down state lock issue, which I believe I have a fix already in place on the new CNCServer branch.

One thing we need to do is verify that the problem does not lie with Scratch itself. There is a possibility that under certain circumstances scratch doesn't actually finish sending one command before moving on to the next when moving quickly through instructions. This could still be a problem with CNCServer being non-compliant, but we won't know without a consistent test case.

To boil this down, to problems here remain:

[Possible] Intermittent commands drop or do not make it to the buffer.
- To confirm: Need to get it to happen consistently, and possibly for example, try to draw squares and have them turn into triangles.
- Once confirmed: I need to take test case files and dig through to find where the actual problem lies (with scratch, or with cncserver)
[Confirmed] When buffering commands, execution intermittently pauses for extended periods.
- When I get to a good moment after finishing the current cncserver branch, I'm going to see if I can't get this to happen on my linux dev machine outside of RP. If I can, likely find the root cause and get this fixed. I'm generally suspecting the way that loops are being handled means that if one part of the program is doing too much, it will start blocking other parts of the app.
[Confirmed] Brush stops responding to up/down until restart.
- Again, I'm still not totally sure when this happens, but I have seen it happen. If I can get some kind of clear repeatable steps, I have a good chance of proving I've fixed this. I'll attempt to find this on my own and will report back, but please feel free and do some searching on your own as well.

I'm going to work through this week finishing writing my automated tests and get this branch to master, then will be rolling a 0.9.6 pre-release for you to test for this specifically. Hopefully we can test quickly that the issues have abated, or are untouched and I can continue to improve from there.

oskay commented 9 years ago

Running the same test figure repeatedly seems to give consistently different results. I don't think that we will be able to find a genuinely consistent failure mode.

Given that constraint, what can be done? Is there any way to directly intercept and log exactly what Scratch is sending?

oskay commented 9 years ago

I'm a little out of my depth here, but I thought that I might be able to use a http "sniffing" tool like mitmproxy/mitmdump or at least tcpdump to record the traffic being sent out by Scratch.

It appears that Scratch is constantly polling port 4242, from the moment that the file is opened, to the moment that it is closed. This appears to be the "30 Hz" polling. It's nearly impossible to see anything aside of that. The load gets even higher (of course) while plotting.

The port that Scratch is using seems to hop around on occasion, especially when things get busy.

If RoboPaint is not running, the localhost port that Scratch is using is incremented with each poll attempt... suggesting that the port changes when there is an error. Could it be that what we're seeing in the Turtle-Drops (tm) is correlated to these shifts in ports?

techninja commented 9 years ago

Totally not out of your depth as much as you think! That's a very good tactic for seeing what's being done under the hood. I hadn't yet decided it was worth it to try, but that's a great debug method, less the difficulty in filtering to just what you want.

The port change is worrying, possibly pointing to these requests stacking up on top of each other until they overflow and the port is changed, possibly leading to what we're seeing. We should likely test in scratch online with the following bookmarklet:

javascript:(function(){var%20s=document.createElement('script');s.setAttribute('src','https://cdn.rawgit.com/techninja/cncserver/8a95b420519e09a6692177a8a4d5fa87fc26b54c/watercolorbot_scratch.js');document.body.appendChild(s);}());

This should (A.) ensure the commands sent are traceable in the browsers devtools (hopefully), and (B.), likely limit the amount of "strange behavior" that could be working behind the scenes. If we can confirm the problem exists just as well with the online version of scratch, that the problem likely resides fully on CNCServer, but if the problem is considerably different/absent in the online version, than the issue likely resides with Scratch offline, and possibly because these extensions are still considered "experimental", and may require us to change the way that they work due to "recent" changes in how these things work.

I'm not sure if there are ANY other scratch extension projects as ambitious as ours, attempting to duplicate the full drawing output to a real robot, and keep everything straight with the UI. One missed command and we're fudged. If our testing hits a wall, I'd suggest we move to the Scratch offline editor Github issue queue with our findings and bounce our hypothesis off them. They may have some insight into the inner workings that may explain this behavior quite quickly.

oskay commented 9 years ago

I have not checked whether Snap has the same kinds of issues. If not, and if it is free of both errors and polling, that could be another good data point.

oskay commented 9 years ago

Testing this morning!

Setup: RoboPaint v. 0.9.6 (electron). Scratch Offline Editor (v. 437)
Monitoring via Wireshark
Sketch: Sierpinski gasket example, 6 iterations.
- Run 1: One port jump during plot. No apparent disruption to plot.
- Run 2: One port jump during plot. No apparent disruption to plot.
- Run 3: Wasn't watching wireshark. No apparent disruption to plot.
- Interesting: WCB does not seem to park after end of plot. (Apparent bug: the "park" block in Scratch does not work in 0.9.6. Park does work from within the GUI of RoboPaint 0.9.6, and from this same Scratch session, when talking to 0.9.5.)

Since that seems to be flawless, re-test with RoboPaint v 0.9.5:

Same sketch, same Scratch session.
- Run 1: Flawless
- Run 2: Immediately off -- no chance to see if due to port change. Radically off course and stuttery.
- Reset WCB to home position just in case. Unlock & move -- did not reset WCB, nor restart anything. It appeared to actually be in the home position. As in previous tests, it appears that absolute position and limit checking are maintained, even through loss of turtle position.
- Run 3: Flawless
- Run 4: Immediately off. GO HOME ROBOT, YOU ARE DRUNK.
- Run 5: Flawless.
- Run 6: Flawless.
- Run 7: Light stuttering. Drawing flawless.
- Run 8: Light stuttering. Drawing flawless.

Not sure if anything has changed; things seem to be working more reliably than in previous tests. The amount of elapsed time between subsequent runs may be a factor. I have been unable to confirm any correlation between position loss and port changes via wireshark.

Back to RoboPaint v. 0.9.6 (electron).

Same sketch, same Scratch session
- Run 1: Severe stuttering. Drawing flawless. (Do we have leftover "crud" from previous runs?)
- Run 2: Severe stuttering. Turtle lost.
Restart Robopaint and Scratch, close some processor-hogging browser windows.
- Run 1: Flawless.
- Run 2: Flawless.
- Run 3: Flawless.
- Run 4: Flawless.
- Run 5: Flawless.
- Run 6: Flawless.
- Run 7: Turtle position lost! (After: Park home. Swap dry pen for a new one....)
- Run 8: Flawless.
- Run 9: Flawless.
- Run 10: Flawless.
- Run 11: Flawless.
- Run 12: Flawless.
- Run 13: Flawless.
- Run 14: Flawless.
- Run 15: Flawless.
- Run 16: Flawless.
- Run 17: Flawless.
- Run 18: Flawless.
- Run 19: Turtle position lost!
- Run 20: Flawless.

All considered, it seems like the new version is MUCH more reliable, but there is still something fundamentally "rotten" here. Could this be a simple matter of speed? Version 0.9.6 seems "faster" -- is that the reason that this works so much better? (And if so, can we make it faster yet?)

oskay commented 9 years ago

Testing scratch ONLINE version.

Bookmarklet from above: does not seem to produce working blocks.
Opening saved Sierpinski gasket sketch: Does work, so long as Scratch window is in foreground.

Tests with RoboPaint 0.9.6:

Run 1: Flawless.
Run 2: Flawed.
Run 3: Flawless.

Not enough statistics to understand this yet.

oskay commented 9 years ago

Also: Snap does not seem to spew the constant stream of requests that Scratch does, for what that's worth.

oskay commented 9 years ago

I've re-written the Sierpinski gasket (still without iteration... * ) in Snap, so that we can do the direct comparison. Having run this and the arrowhead curve more times than I care to admit, I have not yet seen any error in Snap.

techninja commented 9 years ago

I'm going to go ahead and close this as it seems the base issues have mostly been dealt with, this can and should be reopened if similar issues show up, unless we have more obvious information to go on, then a better titled and documented bug should be filed. Good work everyone!

evil-mad / robopaint

SOMETHING is rotten... in RoboPaint! #209