area515 / Photonic3D

Control software for resin 3D printers
http://photonic3d.com
GNU General Public License v3.0
131 stars 115 forks source link

Crazy issue with zombie .gcode files from similarly named CWS print attempts #335

Closed thetrebor closed 6 years ago

thetrebor commented 6 years ago

Case: using photonics browser gui to upload multiple versions of a cws file to a LCHR running the phtonic3d software sometimes causes weird job failures.

Steps to reproduce: Make a scene in CWS or zip with slices Upload the file, and start a print Cancel the print at the first slice Rename the CWS, make a minor change to the scene (delete a layer, e.g.) Upload the file, and start a print

() not sure if entries marked with are necessary or were artifacts of me trying to get a successful print started 11 times.

You will probably start getting failures based around the private method File findGcodeFile(File jobFile) in CreationWorkshopSceneFileProcessor.java which for some reason is detecting the OLD gcode when it does a file scan and it finds two or more. Verified by manually renaming the gcode file in a subsequent upload to something malformed and was able to get past the error. This was the 11th try. This indicates that one (probably the first) canceled print failed to clean up after itself and left a stray gcode file in the way that wasn't ignored when you uploaded a CWS which was substantially the same apart from a new file name and minor tweaks inside files. This zombie gcode file was left in place even after rebooting device and restarting the photonics software multiple times.

I'm guessing one or more of the following is possibly true: the java file handle isn't letting go so when you delete / cancel a print the unzipped tmp files are left hanging out. a similarly named cws scene will later be unzipped and the "duplicate" gcode files will be swept up in a file scan for .gcode files. the tmp directly for the unzip isn't being cleaned, either due to permissions, file handles, or system gnomes the method is not properly scoped to the current working directory for the current model or maybe something more esoteric.

i recommend making strongly sure you have a really clean super uniquely named tmp directory for these models every time you start a print process.

thetrebor commented 6 years ago

LogBundle-2.zip LogBundle.zip

WesGilster commented 6 years ago

Thanks for finding a reproducible scenario. We've had complaints of this issue in the past, but haven't had a way to reproduce it reliably. Does your scenario reproduce it 100% of the time?

Technically, Photonic3D cleans up after itself before and after a print. I can see there are some conflicts due to you running a print multiple times: org.area515.resinprinter.exception.SliceHandlingException: org.area515.resinprinter.job.JobManagerException: It currently isn't possible to print more than 1 Creation Workshop Scene file at once.

Due to the sporadic cleanup issues we've had, quite a while back, I put in some debugging code in here: org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.deleteDirectory(File) This is the first feedback I've gotten to determine what is happening when this problem happens. The Apache deleteDirectory() code believes the directory is empty, and so it tries to delete the directory. This fails, because apparently the OS found two files remaining: drwxr-xr-x 2 root root 4096 Aug 19 16:56 . drwxr-xr-x 3 root root 4096 Aug 19 16:56 .. -rw-r--r-- 1 root root 290306 Aug 19 16:56 apollo_0001.stl -rw-r--r-- 1 root root 218612 Aug 19 16:56 manifest.xml

I repeat this a couple of time before I give up and move on. This is really nasty, because it seems as though the file system get's corrupted as restarts don't seem to fix the problem.

Are you sure this problem isn't due to image previews happening simultaneously with a print? For example, this problem is happening on image preview:

2017-08-19 16:56:53,117 ERROR o.a.r.j.CreationWorkshopSceneFileProcessor [qtp31335628-19] Error List
java.io.FileNotFoundException: File does not exist: /tmp/printdir/apollo.cws/apollo0420.png
    at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2275) ~[commons-io-2.4.jar:2.4]
    at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) ~[commons-io-2.4.jar:2.4]
    at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) ~[commons-io-2.4.jar:2.4]
    at org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.deleteDirectory(CreationWorkshopSceneFileProcessor.java:253) [photonic3d.jar:?]
    at org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.prepareEnvironment(CreationWorkshopSceneFileProcessor.java:346) [photonic3d.jar:?]
    at org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.renderPreviewImage(CreationWorkshopSceneFileProcessor.java:290) [photonic3d.jar:?]
    at org.area515.resinprinter.job.AbstractPrintFileProcessor.buildPreviewSlice(AbstractPrintFileProcessor.java:482) [photonic3d.jar:?]

To your suggestion about using a unique directory on each print. Yes, that's an option we've considered with the UUID.getRandom() but this is going to cause quite a few orphaned files out in the tmp directory and it doesn't really help for optimizing the image preview. We don't really want to extract all files into the tmp directory under a new UUID every time that the GUI asks us to preview the image.

thetrebor commented 6 years ago

i'm out of town for a week but will attempt to re-verify a set of steps to reproduce. i was specifically not trying to reproduce it, but seemed to hit it by accident a few times and attempted to introspect a possible cause based on what i did different trying this print. i'll try to trigger it overtly when i'm back.

another possible cause may be the lag in response between the remote UI and the on-device controls. so if there's a combination of cancels and starts it may trigger an annoying race. the user ui feedback on adding, uploading, starting, cancelling, and inspecting jobs may be part of this culprit as it's easy to accidentally send conflicting requests and leave yourself in a state that's hard to clean up, especially if the user panics or gets impatient.

i'm not entirely sure it's unrelated to image preview but since it's stl and manifest being left behind it would seem that's possible. do you really need to generate a preview when the slice images are right there anyway? that seems like it would be computationally easier to grab the first image from the cws manifest and use that since that's what you're printing anyway not an unrelated slice done on-machine.

evidently there's a method FileUtils.forceDelete(dir); that might help, if you're not already using it. open file handles may still be a problem though.

as for the tmp directory. you can create your own tmp subdir and keep that clean right? so /tmp/photonic/jobs/[job-guid] the cleaning process can look for crusty old 'job-guid' directories that are probably left over cruft. and it also gives you a blank slate for each upload request. you keep uploaded cws and zip in /tmp/photonic/models (or whatever) and when the user grabs one to print you create a guid and work from there.

of course i may be drastically oversimplifying things and i've only taken a quick skim through the code to see what was generating my errors. i appreciate your hard work on this for the community. hopefully i can help more soon.

On Sun, Aug 20, 2017 at 2:20 PM Wes G. notifications@github.com wrote:

Thanks for finding a reproducible scenario. We've had complaints of this issue in the past, but haven't had a way to reproduce it reliably. Does your scenario reproduce it 100% of the time?

Technically, Photonic3D cleans up after itself before and after a print. I can see there are some conflicts due to you running a print multiple times: org.area515.resinprinter.exception.SliceHandlingException: org.area515.resinprinter.job.JobManagerException: It currently isn't possible to print more than 1 Creation Workshop Scene file at once.

Due to the sporadic cleanup issues we've had, quite a while back, I put in some debugging code in here:

org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.deleteDirectory(File) This is the first feedback I've gotten to determine what is happening when this problem happens. The Apache deleteDirectory() code believes the directory is empty, and so it tries to delete the directory. This fails, because apparently the OS found two files remaining: drwxr-xr-x 2 root root 4096 Aug 19 16:56 . drwxr-xr-x 3 root root 4096 Aug 19 16:56 .. -rw-r--r-- 1 root root 290306 Aug 19 16:56 apollo_0001.stl -rw-r--r-- 1 root root 218612 Aug 19 16:56 manifest.xml

I repeat this a couple of time before I give up and move on. This is really nasty, because it seems as though the file system get's corrupted as restarts don't seem to fix the problem.

Are you sure this problem isn't due to image previews happening simultaneously with a print? For example, this problem is happening on image preview:

2017-08-19 16:56:53,117 ERROR o.a.r.j.CreationWorkshopSceneFileProcessor [qtp31335628-19] Error List java.io.FileNotFoundException: File does not exist: /tmp/printdir/apollo.cws/apollo0420.png at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2275) ~[commons-io-2.4.jar:2.4] at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) ~[commons-io-2.4.jar:2.4] at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) ~[commons-io-2.4.jar:2.4] at org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.deleteDirectory(CreationWorkshopSceneFileProcessor.java:253) [photonic3d.jar:?] at org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.prepareEnvironment(CreationWorkshopSceneFileProcessor.java:346) [photonic3d.jar:?] at org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.renderPreviewImage(CreationWorkshopSceneFileProcessor.java:290) [photonic3d.jar:?] at org.area515.resinprinter.job.AbstractPrintFileProcessor.buildPreviewSlice(AbstractPrintFileProcessor.java:482) [photonic3d.jar:?]

To your suggestion about using a unique directory on each print. Yes, that's an option we've considered with the UUID.getRandom() but this is going to cause quite a few orphaned files out in the tmp directory and it doesn't really help for optimizing the image preview. We don't really want to extract all files into the tmp directory under a new UUID every time that the GUI asks us to preview the image.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/area515/Photonic3D/issues/335#issuecomment-323602243, or mute the thread https://github.com/notifications/unsubscribe-auth/AASq9sZhzD-hHzZ8EjdGtMZS4QeB8W_Cks5saHjrgaJpZM4O8aK5 .

thetrebor commented 6 years ago

Okay i've played with this a bit more. To simplify everything i tried to upload just a naked STL which according to your docs should just work and let me print using an on-device print profile. This fails. Now the interesting part, and the part which may be the core of your issues is that it appears to be getting freaky with the names of the files that are being uploaded. I uploaded "succubus-support.stl" and it failed. so i renamed it to "succubus-support2.stl" and tried again and it failed. I looked at the logs and noticed that there was no mention of "succubus-support2.stl". It appeared to be "stuck" thinking about the original file name every time i tried uploading. So i quit the application, uploaded #2 and the logs started to reflect that (thought it still failed). Then i renamed it to "succubus-support3.stl" and boom, it's still thinking about #2. Even more oddly, it seems to be carrying on about a file named "apollo" as well, which had been removed from the machine. My quick diagnosis is that if something goes wrong with the upload, the AngularUI gets "stuck" and even if you try to give it a new file name it just tries to upload the same old thing again until you restart it. If that file doesn't exist it may be uploading a blank or null file causing an invalid EOF error on the device. Now there appears to be a second confounding bug that may have to do with deleting files when they are deleted from the on-machine UI. And there may be a third confounding bug that causes the system to silently reject some files as it appears that the failure is being triggered by a timeout and not a actual error handler. LogBundle.zip https://www.dropbox.com/s/ktcv6ln9ym93aku/succubus-support3.stl.zip?dl=0

You of course will be much better at reading and interpreting these files. I'm positing: 1) upload file x1 -> it fails for any reason (your error messages are very very laconic) 2) rename file x1 -> x2 (the user says, "maybe i did something wrong" and tries to fix it) 3) upload file x2 -> it fails because it's stuck thinking about x1 4) on machine processes panic trying to deal with nonexistent x1 5) everything feels broken to the user and they do increasingly superstitious things to try to get it work 6) you get a weird complaint

WesGilster commented 6 years ago

Awesome, thanks for the recreation steps. I'll look at that.

To your other STL issue, please open a separate bug if you actually believe this is an issue. Yes, it's possible that a 120MB STL file is going to be too large for the Raspberry Pi to slice on the fly. image

With the proper amount of memory, Photonic3D "should" slice anything: image

WesGilster commented 6 years ago

You mention that there could be any error uploading a file. If any error will reproduce this stuck file problem, could you send me an example of how to reproduce one. If not, I'm hoping that a clientside cancel will reproduce this stuck file issue in a similar way. It looks like that's what you were attempting to do:

2017-09-01 22:17:33,664 ERROR o.a.r.s.PrintableService [qtp28835272-195] Error while uploading file: /root/uploaddir/succubus-support.stl
org.eclipse.jetty.io.EofException: Early EOF
    at org.eclipse.jetty.server.HttpInput$3.noContent(HttpInput.java:791) ~[jetty-server-9.3.7.v20160115.jar:9.3.7.v20160115]
    at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:157) ~[jetty-server-9.3.7.v20160115.jar:9.3.7.v20160115]
WesGilster commented 6 years ago

Clicking the cancel button doesn't cause an error as I had hoped, I fiddled around a bit and found that spamming the project image button does cause an error, so I focused on that:

What I found was that that unpack was being executed way more often than it should. In the new dev build, it will now only unpack and clean when the CRC checksum differs from the last upload. This means unpacking will happen even if you change a single letter, but the filename is identical. This should be quite a performance enhancement.

All that said, it doesn't fix the problem you are specifically referring to as I'm having significant trouble getting the upload to fail on the very first step of your instructions.

thetrebor commented 6 years ago

so i noticed something odd that might contribute. i have the software full screened as a separate space on my mac. when i upload, because i'm impatient i'll switch desktops to another task and then return a few minutes later to check on it. i noticed when you swap back to the screen the angular ui appears to start the progress bar again (I've never seen this before in angular myself). is it accidentally spamming the upload if you unfocus and refocus the application? also i feel like i have to be abnormally patient and not molest the ui in any way during an upload. if you click on the background of the upload modal, for instance, it will dismiss the modal and that may be leaving it in a weird state if you click around the other tabs. i wish i could run this in a dev build for you but i only have a consumer version of the controller running on my liquid crystal hr

as for the stl file upload failing i was being very patient and trying not to rock the boat. all i did was grab that stl and upload it from the desktop ui to the controller. it was considerably smaller than the 300mb file i printed successfully earlier as a CWS. is there some other info from my rig to see if it memory or something else?

On Sun, Sep 3, 2017 at 5:08 PM Wes G. notifications@github.com wrote:

Clicking the cancel button doesn't cause an error as I had hoped, I fiddled around a bit and found that spamming the project image button does cause an error, so I focused on that:

What I found was that that unpack was being executed way more often than it should. In the new dev build, it will now only unpack and clean when the CRC checksum differs from the last upload. This means unpacking will happen even if you change a single letter, but the filename is identical. This should be quite a performance enhancement.

All that said, it doesn't fix the problem you are specifically referring to as I'm having significant trouble getting the upload to fail on the very first step of your instructions.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/area515/Photonic3D/issues/335#issuecomment-326831405, or mute the thread https://github.com/notifications/unsubscribe-auth/AASq9metCLdqJM0926UnIcXMwIrTeCz6ks5sexVYgaJpZM4O8aK5 .

WesGilster commented 6 years ago

Yes, it will drop the modal dialog, but honestly, I don't believe the upload actually cancels. I'll check that for sure and I'll fiddle around with that a bit more. Its possible Mac has some quirks that I'm not going to be able to reproduce.

Ouch, in one of the Photocentric builds, they disabled direct STL printing, hopefully it's not yours. Otherwise, you can send the stl to me and I can see if it prints. Generally tho, the Raspberry Pi does a perfectly good job(outside of resourcing issues) of doing anything that a regular Windows box or Linux distro can do. As you can see above, my Windows machine sliced fine with zero enhancements. If you want to give me a dump of what happened, I'm sure it'll tell me what it was. As of yet, I don't really know "when" specifically you are saying the stl failed. On upload, on click, on preview, on print... Feel free to open a bug on this since these two topics are somewhat unrelated.

DanielJoyce commented 6 years ago

If you're worried about orphan files, there are all sorts of ways to handle that. One is tmp is cleaned up on reboot.

https://stackoverflow.com/questions/28752006/alternative-to-file-deleteonexit-in-java-nio https://alvinalexander.com/java/java-temporary-files-create-delete

Ways to skin the cat wrt to having files and temp directories clean up after themselves, either on jvm exit or file close. NIO has options as well.

Typically what one does is when a file is uploaded, you put it in a temp directory. And when upload completes you move the file to its final desired location. If something goes wrong at any step you clean it up.

DanielJoyce commented 6 years ago

CRC is unreliable as many files can have same CRC. You should use SHA1 to determine file uniqueness. CRC merely checks corruption, it should not be used for uniqueness.

WesGilster commented 6 years ago

As you mentioned, SHA1 guarantees cryptographic assurance of uniqueness of a single file against all other files ever created(excluding the "SHAttered" collision). We simply need to ensure that one version of a file of the same name is different than the previous version of the same name. That's a comparison of two files which is slightly closer to how CRC was envisioned to work. Where SHA1 is a bit more relevant is comparing a set of bytes against several other sets of bytes, which is why I used SHA1 for the customizer front end image cache and because that set of bytes is much smaller.

The code works precisely as you describe with the precise method that you reference here: org.area515.resinprinter.services.PrintableService.saveFile(InputStream, File) https://github.com/area515/Photonic3D/blob/834a34bcbfcde693a240c45f349c50bd932ef9b5/host/src/main/java/org/area515/resinprinter/services/PrintableService.java

With closer analysis you'll find the problem used to be on concurrent execution of the unpack: org.area515.resinprinter.job.CreationWorkshopSceneFileProcessor.deleteDirectory(File) https://github.com/area515/Photonic3D/blob/834a34bcbfcde693a240c45f349c50bd932ef9b5/host/src/main/java/org/area515/resinprinter/job/CreationWorkshopSceneFileProcessor.java

I haven't seen the problem since I've introduced the file synch and CRC checksum. I'll probably be closing this bug soon if it can't be repeated with the development version.

WesGilster commented 6 years ago

Closing since this bug was fixed long ago in development