MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0

Recover from communication glitches #2000

Closed amigoloko closed 8 years ago

amigoloko commented 9 years ago

Dear Marlin Community,

I'm using the Marlin firmware connected to Repetier Host. Randomly and suddenly, the printer just stops mid-print, and Repetier Host gets stuck at a code (as seen in the log while it is sending them). Pausing, stopping, or moving the printer with Repetier Host commands does not work.

I have to turn everything off (printer, Repetier Host) and then restart the print from the beginning.

Have you run into something like this? I am using an Arduino Mega 2560. One modification I have made to the Arduino is to bypass the reset by cutting the trace and adding a pin/jumper for uploading the firmware: with the jumper you can upload, without it you can't. I don't think this matters, but just in case.

From my point of view, it seems that Marlin is looping on the buffer, waiting for more commands, but it is hard to know for sure. Maybe Repetier and Marlin lost their connection. Another thought: it might be Repetier Host; even though it does not crash, it just seems to stop sending commands.

appreciate any help.

thinkyhead commented 9 years ago

I have found USB printing to be so flaky with my Mega2560-based RAMPS in general that I will only print from SD now. But, I suspect both a weak USB signal and too much electrical noise on the connection. Still, it would be nice to be able to recover cleanly from a lost connection, when that is the case.

lcfm1 commented 9 years ago

@amigoloko - I faced a similar problem: my control board is under the heated bed and it overheated. The thermal protection kicked in and shut it down. I added a little heat insulation and the problem disappeared.

lcfm1 commented 9 years ago

@thinkyhead - Do you have a servo? I ask because my board periodically refused to print.

amigoloko commented 9 years ago

@lcfm1 - The Arduino/RAMPS are in a completely isolated box away from any heat, and the RAMPS also has two 12 V fans running all the time.

amigoloko commented 9 years ago

@thinkyhead What would you suggest is generating the electrical noise? Would a better quality USB cable help? A shorter one?

amigoloko commented 9 years ago

Update: I have found that this happens more often when the Windows power settings are not set to disable USB power shutdown. However, it still happens even when Windows is not allowed to power down the USB ports.

Wurstnase commented 9 years ago

If you take a look at the log and activate ACK, you can see the last command. Under this command there should be an 'ok'. If not, #1922 should help.

amigoloko commented 9 years ago

@Wurstnase OK, let me check this...

lrpirlet commented 9 years ago

@thinkyhead… A short USB cable certainly helps, but what you really want is a noise-filtered cable. For more information, have a look here: http://en.wikipedia.org/wiki/Ferrite_bead Because of transmission errors, I replaced my 75 cm unfiltered cable with a 140 cm filtered USB cable and now have no problems at all… Note that both ends should have an EMI filter…

fmalpartida commented 9 years ago

I have no problem at all using my SAV MkI using a native USB interface. It runs at 12Mbps peak transfers with a BER which is a joke. I use a 1.5m cable with a choke but also use it with a rubbish cable flawlessly.

One thing that would help is to look at the console and see how many retransmissions happen.

amigoloko commented 9 years ago

@fmalpartida I also had my doubts about the cable, since some printers work like a charm while others run into this issue constantly. Nevertheless, I will change the cables on the printers that show this problem (right now it is just one).

thinkyhead commented 9 years ago

Admittedly, it has been over a year since I tried USB printing, so it may be improved now.

atunguyd commented 9 years ago

I had almost the exact same problem, I observed the following when it happened:

  1. There is nothing in Repetier's log to indicate an issue; the last command is sent and then there is nothing.
  2. Every time this has happened, I have found that the heaters are off but the stepper motors are still armed (as in, you can't move them by hand). I suspect that the Marlin firmware times out and switches off the heaters.
  3. Removing and reinserting the USB cable on the PC resolves the issue.
  4. Rebooting the PC (without rebooting the Arduino board) also resolves the issue.

I replaced the laptop I was using with my printer and have not seen this problem at all since then. I will reinstall Windows on that laptop soon and retry it.

Wurstnase commented 9 years ago

This issue can happen with bad USB connections. To work around it, you can try different baud rates. If there is no 'ok' after the last command, the host will wait forever. That's the reason for my PR.

Repetier started with this: "Send "wait" when firmware is idle. Helps solving communication problems when host supports it."
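
To make the idea concrete, here is a minimal host-side sketch of how a sender could use an idle "wait" message to notice a lost 'ok'. The class and method names are hypothetical, and the policy shown (treat "wait" as the missing acknowledgement and move on) is only one possible choice; it is not taken from Repetier Host or from #1922. Without line numbers the host cannot tell whether the command or only its 'ok' was lost.

```python
from collections import deque


class WaitAwareSender:
    """Toy host-side sender (hypothetical): one command in flight at a time."""

    def __init__(self, commands):
        self.pending = deque(commands)  # commands not yet sent
        self.in_flight = None           # command still waiting for its 'ok'
        self.sent = []                  # everything written to the simulated port
        self.recovered = 0              # how many lost acknowledgements were detected

    def _send_next(self):
        # Take the next command, if any, and "send" it.
        self.in_flight = self.pending.popleft() if self.pending else None
        if self.in_flight is not None:
            self.sent.append(self.in_flight)

    def start(self):
        self._send_next()

    def on_line(self, line):
        """Feed one line received from the firmware."""
        line = line.strip()
        if line.startswith("ok"):
            self._send_next()
        elif line == "wait" and self.in_flight is not None:
            # The firmware says it is idle while we still expect an 'ok':
            # assume the acknowledgement was lost and carry on.
            self.recovered += 1
            self._send_next()
```

A "wait" that arrives while nothing is in flight is simply ignored, so a host that has finished its job is not disturbed by the idle messages.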

amigoloko commented 9 years ago

@atunguyd That might be the USB power setting in Windows. Have you checked that?

atunguyd commented 9 years ago

@amigoloko Are you referring to the sleep settings? I doubt it, as I have disabled sleep on this laptop (can't have a laptop go to sleep during a print). When the problem occurs, Windows is very much running.

I also notice that when this occurs, even an emergency stop (which I believe toggles the DTR line) does not fix the problem, which also points to the USB interface having failed.

thinkyhead commented 9 years ago

@Wurstnase Looking forward to incorporating that fix. What's the consensus at this point – is everyone happy with your latest code? (My mind is only on bed leveling lately.)

Wurstnase commented 9 years ago

At least for Repetier Host, the "wait" will work. I have had good experiences with it. My printer/PC/USB setup also has some issues, and I get a missing ok in every print. If I didn't have that part in my personal fork, I could throw away most of my prints.

I haven't tested the line-number part. I can't test it right now because my Z sensor is broken and I gave my last one away. Bad timing :)

thinkyhead commented 9 years ago

@Wurstnase Before we deploy the "wait" feature #1922 we should make sure it works ok with Cura Host (@daid) and Printrun (@kliment) too.

Wurstnase commented 9 years ago

Where is this Cura Host? Do I need any extra add-on?

I can add a temporary gcode for injecting a missing ok.

thinkyhead commented 9 years ago

@Wurstnase If you have the Cura application, that's what I mean by "Cura Host." If you use Cura to do a print job, you can see how it deals with "wait". I'm not sure how you can test the fault that it's meant to fix.

Wurstnase commented 9 years ago

I have Cura on my computer, and I think some time ago I found the host app inside it. But is it somehow hidden? Or does it only appear once I have sliced something?

thinkyhead commented 9 years ago

If your printer is connected and you have sliced an object, the middle button in the 3D View is the "Print with USB" button.

Wurstnase commented 9 years ago

Ah, OK. When Cura/Pronterface/OctoPrint get this feature, I will test it immediately. Both 'wait' and 'ok linenumber' are optional parts.

daid commented 9 years ago

Cura needs no wait feature.

See: https://github.com/daid/Cura/blob/SteamEngine/Cura/util/machineCom.py#L477 There I force a line send when the communication looks stalled.
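
As a rough illustration of timeout-based stall detection, the sketch below tracks the time of the last serial activity and reports a stall once a configurable interval has passed. It is not the code from machineCom.py; the class name, method names, and the 3-second default (the figure mentioned in this thread) are assumptions for illustration only.

```python
class StallWatchdog:
    """Hypothetical helper: tells a host when the link looks stalled."""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s   # how long to wait before calling it a stall
        self.last_activity = None    # timestamp of the last send or receive

    def mark_activity(self, now):
        """Call whenever a line is sent or received."""
        self.last_activity = now

    def is_stalled(self, now):
        """True once no activity has been seen for at least timeout_s seconds."""
        if self.last_activity is None:
            return False
        return (now - self.last_activity) >= self.timeout_s
```

In a real host loop the timestamps would typically come from `time.monotonic()`, and a detected stall would trigger a forced send followed by `mark_activity()`. Making the timeout a parameter lets it be tuned per printer rather than fixed in the host.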

Wurstnase commented 9 years ago

Sure, maybe the host can handle this. But this part is more printer-dependent, and I think it should be configurable in some way. In Cura it's hardcoded. Also, 3 seconds is way too long for my printer. The print will get a lot of blobs.

Anyhow, the printer itself knows if it has nothing to do anymore.

daid commented 9 years ago

In testing, the failed state from which that re-send recovers only happens once every 100 hours. If you get a lot of these errors, you should look at the rest of your setup, as at certain S/N ratios it's simply not feasible.

Wurstnase commented 9 years ago

Right. Many, if not most, people will never have issues. But some do. I tried a lot (a new USB cable, different baud rates), but it still happens every hour or so.

This is an optional feature and not everyone will need it. But someone will.

daid commented 9 years ago

Are you sure you are not running into the other USB error?

My old laptop gets "device reports readiness to read but returned no data" exceptions on the USB Serial. Nothing I can do about it, whole communication just stops.

nophead commented 9 years ago

Note the title is wrong. You can't recover from a USB disconnect unless the host closes the port and reopens it because USB is a connection based protocol. Also USB has a link level retry mechanism so it should never lose data. The data should get there intact or the connection be lost, never missing data or corrupt data. If you get that you either have a driver bug on your PC or a hardware bug between the MCU and the USB chip.

With properly working hardware you should never see an error, even every 100 hours, because USB already detects and corrects them with CRCs, timeouts and retries. If that fails it should disconnect.

daid commented 9 years ago

@nophead Except that, very occasionally, the serial data between the ATMega2560 and ATMega16U2 gets corrupted. (That is for an Arduino Mega. Some newer boards use a single-chip solution that should never have this issue.)

Wurstnase commented 9 years ago

Yes @nophead, I don't have a disconnect. For some reason the firmware doesn't send an 'ok' or the host doesn't receive one. This is the problem. In that case, the heater is still active, the complete printer is still active, only the host doesn't send any commands because it is waiting for the ok.

nophead commented 9 years ago

Probably poor PCB layout or a bug in the ATMega16U2 firmware. With a Melzi and genuine FTDI chips I never see errors.

Wurstnase commented 9 years ago

> Probably poor PCB layout or a bug in the ATMega16U2 firmware. With a Melzi and genuine FTDI chips I never see errors.

Maybe, but it needs a solution.

foosel commented 9 years ago

I agree... I have a lot of users running into that, and just telling them "get proper hardware" is not going to cut it, no matter how true it may be. So anything we can do on the protocol side of things to recover from such issues helps the quite heterogeneous user base out there (and hence the people who try to support them).

thinkyhead commented 9 years ago

@amigoloko #1922 has been merged, so if you get the latest code, try enabling the new option and see if it helps.

amigoloko commented 9 years ago

@thinkyhead I will. For the record, here is what I have tried so far on the one machine with the continual stops: two brand-new cables (internal and external, not cheap ones) and a handmade ferrite shield for the cable. Result: the continual stops vanished. The less frequent stops are still there, and they are very hard to predict and to catch. I have noticed that the Repetier log sometimes shows two commands (codes) at a time, then an OK. Should the OK come immediately after each code?

thinkyhead commented 9 years ago

Are there particular codes that don't say "ok" right away, or is there no pattern there?

nophead commented 9 years ago

All codes only reply when they are finished apart from G1, which replies when it is put into the planner queue. So any code that takes significant time and isn't a G1 will delay the OK. Also any codes that stack behind a slow one in the command queue will also reply late.
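
The timing described above can be sketched with a toy model. It assumes commands are handled strictly in order, that G0/G1 are acknowledged the moment they are queued (planner space is assumed to be available), and that every other code is acknowledged only when it finishes. The function name and the durations are hypothetical, and the time spent actually executing queued moves is ignored.

```python
def ok_times(commands):
    """commands: list of (gcode, seconds_to_finish) in the order received.

    Returns a list of (gcode, time_at_which_ok_is_sent).
    """
    t = 0.0
    result = []
    for gcode, duration in commands:
        word = gcode.split()[0].upper()
        if word in ("G0", "G1"):
            # Queued move: acknowledged as soon as it enters the planner queue.
            result.append((gcode, t))
        else:
            # Any other code: acknowledged only once it has completed.
            t += duration
            result.append((gcode, t))
    return result
```

For example, with `[("G1 X10", 2.0), ("G28", 20.0), ("G1 X20", 2.0)]` the first move is acknowledged at once, while the second move, stacked behind the slow G28, gets its ok only after G28 completes.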

thinkyhead commented 9 years ago

@nophead The situation could be improved on a code-by-code basis, to get them to send "ok" earlier. In fact, if it's only meant to be an acknowledgement of the command received, then we could just send "ok" at the top of process_commands() instead of at the end. It's not a lie to tell the host "ok" and then to throw an error and not run the command received, right?

daid commented 9 years ago

@thinkyhead "ok" means "command received and handled", as some commands also reply with extra data, and that data needs to be before or with the OK.

G1 and G0 should be seen as "queue move" in this aspect. Not as execute move.

thinkyhead commented 9 years ago

Great! So in that case, again we now have ADVANCED_OK that includes the sequence number, so there will be no more confusion, ever, ever again.

nophead commented 9 years ago

@thinkyhead, If you sent OK as soon as the command went into the queue the host would then send the next one and so on until the queue got full and then you would be into big delays again waiting for a slow command to complete and make a space in the queue.

thinkyhead commented 9 years ago

@nophead Well the important thing is, somehow it works most of the time, in proper cyberpunk jalopy fashion.

nophead commented 9 years ago

If there are no comms errors, which there shouldn't be with USB (only disconnects if the hardware and drivers are correct) then it will work. Problems only surface when there are comms errors because it isn't a properly designed link level protocol.

thinkyhead commented 9 years ago

@nophead Frankly, the communication protocol seems "good enough" at this point, in spite of various caveats. I notice that the Witbox and Hephestos configurations had added 1 to the BUFSIZE (5 instead of 4) claiming it helped. Perhaps on boards with extra space there might be value in having bigger buffers, I can't say. Anyway, totally unrelated to error-recovery, I know. Buffers are going to block.

The thing still lingering on my brain about "buffers r gonna block" is that maybe an alternative protocol (or mode) would work better – one where Marlin must explicitly ask the host for the next N commands, and the host then only sends commands when Marlin asks. The ADVANCED_OK basically does this by letting the host know how many new commands the firmware can handle, but it still leaves the choice open to the host…. Just brainstorming in circles here…
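
A host could read that extra information with a small parser like the one below. The reply layout assumed here, `ok N<line> P<free planner slots> B<free command-buffer slots>`, is an assumption about the ADVANCED_OK format that should be checked against the firmware source; the function name and the dictionary keys are hypothetical.

```python
import re

# One letter (N, P or B) followed directly by an integer, e.g. "N123".
_FIELD = re.compile(r"\b([NPB])(-?\d+)")


def parse_advanced_ok(line):
    """Parse an 'ok' reply; returns None for anything that is not an 'ok'.

    Fields missing from the reply (e.g. a plain 'ok') come back as None.
    """
    line = line.strip()
    if not line.startswith("ok"):
        return None
    fields = {key: int(value) for key, value in _FIELD.findall(line[2:])}
    return {
        "line": fields.get("N"),          # sequence number being acknowledged
        "planner_free": fields.get("P"),  # assumed: free planner slots
        "buffer_free": fields.get("B"),   # assumed: free command-buffer slots
    }
```

A plain `ok` still parses, with every field set to `None`, so a host written this way keeps working with firmware that does not enable the option.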

nophead commented 9 years ago

Well that is reversing the master slave roles, which would have a big impact on hosts.

foosel commented 9 years ago

> Well the important thing is, somehow it works most of the time

No. The important thing is not to get stuff working in ideal conditions. What distinguishes good from bad protocols/software/hardware is not how it performs under lab conditions but how (and if) it handles problems and recovers from them. Which is why proper testing involves testing failure cases, not just expected good cases.

Statements like the above scare me like hell when coming from a maintainer of probably the most used firmware for 3d printers.

Also, what @nophead said.

daid commented 9 years ago

The advantage of ADVANCED_OK is that it is (mostly) backwards compatible. Reversing roles isn't.

Communication errors (with the Arduino Mega 2560) happen, I've seen it. So the checksum is important.
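
For reference, the checksum used with numbered G-code lines in RepRap-style serial protocols is an XOR over every byte before the `*`. The sketch below frames and verifies a line that way; the helper names are hypothetical and this is an illustration, not Marlin's own implementation.

```python
from functools import reduce


def gcode_checksum(payload):
    """XOR of all bytes of the text that precedes the '*'."""
    return reduce(lambda acc, byte: acc ^ byte, payload.encode("ascii"), 0)


def frame(line_number, command):
    """Build a numbered, checksummed line such as 'N3 T0*57'."""
    payload = f"N{line_number} {command}"
    return f"{payload}*{gcode_checksum(payload)}"


def verify(framed):
    """True only if the line carries a checksum and it matches the payload."""
    payload, sep, checksum = framed.rpartition("*")
    return bool(sep) and checksum.strip().isdigit() and int(checksum) == gcode_checksum(payload)
```

Here `frame(3, "T0")` yields `N3 T0*57`, and changing a single character of the payload makes `verify` fail, which is what lets the receiving side ask for a resend instead of running a corrupted command.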

I'm using BUFSIZE 8 on the Ultimaker2. It helped with some internal issues, and the people who did use USB printing with it have reported few issues.

thinkyhead commented 9 years ago

> Statements like the above scare me like hell

It was not intended for your ears, particularly. In the moment it makes more sense. You had to be there.