[bugfix-2.0.x] TMC2130 Re-ARM E stepper disables randomly

vilsed commented 5 years ago

Hello all.

I'm having a really bizarre issue with my Re-ARM delta machine. During a print, seemingly at random, my E stepper driver gets disabled. The motor just stops moving. M122 reports the E driver status as "disabled". Sometimes it happens after half an hour, sometimes three hours in, sometimes ten hours in. I've tried swapping drivers around and disabling DISABLE_INACTIVE_EXTRUDER, no difference. The same thing happened with bugfix-2.0.x from mid December, and it still happens with the latest version.

And of course, as always, it's the only machine in our farm capable of printing PC, and we have an urgent order for PC parts. Life is life.

I'm attaching my configuration files. Thank you all in advance.

configuration.zip

p3p commented 5 years ago

I'm not really familiar with the TMC drivers and whether this would be reported in the serial output, but this sounds like overheat protection shutdown (or some other automatic protection kicking in)

swilkens commented 5 years ago

I believe there is a TMC debug flag that will output the status of the driver, that will tell you if the driver went into overtemp protection.

http://marlinfw.org/docs/gcode/M122.html

Though apparently the latest 2.0.x-bugfix no longer requires that flag to be enabled to use M122: https://github.com/MarlinFirmware/Marlin/issues/12942

vilsed commented 5 years ago

Yes, I forgot to mention. No overheating flag. At first I thought the same and added excessive cooling (two 80mm fans blowing basically right onto the drivers), no difference. Also tried reducing current and microstepping, no difference. Tried enabling automatic current reduction on overheating - no difference. And as I mentioned - no OT_PREWARN flag.

It seems like the E1_ENABLE pin just goes low for whatever reason. Bizarre.

p3p commented 5 years ago

There (probably) can't be a systemic issue in the enable control. If you check the voltage level of the pin controlling driver enable, it will likely be in the correct state regardless of what the driver reports as its internal state over SPI. If that is the case, perhaps the driver is not being set up correctly by the TMC library and is hitting some kind of default limit.

(If an enable pin is toggling at random for no reason I don't want to be the one trying to debug it)

vilsed commented 5 years ago

Catching the pin state at the exact moment the driver "shuts down" is a real head scratcher for me, since I don't have a storage oscilloscope or a logic analyzer of any sort and the shutdown happens so inconsistently. Will think about it, maybe I'll be able to cobble something together to plug into my computer's sound card and log with Audacity or something.

As a side note, when the driver becomes disabled, pausing a print and issuing M18 and sending some G1 E move does not enable it. It becomes enabled only after I cancel the print, and then start another one.

Another side note - I'm using FYSETC drivers. But as mentioned, swapping them around causes no change.

vilsed commented 5 years ago

I've just noticed that teemuatlut has added driver enabling over SPI. Will try it right away.

p3p commented 5 years ago

As a side note, when the driver becomes disabled, pausing a print and issuing M18 and sending some G1 E move does not enable it. It becomes enabled only after I cancel the print, and then start another one.

That makes sense. If the driver starts ignoring the enable pin because of an internal error, then the pin will need to be turned off and on again to clear it. A pause doesn't disable a driver, as it would lose micro-step position (probably more, depending on the motion system).

Checking the enable pin is in the correct state shouldn't be hard with just a multi-meter as you can just check anytime after it stops working before you cancel the print.

@teemuatlut is definitely the guy you should be talking to though about TMC drivers

teemuatlut commented 5 years ago

The TMC drivers can go into an error mode (or whatever it was called) for a number of reasons in order to protect themselves. One error state is the driver heating all the way up to overtemp, which comes at an even higher point than the overtemp prewarn. Other error states are short to ground and open connection. The latter does not trigger driver self-protection. Everything related to TMC2130 diagnostics is described in chapter 12 of the datasheet.

Of course, the other two reasons for the driver to stop responding are the EN pin going inactive or us writing the off time setting to zero. Both of these I would consider unlikely.

My guess is that the driver, for a valid reason or not, sees a short to ground and shuts down. Try switching your stepper motor. And the driver slot if possible.
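The reasons listed above can be summed up in a small sketch. This is not Marlin or TMCStepper code: `why_not_driving` and the two flag sets are hypothetical names, and the split between protective and informational flags simply follows the description in this comment (overtemp and short to ground shut the driver down, prewarn and open load are only reported).

```python
# Hypothetical helper, not part of Marlin or the TMC library.
# Flag names follow the TMC2130 DRV_STATUS register as printed by M122.
PROTECTIVE_FLAGS = {"ot", "s2ga", "s2gb"}      # driver shuts itself down
INFORMATIONAL_FLAGS = {"otpw", "ola", "olb"}   # reported, no self-protection


def why_not_driving(en_pin_active, toff, flags):
    """Return the list of reasons the driver would not be driving the motor."""
    reasons = []
    if not en_pin_active:
        reasons.append("EN pin inactive")
    if toff == 0:
        reasons.append("off time (toff) is zero")
    for flag in sorted(PROTECTIVE_FLAGS & set(flags)):
        reasons.append(f"protective flag {flag}")
    return reasons


# Open load on both coils alone gives no reason for a shutdown.
print(why_not_driving(True, 4, ["ola", "olb"]))
# A short to ground on coil A does.
print(why_not_driving(True, 4, ["s2ga"]))
```

An empty list while the motor is still not turning would point away from the driver's own protection and toward wiring or communication.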

vilsed commented 5 years ago

@teemuatlut Excellent point about open/short circuit on the motor side. The E motor leads are indeed somewhat dodgy, spliced in at least two places :) I'd much rather expect an open circuit out of them, but everything is possible, I guess. They're running through a 75°C chamber after all. Those will be the first thing I'll check. Will also inspect the RAMPS, although it has been working flawlessly for three years now. I think we can rule out weirdness on the physical ENABLE pin, since SOFTWARE_DRIVER_ENABLE is up and running, and I think I'll keep it that way.

gloomyandy commented 5 years ago

@vilsed when you switched to using SOFTWARE_DRIVER_ENABLE what did you do with the driver enable pin? My understanding is that it needs to be pulled to GND (so the driver is always enabled at the hardware level) for correct operation.

vilsed commented 5 years ago

what did you do with the driver enable pin?

I desoldered the pin that would have gone to the socket on the RAMPS and soldered a short jumper cable to one of the GND pins on the drivers themselves. Seemed like the easiest and fastest solution :)

img_20190119_140459

vilsed commented 5 years ago

I've disassembled and inspected the hardware. The E motor wires were indeed hanging by a thread at one splice point. Replaced the wires. Also disassembled the motor and inspected it. Can't see any obvious short circuits, but if there were any, they would probably be inside the coils, hence impossible to find. Inspected the RAMPS board. Nothing suspicious. No loose solder blobs or wire pieces that could short anything out. Also measured the Vmot filtering caps, everything checks out.

My best bet now would be the motor wires. Spinning the motor now (while the driver is not enabled) feels much different than it did before. But as @teemuatlut stated before, TMC drivers do not treat an open motor circuit as a fault condition that makes them disable themselves.

Anyhow, now running a 6+ hour print, we'll see what's what.

vilsed commented 5 years ago

Update on the situation.

Everything went great for about four and a half hours. Then, all of a sudden, the E driver went into spreadCycle mode (it was set to run in stealthChop). But curiously, it didn't stop. Well, it did, but not entirely. The extruder was still performing linear advance moves, but not advancing forward. I quickly connected to a host (it was an SD print), sent M122, and this is the report:

21:41:05.848 : X Y Z E
21:41:05.857 : Enabled true true true true
21:41:05.857 : Set current 800 800 800 400
21:41:05.862 : RMS current 795 795 795 397
21:41:05.866 : MAX current 1121 1121 1121 560
21:41:05.867 : Run current 25/31 25/31 25/31 12/31
21:41:05.867 : Hold current 12/31 12/31 12/31 6/31
21:41:05.871 : CS actual 25/31 25/31 25/31 0/31
21:41:05.875 : PWM scale 44 57 69 0
21:41:05.880 : vsense 1=.18 1=.18 1=.18 1=.18
21:41:05.884 : stealthChop true true true false
21:41:05.889 : msteps 32 32 32 16
21:41:05.893 : tstep 1865 336 167 1169
21:41:05.893 : pwm
21:41:05.893 : threshold 0 0 0 0
21:41:05.893 : [mm/s] - - - -
21:41:05.898 : OT prewarn false false false false
21:41:05.898 : OT prewarn has
21:41:05.898 : been triggered false false false false
21:41:05.902 : off time 4 4 4 4
21:41:05.907 : blank time 24 24 24 24
21:41:05.907 : hysteresis
21:41:05.911 : -end 2 2 2 2
21:41:05.917 : -start 1 1 1 1
21:41:05.917 : Stallguard thrs 0 0 0 0
21:41:05.917 : DRVSTATUS X Y Z E
21:41:05.920 : stallguard X
21:41:05.925 : sg_result 0 0 0 0
21:41:05.929 : fsactive
21:41:05.933 : stst
21:41:05.938 : olb X
21:41:05.943 : ola X X
21:41:05.947 : s2gb
21:41:05.951 : s2ga
21:41:05.956 : otpw
21:41:05.960 : ot
21:41:05.961 : Driver registers:
21:41:05.962 : X 0x00:19:00:00
21:41:05.963 : Y 0x00:19:00:00
21:41:05.964 : Z 0x20:19:00:00
21:41:05.965 : E 0x60:00:00:02
21:41:05.966 : Testing X connection... OK
21:41:05.968 : Testing Y connection... OK
21:41:05.969 : Testing Z connection... OK
21:41:05.970 : Testing E connection... OK
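The "Driver registers" values in this report can be decoded by hand. Below is a minimal sketch, assuming those values are raw 32-bit DRV_STATUS words and using the bit positions from the TMC2130 datasheet; `decode_drv_status` is a hypothetical helper, not something Marlin provides.

```python
# Bit positions of the single-bit flags in the TMC2130 DRV_STATUS register.
DRV_STATUS_FLAGS = {
    31: "stst", 30: "olb", 29: "ola", 28: "s2gb", 27: "s2ga",
    26: "otpw", 25: "ot", 24: "stallguard", 15: "fsactive",
}


def decode_drv_status(text):
    """Decode a register string such as '0x60:00:00:02' from the M122 report."""
    value = int(text.replace("0x", "").replace(":", ""), 16)
    flags = [name for bit, name in sorted(DRV_STATUS_FLAGS.items(), reverse=True)
             if (value >> bit) & 1]
    return {
        "flags": flags,
        "cs_actual": (value >> 16) & 0x1F,  # bits 20..16
        "sg_result": value & 0x3FF,         # bits 9..0
    }


for axis, raw in [("X", "0x00:19:00:00"), ("Z", "0x20:19:00:00"), ("E", "0x60:00:00:02")]:
    print(axis, decode_drv_status(raw))
```

Read this way, X shows no flags and a CS actual of 25, Z has ola set, and E has both olb and ola set with a CS actual of 0, which lines up with the "CS actual", "olb" and "ola" rows of the report.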

Then I navigated to the TMC control menu and noticed that stealthChop was still enabled for all axes. Disabled and re-enabled it for E, and it did in fact go into stealthChop, but still just waggled back and forth slightly. Gave the extruder some help by hand, and it pushed through. It did start extruding, but it seemed to have lost most of its torque, i.e. it was skipping steps on unretractions and was very easy to stop by hand. Sent M122 again, and this is the log:

21:44:37.517 : X Y Z E
21:44:37.526 : Enabled true true true true
21:44:37.527 : Set current 800 800 800 400
21:44:37.530 : RMS current 795 795 795 397
21:44:37.535 : MAX current 1121 1121 1121 560
21:44:37.535 : Run current 25/31 25/31 25/31 12/31
21:44:37.536 : Hold current 12/31 12/31 12/31 6/31
21:44:37.539 : CS actual 25/31 25/31 25/31 0/31
21:44:37.544 : PWM scale 45 49 47 52
21:44:37.548 : vsense 1=.18 1=.18 1=.18 1=.18
21:44:37.553 : stealthChop true true true true
21:44:37.557 : msteps 32 32 32 16
21:44:37.562 : tstep 11081 230 231 277
21:44:37.562 : pwm
21:44:37.562 : threshold 0 0 0 0
21:44:37.562 : [mm/s] - - - -
21:44:37.566 : OT prewarn false false false false
21:44:37.567 : OT prewarn has
21:44:37.567 : been triggered false false false false
21:44:37.570 : off time 4 4 4 4
21:44:37.574 : blank time 24 24 24 24
21:44:37.575 : hysteresis
21:44:37.580 : -end 2 2 2 2
21:44:37.584 : -start 1 1 1 1
21:44:37.584 : Stallguard thrs 0 0 0 0
21:44:37.584 : DRVSTATUS X Y Z E
21:44:37.588 : stallguard
21:44:37.593 : sg_result 0 0 0 0
21:44:37.597 : fsactive
21:44:37.602 : stst
21:44:37.606 : olb
21:44:37.610 : ola
21:44:37.616 : s2gb
21:44:37.619 : s2ga
21:44:37.624 : otpw
21:44:37.628 : ot
21:44:37.629 : Driver registers:
21:44:37.630 : X 0x00:19:00:00
21:44:37.631 : Y 0x20:19:00:00
21:44:37.632 : Z 0x00:19:00:00
21:44:37.633 : E 0x00:00:00:00 Bad response!
21:44:37.634 : Testing X connection... OK
21:44:37.635 : Testing Y connection... OK
21:44:37.637 : Testing Z connection... OK
21:44:37.675 : Testing E connection... Error: All LOW

What the heck? I mean, I have MONITOR_DRIVER_STATUS enabled, which would explain the torque loss if the driver was in fact overheating, but I also have REPORT_CURRENT_CHANGE on. And nothing was being reported. Is the driver dying? If it is, why does swapping it with another make no difference?..

Will try swapping everything over to the E1 slot on the RAMPS. It's too bad that I currently have no other motor to swap around.

teemuatlut commented 5 years ago

What the heck?

What's the confusing part?

that would explain the torque loss if the driver was in fact overheating

The driver overheating would result in it temporarily disabling itself (or maybe until GSTAT was read...), not loss of torque.

I also have REPORT_CURRENT_CHANGE on. And nothing was being reported.

The current settings are not changed. Nothing to report.

The more interesting issue might be why your communication to the E driver works only sometimes. Try checking your wiring with a multimeter.

I haven't personally used Linear Advance with my printers but the one time I did try it, the drivers didn't like it. This was quite a long time ago however and there have been significant changes since, but perhaps try disabling this feature. Very unfortunate that your problems typically manifest after many hours of printing...

EDIT: It might also help narrow down the problem if you'd be able to determine if your problem follows the driver or the slot in the motherboard or the stepper motor (wiring included).
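To illustrate what an "All LOW" connection result means, here is a rough sketch of a plausibility check on a value read back over SPI. It is an assumption about the general idea (a dead or disconnected bus tends to read as all zeros or all ones), not a copy of the actual check in Marlin or the TMC library; `check_response` and `parse_register` are hypothetical names.

```python
def parse_register(text):
    """Turn a register string such as '0x00:19:00:00' into an integer."""
    return int(text.replace("0x", "").replace(":", ""), 16)


def check_response(word):
    """Classify a 32-bit value read back from the driver."""
    if word == 0x00000000:
        return "all low"    # e.g. data line stuck low or driver not answering
    if word == 0xFFFFFFFF:
        return "all high"   # e.g. data line floating high or driver unpowered
    return "ok"


# E register from the first and the second report in this thread.
print(check_response(parse_register("0x60:00:00:02")))
print(check_response(parse_register("0x00:00:00:00")))
```

A driver that answers sensibly in one report and reads back as all zeros a few minutes later is why the wiring and the slot are worth checking before the driver itself.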

vilsed commented 5 years ago

Try checking your wiring with a multimeter

Already did that while inspecting the RAMPS. Continuity everywhere it should be. Although, now I remember that yesterday, one time after booting (probably after compiling and uploading code with the added TMC menu), the machine immediately threw "TMC communication error" or something along those lines. I did not investigate at that moment, just rebooted and it booted fine. I will redo the SPI loom anyway, just to be sure.

the one time I did try it, the drivers didn't like it

Using it ever since that machine got Re-ARM and TMC drivers. Never had any issues. Until now, maybe? Will try disabling it.

As a side note, I'm using a modified Gregs style extruder with 2:1 belt reduction and slightly mismatched motor. Achieving high accelerations and speeds with that extruder always was a struggle, but ever since I've switched over to 24V and @teemuatlut added preconfigured chopper settings, it's been pretty good. Tried playing around with those settings myself before, with no great luck.

Very unfortunate that your problems typically manifest after many hours of printing...

Yeah, wasting expensive filaments... Oh well :)

Switched the driver to another slot, running a print again. We'll see tomorrow.

paulluby commented 5 years ago

Hi Guys

Thought I'd join as I'm now using Marlin 2.0 on Re-Arm boards and thought I might help by providing feedback.

I have had a similar issue on a large 3D printer of my own design.

Print capacity is 410 x 410 x 610 mm, using Ramps 1.6, Arduino Mega, a full graphics display and Marlin 1.1.8 and 1.1.9 on it for a year now. 4988 drivers on the dual Z steppers and the extruder stepper, 2208 drivers on the X and Y steppers. All with no problems at all, and I leave it printing for days; the last print was 6 days 10 hours and 6 minutes.

I placed a Re-Arm board that I'd tested on a smaller printer on the larger printer, started it off on a 12 hour print and all was well at about the 5 hour point. Went into my workshop at the 8 hour point and all the steppers were disabled: X, Y, dual Z and extruder all disabled.

Extruder temp was still up at 205 DegC and bed temp was at 60 DegC as I'd specified in Simplify 3d and both were being controlled at those temps. I could still rotate through the menus on the display, just the steppers were disabled.

Put the Arduino Mega complete with Marlin 1.1.9 back on the large printer and 12 hour print completed ok.

Put the Re-Arm board back into the smaller printer and it's been doing 4 hour prints quite happily.

Hope this doesn't confuse the issue but can't find anything else about randomly disabling of steppers.

Cheers for any thoughts.

vilsed commented 5 years ago

UPDATE: Two 8h+ prints went by without any problems. Not regarding the E driver weirdness, at least. So I'm concluding that the issue was RAMPS E0 slot shorting out somehow. Three years is the limit for cheap chinese RAMPS boards, time for a refresh I suppose :)

@paulluby Regarding your issue. Late summer I was running a continuous 140h print on this machine of mine, completed without any issues. Although it might be a bit irrelevant with my latest adventures.

Since, as you say, all the drivers were disabled, and they are of different types, I'm not sure what the problem could be. It must be something about your large printer's hardware, since the Re-ARM works fine in the small printer, right? Is it equipped with the same drivers while there? Have you tried long prints on the small printer?

paulluby commented 5 years ago

@vilsed

Yep same driver types on smaller printer and so far printing great, just started a 5 hour print on the smaller one, will see how it goes.

All my printers use 2208 drivers on X and Y with 0.9 Deg motors, 4988 drivers on Z (or Z's) and E (or E's) with 1.8 Deg motors. It's a combination that for me is proven.

Have got a couple more Re-Arm boards on order, will try another. Strange how it is now okay with the Arduino Mega fitted though.

Should have also said I use my 2208 drivers in standalone mode.

last edit - Just spent 30 mins removing Re-Arm from smaller printer and putting it in the larger printer. Doing the same 12 hour print as a test. Will see what happens.

Picture of larger printer below, or as I call it "The Beast"

mil fal

paulluby commented 5 years ago

@vilsed

Refitted the Re-Arm board to the large printer as I said, and the 12 hour print has completed fine.

Strange.

vilsed commented 5 years ago

I'm noticing you're using Mk2A beds. Are you sure that your power supply is powerful enough and of decent quality? Re-ARM is weak to brown-outs, as I have experienced.

paulluby commented 5 years ago

@vilsed

Yep got more than enough power.

The boards, steppers and extruder are supplied by a 12V 20 Amp Supply.

A MOSFET for each bed is controlled by Marlin/RAMPS as per normal.

The bed power is a little different.

The four beds each have a temp sensor that goes to an Arduino Nano which checks each temp sensor against the others for a "difference" in temperature.

The actual temperature is fairly irrelevant it's the "difference" that is the key.

The beds are connected to the 12V outputs of a 70 Amp PC PSU.

The Nano controls the PS-ON pin on the PC PSU and if the temp "difference" between the beds goes out of limits then the Nano turns off the PC PSU output.

Remember this "difference" can be negative if one of the beds MOSFETs fails open circuit or positive if the bed MOSFET fails short circuit.

Once the Nano turns off the power to the beds and the beds cool, the bed temp sensor to the RAMPS board, which is under the number 1 bed, senses the cooling bed and stops the print as per normal.

It's the best multi-bed safety system I could think of and so far it's performed flawlessly.
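The watchdog logic described above boils down to a spread check across the bed sensors. A minimal sketch, assuming a hypothetical limit of 10 degrees (the real limit is not stated in this thread) and hypothetical function names; the actual system runs on an Arduino Nano, so this only illustrates the decision, not the firmware.

```python
def psu_should_stay_on(bed_temps_c, max_spread_c=10.0):
    """Return True while all bed temperatures stay within max_spread_c of each other.

    The absolute temperature is ignored on purpose; only the difference between
    the hottest and the coldest bed matters. max_spread_c is an assumed value.
    """
    spread = max(bed_temps_c) - min(bed_temps_c)
    return spread <= max_spread_c


# Normal operation: all four beds close together.
print(psu_should_stay_on([60.2, 59.8, 60.5, 59.9]))
# One MOSFET failed open: that bed falls behind the others.
print(psu_should_stay_on([60.1, 60.3, 41.0, 59.7]))
# One MOSFET failed short: that bed runs away.
print(psu_should_stay_on([60.1, 85.4, 60.0, 59.7]))
```

Using max minus min covers both the "negative" and the "positive" difference with a single comparison.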

vilsed commented 5 years ago

So unnecessarily complicated... Every component in the system reduces overall reliability, and just because you haven't had any problems yet, it does not mean that you will not have any in the future. Of course it does not necessarily mean that you will, but every MOSFET (or a relay, or an SSR, doesn't matter) has a chance of failing. With four of them, where any single failure means a failed system overall, the whole system is roughly 4x more likely to fail. Why not use a mains voltage heater mat? Simple, well made, much more efficient (converting mains to 12V or 24V just for heating something up is nonsense in my book). Been using systems like that for years, haven't had any problems. Like in this delta machine of mine. The chamber heater consists of two 200W PTC heating elements. I could have gone with 24V ones, but then I would have needed to increase my PSU size by 400-500W, in addition to the 200W needed for the printer itself. I would have had to source, or more likely make, a quite beefy MOSFET board. I would have had to route much thicker cables through the machine. But no, I used mains voltage heaters. Simple, self regulating (if the active control goes down), a simple SSR, no need for a bigger, noisier, much more expensive PSU. It works, in function, just the same, just a bit more reliably.
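The "4x" figure can be checked with a short calculation. This is a sketch under the assumption that the four switching elements fail independently, each with the same probability p over some period; the value of p below is made up purely for illustration.

```python
def system_failure_probability(p_single, n_components):
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_single) ** n_components


p = 0.01  # assumed per-component failure probability, illustration only
p_system = system_failure_probability(p, 4)
print(p_system)      # close to 0.0394
print(p_system / p)  # close to 3.94, i.e. just under 4x
```

For small p the result is very nearly 4p, so "4x more likely" is a fair rule of thumb, slightly on the pessimistic side.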

But all this is beside the point. I think your problem was something along the lines of "it happened just to mess with you". Maybe a mains brown-out, maybe a communication error with the host or SD, maybe lightning two miles away. I think it's unrelated to the issue I was having.

paulluby commented 5 years ago

@vilsed

Yeah, reckon my issue is different to yours and appears to have sorted itself out.

Regarding my multiple heater beds, I'm not a fan of mains voltage mats laid on metal beds. But more importantly I had 10 of the 2A beds in stock as I got them cheap. Also had loads of Arduino Nanos, MOSFETs and cable in stock, along with a few high power PC PSUs.

So it was a case of what can I do with what I already had.

That's the beauty of hobbies like this (I also design and build large RC model aircraft) we all come up with different ways to achieve the same result.

vilsed commented 5 years ago

we all come up with different ways to achieve the same result

True.

UPDATE on my situation. Not fixed. After a combined 50+ h of printing time the E driver again went into spreadCycle and died. Only this time stopping and restarting a print didn't help. Had to reboot the whole machine. Will redo the SPI loom next.

vilsed commented 5 years ago

Hello, good people of GitHubland.

Update. Redid the SPI loom twice. Once by soldering the connectors (as I did before and have been doing since time immemorial), then using a proper crimping tool for the connector type I'm using. No difference. Tried measuring the motor for shorts while heating it up, banging and shaking it. No problems there. Was also suspecting the power supply might be acting up. Borrowed a logging oscilloscope, set up a trigger circuit to activate the logging on a 24V rail fluctuation of more than 0.2V - nope, no correlation. Besides, if there was - why is only the E affected? I suspect I will have to plop in an A4988 driver for the E axis, I'm running out of ideas, options and patience...
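The correlation check described here can also be done offline on logged samples. A minimal sketch, assuming the log is a list of (time in seconds, volts) pairs; the function names, the sample data and the 5 second window are all hypothetical.

```python
def rail_excursions(samples, nominal=24.0, threshold=0.2):
    """Return the (time, volts) samples that deviate from nominal by more than threshold."""
    return [(t, v) for t, v in samples if abs(v - nominal) > threshold]


def correlates(excursions, fault_time, window_s=5.0):
    """True if any excursion happened within window_s seconds of the driver fault."""
    return any(abs(t - fault_time) <= window_s for t, _ in excursions)


# Made-up log: two excursions, at t=2 s and t=3 s.
log = [(0.0, 24.0), (1.0, 24.1), (2.0, 23.7), (3.0, 24.25)]
events = rail_excursions(log)
print(events)
print(correlates(events, fault_time=120.0))  # fault long after the excursions
```

If the driver faults never fall near an excursion, the supply rail is an unlikely culprit, which matches the "no correlation" result above.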

vilsed commented 5 years ago

Hello all.

Interesting observation - if I force the E driver into spreadCycle mode at the beginning of a print - it does not crap out. At least it didn't during four test prints I pushed through so far. The issue is getting more and more unusual.

EDIT: Switching the driver to spreadCycle fixing the problem was a coincidence. Changed the E driver to an A4988, and it's working fine so far. Will try changing the Chinese RAMPS board to a higher quality RAMPS 1.4.2 or something like that, and maybe also a higher power PSU and better designed cooling for the whole electronics compartment. Closing the issue for now.

MrStump commented 4 years ago

AAAAND again a similar issue... CORE_XY, Arduino DUE + RAMPS (BOARD_RAMPS4DUE_EFB) with TMC5160 in SPI mode on XY. All was fine for more than a year, and a few weeks ago (in FEB), when I changed the XY MICROSTEPS for some reason and reflashed the DUE, the Y motor started randomly stopping. At first it was on Marlin release 2.0.(something), and I have now tried 2.0.5.3. I replaced the wires with new ones, replaced the DUE with a new one, even changed the drivers to TMC2130 SPI - all with no luck. The Y motor just stops after 30-120 min. No temp warning, it's just "disabled". M122 shows me ENABLED X:true Y:false, StealthChop X:true Y:false and, weirdest of all, it resets msteps from the default 16 to X:16 Y:256. I tried Y_MICROSTEPS in firmware from 4 to 256, but with no change in the situation, it just stops after some time. I have no idea what could be happening. Here is the LOG of M122 before and after the issue: M122

github-actions[bot] commented 4 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

MarlinFirmware / Marlin

[bugfix-2.0.x] TMC2130 Re-ARM E stepper disables randomly #12944
