MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0

[BUG] TMC2130 failing on SKR1.3 during motor movement [voltage spikes] #14522

Closed phizz166 closed 5 years ago

phizz166 commented 5 years ago

Description

X-axis TMC2130 driver randomly disconnects/fails in the middle of the leveling routine. The bug is always on the X-axis, even after swapping driver boards around.

Steps to Reproduce

  1. SKR1.3 board with TMC2130s on all axes, using bugfix-2.0.x 0ca64a0 (as referenced in #14478), BLTouch, 12864 full graphic LCD (RepRapDiscount)
  2. Run G28, then G29 P1 to generate new bed mesh

Expected behavior: The bed leveling routine should complete normally.

Actual behavior: About 2/3 of the time the leveling completes normally. 1/3 of the time something happens to the X-axis at a random point in the routine and the motor ceases to turn. The routine will keep running until it is finished, but only the Y and Z axes move. (So e.g. a diagonal move to the next point becomes a Y-axis move; a move along the X-axis gives no motion). The X motor will not re-enable without resetting the board. If the routine works correctly the first time, running it again a few times will usually generate the failure.

Results of M122 before the failure, e.g. just after startup:

    Send: M122
                        X       Y       Z       E
    Enabled             true    true    true    true
    Set current         570     570     570     570
    RMS current         550     550     550     550
    MAX current         776     776     776     776
    Run current         17/31   17/31   17/31   17/31
    Hold current        8/31    8/31    8/31    8/31
    CS actual           8/31    8/31    8/31    14/31
    PWM scale           19      19      0       0
    vsense              1=.18   1=.18   1=.18   1=.18
    stealthChop         true    true    false   false
    msteps              16      16      16      16
    tstep               max     max     max     max
    pwm
    threshold
    [mm/s]
    OT prewarn          false   false   false   false
    OT prewarn has
    been triggered      false   false   false   false
    off time            4       4       4       4
    blank time          24      24      24      24
    hysteresis
    -end                2       2       2       2
    -start              1       1       1       1
    Stallguard thrs     3       3       0       0
    DRVSTATUS           X       Y       Z       E
    stallguard
    sg_result           0       0       83      0
    fsactive
    stst
    olb
    ola
    s2gb
    s2ga
    otpw
    ot
    Driver registers:
    X 0x80:08:00:00
    Y 0x80:08:00:00
    Z 0x81:08:00:54
    E 0xA0:0E:00:00

    Testing X connection... OK
    Testing Y connection... OK
    Testing Z connection... OK
    Testing E connection... OK
    ok

Results immediately after the failure:

    Send: M122
                        X       Y       Z       E
    Enabled             false   true    true    false
    Set current         570     570     570     570
    RMS current         994     550     550     550
    MAX current         1402    776     776     776
    Run current         17/31   17/31   17/31   17/31
    Hold current        8/31    8/31    8/31    8/31
    CS actual           0/31    8/31    8/31    8/31
    PWM scale           0       20      0       22
    vsense              0=.325  1=.18   1=.18   1=.18
    stealthChop         false   true    false   false
    msteps              256     16      16      16
    tstep               max     max     max     max
    pwm
    threshold
    [mm/s]
    OT prewarn          false   false   false   false
    OT prewarn has
    been triggered      false   false   false   false
    off time            0       4       4       4
    blank time          16      24      24      24
    hysteresis
    -end                -3      2       2       2
    -start              1       1       1       1
    Stallguard thrs     3       3       0       0
    DRVSTATUS           X       Y       Z       E
    stallguard
    sg_result           0       0       68      162
    fsactive
    stst
    olb
    ola *
    s2gb
    s2ga
    otpw
    ot
    Driver registers:
    X 0xE0:00:00:00
    Y 0x80:08:00:00
    Z 0x80:08:00:4D
    E 0x81:08:00:A2

    Testing X connection... OK
    Testing Y connection... OK
    Testing Z connection... OK
    Testing E connection... OK
    ok

Additional Information

I have tried switching the driver boards around and the problem is always on the X-axis, regardless of which board is in that slot. The overtemperature flags are not being triggered. The axis is not binding. The axis is running in stealthChop mode. The axis is using sensorless homing with sensitivity of 3. Current is set to a rather low value (570mA).

I notice that after the failure, the following settings have changed: RMS current, MAX current, CS actual, PWM scale, vsense, stealthChop, msteps, off time, hysteresis -end, and the X driver registers.
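This comparison can also be done mechanically. A minimal Python sketch, using only the X column of the two M122 reports quoted above; the dictionary layout and the `changed_rows` helper are illustrative and not part of Marlin:

```python
# X-axis column of the two M122 reports quoted above (values copied as text).
x_before = {
    "Enabled": "true", "Set current": "570", "RMS current": "550",
    "MAX current": "776", "Run current": "17/31", "Hold current": "8/31",
    "CS actual": "8/31", "PWM scale": "19", "vsense": "1=.18",
    "stealthChop": "true", "msteps": "16", "off time": "4",
    "blank time": "24", "hysteresis -end": "2", "hysteresis -start": "1",
    "Stallguard thrs": "3", "Driver registers": "0x80:08:00:00",
}
x_after = {
    "Enabled": "false", "Set current": "570", "RMS current": "994",
    "MAX current": "1402", "Run current": "17/31", "Hold current": "8/31",
    "CS actual": "0/31", "PWM scale": "0", "vsense": "0=.325",
    "stealthChop": "false", "msteps": "256", "off time": "0",
    "blank time": "16", "hysteresis -end": "-3", "hysteresis -start": "1",
    "Stallguard thrs": "3", "Driver registers": "0xE0:00:00:00",
}

def changed_rows(before, after):
    """Return {row: (old, new)} for every row whose value differs."""
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}

for row, (old, new) in changed_rows(x_before, x_after).items():
    print(f"{row:18} {old:>14} -> {new}")
```

Run against these two dumps, the comparison also flags `Enabled` and `blank time`, which differ for X but are easy to miss by eye.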

It's like the board is getting reset to some kind of default settings somehow, but I don't know enough about these drivers to figure out why.
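One way to probe the "reset to defaults" idea: off time, blank time, hysteresis, vsense and msteps are all fields of a single TMC2130 register, CHOPCONF. The Python sketch below decodes that register, assuming the bit layout given in the Trinamic datasheet and the display conversions M122 appears to apply (hysteresis end minus 3, start plus 1, microsteps as 256 shifted by MRES). Both are assumptions to verify against the datasheet; the helper names are made up for illustration.

```python
# Assumed TMC2130 CHOPCONF bit layout: (shift, width) per field.
FIELDS = {"toff": (0, 4), "hstrt": (4, 3), "hend": (7, 4),
          "tbl": (15, 2), "vsense": (17, 1), "mres": (24, 4)}
BLANK_TIME = (16, 24, 36, 54)  # clock cycles selected by tbl

def pack(**values):
    """Build a CHOPCONF word from named field values."""
    word = 0
    for name, value in values.items():
        shift, width = FIELDS[name]
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << shift
    return word

def decode(word):
    """Translate a CHOPCONF word into the rows M122 prints."""
    f = {n: (word >> s) & ((1 << w) - 1) for n, (s, w) in FIELDS.items()}
    return {
        "off time": f["toff"],
        "blank time": BLANK_TIME[f["tbl"]],
        "hysteresis -end": f["hend"] - 3,
        "hysteresis -start": f["hstrt"] + 1,
        "vsense": f["vsense"],
        "msteps": 256 >> f["mres"],
    }

# Field values chosen to reproduce the "before" dump for X.
before = pack(toff=4, hstrt=0, hend=5, tbl=1, vsense=1, mres=4)
print(hex(before), decode(before))
print(hex(0), decode(0))  # an all-zero register
```

Under these assumptions an all-zero CHOPCONF decodes to off time 0, blank time 16, hysteresis -3/1, vsense 0 and 256 microsteps, which is what the X column shows after the failure. A zero off time also switches the driver stage off, which would fit the motor stopping. The sketch cannot say whether the zero came from a corrupted write or from the chip itself resetting.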

gloomyandy commented 5 years ago

Try moving the driver board (and X motor cable) over to the unused E1 slot and changing the pins file so that X uses the pins defined for E1. If the problem is fixed then there may be some sort of problem with your control board (bad connection, broken track etc.).

phizz166 commented 5 years ago

Just did that and got the same behavior. Changed the pin definitions, moved the driver board, moved the motor cable, reflashed, ran a single G29 and the x axis (now running on E1 channel) still died just at the end of the process.

Edit: not connected to UBL, see below.

For what it's worth, I have it set to a 3x3 grid, and it probes the points in the following order:

    9 5 6
    4 1 2
    8 3 7

Tested it three times just now and it died each time -- once on the move from 7 to 3, twice on the move from 9 back to 1.

The time that the axis quits moving is not completely predictable, but it always fails in the same way -- same 1402mA current value, same 0xE0... register setting, etc. The motor is not damaged and both phases read 2.9 ohms.

What does it mean that the MAX current is 1402mA but the set point is 570mA? I feel like that might indicate something.
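A possible answer, sketched in Python. The RMS row in M122 appears to be derived from the current-scale value (the 17/31 run current) and the vsense bit rather than read back from the set point. The formula below is the one the TMCStepper library is understood to use. The 0.11 ohm sense resistor is an assumption (a common value on TMC2130 stepsticks), and the 1.41 peak factor is inferred from the ratio of the MAX and RMS rows in the dumps.

```python
def rms_current_ma(cs, vsense, r_sense=0.11):
    """RMS motor current in mA for current scale `cs` (0..31).

    vsense=1 selects the 0.180 V full-scale sense voltage, vsense=0 the
    0.325 V one. r_sense is an assumed value; the extra 0.02 ohm models
    resistance inside the driver.
    """
    v_fs = 0.180 if vsense else 0.325
    return (cs + 1) / 32 * v_fs / (r_sense + 0.02) / 1.41421 * 1000

for label, vsense in (("before", 1), ("after", 0)):
    rms = int(rms_current_ma(17, vsense))   # run current stays at 17/31
    print(f"{label}: RMS {rms} mA, MAX about {rms * 1.41:.0f} mA")
```

With the run current held at 17/31, the formula gives 550 mA for vsense=1 and 994 mA for vsense=0, matching the RMS rows in both dumps, and multiplying by 1.41 lands on the 776 and 1402 MAX rows to within rounding. Read this way, the 1402 mA figure may simply be what the calculation reports once vsense reads 0, not a measurement of current actually flowing through the motor.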

EDIT: it's not just during bed leveling. I sent a bunch of diagonal movement commands over and over:

    G1 X0 Y0 F8000
    G1 X200 Y200 F8000

and after maybe half a dozen of those moves the X-axis died again in the exact same way.
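To make this test repeatable, the move pair can be generated instead of typed. A minimal Python sketch; the `diagonal_stress` helper and its defaults are illustrative, and the coordinates assume a bed of at least 200 x 200 mm as in the commands above:

```python
def diagonal_stress(repeats, size=200, feedrate=8000):
    """Return G-code lines that shuttle the head along the bed diagonal."""
    lines = []
    for _ in range(repeats):
        lines.append(f"G1 X0 Y0 F{feedrate}")
        lines.append(f"G1 X{size} Y{size} F{feedrate}")
    return lines

print("\n".join(diagonal_stress(6)))
```

Sending M122 after each batch shows at which point the X driver drops out.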

Murray-Lindeblom commented 5 years ago

Using build 8916acf I get the following: config-current.tar.gz

    g28
    SENDING:G28
    echo:busy: processing
    echo:busy: processing
    echo:busy: processing
    echo:busy: processing
    g29 j
    SENDING:G29 J
    Tilting mesh (1/3)
    echo:busy: processing
    echo:busy: processing
    echo:busy: processing
    echo:busy: processing
    Tilting mesh (2/3)
    ?Error probing point. Aborting operation.

phizz166 commented 5 years ago

Okay, I think I have found the problem, if not a solution yet.

I guessed that the only way the TMC2130 settings could be getting reset like that would be bad data on the SPI lines -- whether caused by noise or by a software bug -- so I hooked the X driver up to a scope and started watching the SPI lines. I have found that when the axes are moving, there are periodic huge voltage transients on the lines. They last less than 400ns but spike up to 60Vpp. It's not clear whether the spike is on all the lines simultaneously, or if one is inducing it into the others.

[Oscilloscope captures: DS1Z_QuickPrint1, DS1Z_QuickPrint2]

Yikes.

In any case, X_CS is going low long enough for the junk on the MOSI line to be interpreted as a command and screw up that driver, apparently. I noticed an increase in reliability when I turned off TMC_DEBUG and MONITOR_DRIVER_STATUS, probably because the chance of a spike aligning with other data on the SPI line (query/response from the drivers) was lowered -- but that hasn't solved it completely. Now it only fails about 20% of the time.
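Why stray bits can act as a command: as the TMC2130 datasheet describes it, each SPI transfer is a 40-bit datagram made of an address byte (top bit set means write) followed by 32 data bits, with no checksum. The Python sketch below parses such a frame. The example bytes are made up to illustrate the point and were not captured from the scope; the CHOPCONF address 0x6C is taken from the datasheet and should be treated as an assumption here.

```python
CHOPCONF = 0x6C  # register address per the TMC2130 datasheet (assumption)

def parse_datagram(frame):
    """Split a 5-byte TMC2130 SPI frame into (is_write, address, data)."""
    if len(frame) != 5:
        raise ValueError("expected 40 bits (5 bytes)")
    is_write = bool(frame[0] & 0x80)
    address = frame[0] & 0x7F
    data = int.from_bytes(frame[1:], "big")
    return is_write, address, data

# Hypothetical noise-shaped frame: write bit set, CHOPCONF address, zeros.
noise = bytes([0x80 | CHOPCONF, 0x00, 0x00, 0x00, 0x00])
print(parse_datagram(noise))
```

With nothing in the frame to validate it, a burst that toggles the clock and data lines while chip select is low could be accepted as a register write.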

Where might these spikes be coming from? They still appear to be random, but only happen when the motors are moving.

phizz166 commented 5 years ago

I have done a little more digging and it's an even deeper problem. The entire power system of the printer swings dramatically when one of these spikes happens. The dark blue trace is the main 24 V power rail, directly off the power supply. There's nearly a 75 V swing and the whole event lasts less than 500 nanoseconds. I tried a second power supply, thinking mine might just be faulty, but the same thing happens with the other one.

[Oscilloscope capture: DS1Z_QuickPrint8]

What could cause this sort of behavior? Surely these spikes are not normal? And why is it pseudo-random?

klcjr89 commented 5 years ago

@phizz166 You may be on to something here that could potentially also solve the SPI-related issues with the full graphic LCD bugging out, as I reported a while ago in my tmc5160 thread:

https://github.com/MarlinFirmware/Marlin/issues/13544

AnHardt commented 5 years ago

Pseudo-random means not necessarily Marlin-related. If you feel fit for that job, measure the primary side of the PS. I've seen this kind of spike caused by heaters, air conditioners, fridges, compressors, tube lamps, saws, ... - somewhere in the house.

ghost commented 5 years ago

Brand new SKR v1.3 TMC2130 v3. I’m having the same issue as described above. I don’t own a scope. I’m not that fancy. Has this been worked out yet?

phizz166 commented 5 years ago

After thinking about it for a while, the shape and timescale of the spikes look like what you'd see from an electric spark. I noticed that the SKR 1.3 has pretty crappy headers for the stepper motors (they are loose and shallow). I'm now wondering if one or more motor cables are briefly disconnecting from the board as the printer moves around, and the spike is from a spark as the connection opens. The motor coils have stored magnetic energy, and suddenly opening the circuit will convert it into electricity, like how an old-fashioned automotive ignition system works. I noticed that sending the head around at high speeds (F10000) gives many more spikes than driving it slowly (F2000), which might be consistent with cable motion causing the error.

I'm going to take another look with the scope in a day or two and see if I can cause the spikes by wiggling the cables while the head is moving slowly. In the meantime you might try securing all your cables as tightly as possible and even hot-gluing them in place.

These are JST-XH headers, right? I can never keep the types straight. If this turns out to be the problem I guess I'll switch all my motors to this style of plug before gluing them in place.

[Image: photo of the plug style mentioned above]

klcjr89 commented 5 years ago

@phizz166 What you're observing with higher speeds (and all of these spikes) may be due to back EMF generated by the motors? I highly doubt the connectors are loosening.

AnHardt commented 5 years ago

Also thought about back EMF - but that's unlikely. The difference in current between the micro-steps is low. Handling that is the driver's job. Switching from 'normal' current to 'standby' current or 'disabled' should not occur while stepping. Maybe when the 'enable pin' randomly changes state. But then we are likely back to loose connectors.

phizz166 commented 5 years ago

I think I have solved the problem! It turned out to be caused by static electricity. The solution was to run a ground wire from the extruder carriage and the X-axis rail down to the printer frame (which is earthed). If you are experiencing problems like what I've described, use a multimeter to check if the moving parts of your printer are grounded, and if they aren't, ground them.

Details

At that point I felt like a wizard was placing a curse on me. I'd removed every single variable and it appeared that just the movement of the carriage back and forth, even with nothing on it, was causing the spikes. I wondered if it even mattered if the carriage was being driven by the motor, or if it was literally just the motion? So I grabbed the carriage (still disconnected from the belts) and whipped it back and forth a few times by hand while the motor was spinning in place. Suddenly I got another spike.

I tried that several times and was able to reliably create the voltage spikes by hand, just by zipping the disconnected carriage back and forth on the axis. Just the metal plate on wheels going back and forth -- no motors were turning and no part of the printer's electrical system was involved. That blew my mind at first, but after puzzling about it for a bit the only conclusion that made sense was that the carriage was building up a static charge from its back-and-forth motion, and it was being periodically discharged into the machine's electrical system. The carriage runs on polycarbonate wheels around an anodized aluminum extrusion. The triboelectric effect doesn't work on conductors, but the aluminum oxide layer on an anodized part is an insulator. Could polycarbonate rolling on Al2O3 have this effect?

A static effect explains some of the other things I'd observed well, too. If you move the axis consistently, you should get fairly consistent spikes as the charge builds up at a stable rate, but it won't be totally exact -- which is what I saw. At higher speeds you build charge more quickly, so the spikes happen faster. If you go very slowly, the charge might leak away more quickly than it can build up to jump the gap, so you never see a spike. I saw both of those phenomena too.
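The build-up-and-discharge reasoning above can be illustrated with a toy model. This Python sketch is not a physical simulation of the printer: every parameter (charging rate, leak time constant, breakdown level) is made up, in arbitrary units, and the only claim is the qualitative behavior.

```python
def count_discharges(speed, duration=30.0, dt=0.001,
                     charge_rate=1.0, leak_tau=2.0, breakdown=1.0):
    """Count spark events for a carriage moving at constant `speed`.

    Charge grows at charge_rate * speed, leaks away with time constant
    leak_tau, and is dumped to zero when it reaches `breakdown`.
    All parameters are hypothetical and in arbitrary units.
    """
    charge = 0.0
    sparks = 0
    for _ in range(int(duration / dt)):
        charge += (charge_rate * speed - charge / leak_tau) * dt
        if charge >= breakdown:
            sparks += 1
            charge = 0.0
    return sparks

for speed in (0.3, 1.0, 5.0):
    print(f"speed {speed}: {count_discharges(speed)} discharges")
```

In this model the slowest speed never reaches the breakdown level because leakage balances charging first, while faster motion produces discharges at a higher rate. The model is deterministic, so it does not reproduce the slight irregularity in spike timing described above.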

Obviously if the problem was related to static charge, grounding everything involved should be the solution. I checked the extruder carriage and the X axis rail and found that neither of them had continuity to ground (which makes sense, since they're isolated from the frame by their plastic wheels). So I ran a grounding wire from each of those parts to the frame, fired it up again, and it works as it should. I haven't run a full print yet, but I have had the head moving around in various patterns for half an hour continuously at maximum speed and haven't seen a single spike yet, where beforehand I would see one after less than 5 seconds. The automatic bed leveling pattern -- the initial problem -- also works correctly every time. Problem solved!

Conclusion

So it turned out to not be a problem with Marlin at all. It was static electricity building up from the motion of the x carriage, getting discharged into the printer's electrical system, and wreaking havoc with the TMC2130s because of their sensitive communication lines. For all I know the printer has been making these little static pops for its entire life, but they never caused a problem until I installed electronics that are sensitive to the noise.

I've noticed there are a lot of recent posts about issues involving the SKR 1.3 and TMC drivers. If these users are coming from an AVR board and A4988s, as I was, their issues might also be due to improper grounding that didn't cause problems before. It might be worth adding a note about this in the comments for the TMC configuration section of Configuration_adv.h (or just on this page http://marlinfw.org/docs/hardware/tmc_drivers.html) so that other people don't tear out their hair. Something along these lines:

Note that Trinamic TMC2130 drivers, and potentially other drivers using SPI communication, are very sensitive to noise on the SPI control lines. Moving parts of a printer can build up a static charge that may discharge into your printer's electrical system and cause driver communication errors. This is a known phenomenon in some printers that use polycarbonate wheels running on aluminum extrusions. Ensure that all moving parts of your printer mechanism, including the extruder carriage, are properly grounded to the rest of the frame and to the earth.

Hope this helps! I wouldn't exactly say it's a "fun" experience but it sure has been enlightening :)

klcjr89 commented 5 years ago

This may finally solve the mystery related to SPI attached full graphic display issues and TMC drivers! Awesome!

AnHardt commented 5 years ago

Congratulations. A very good example for systematic debugging.

The problem may be more general. In 3.3V systems the voltage difference between 'high' and 'low' is much lower than in 5V systems. So 'noise' may have a much higher 'volume' before the other 'state' is triggered, in a 5V system.
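A rough way to put numbers on this, as a Python sketch. It assumes input thresholds that scale with the supply (V_IL = 0.3 x VCC, V_IH = 0.7 x VCC, typical CMOS-style figures) and signals that sit exactly at the rails; real parts and real boards will differ, so the ratios are placeholders to replace with datasheet values.

```python
def noise_margin(vcc, vil_ratio=0.3, vih_ratio=0.7):
    """Noise (in volts) needed to push a rail-level signal across the
    opposite input threshold, assuming thresholds scale with the supply."""
    low_margin = vil_ratio * vcc            # idle low pushed above V_IL
    high_margin = vcc - vih_ratio * vcc     # idle high pulled below V_IH
    return low_margin, high_margin

for vcc in (3.3, 5.0):
    low, high = noise_margin(vcc)
    print(f"{vcc} V logic: {low:.2f} V low margin, {high:.2f} V high margin")
```

Under these assumptions the margin is about 1 V at 3.3 V and 1.5 V at 5 V. Both figures are small compared with the 60 Vpp spikes reported earlier in the thread.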

DavidThijs commented 5 years ago

Interesting topic, makes me think that I should also ground my metal frame. I also noticed that TMC2130 stepsticks prefer to have their SPI connections in parallel and not looped from one stepstick to another. I spent a lot of time making jumper cables before I discovered that. Even the ordered jumper cables didn't work on my SKR1.1 board. I always had random errors when issuing an M122 until I made the connection parallel, which is not something you could do with the SKR1.3.

The TMCs are indeed very picky about signal quality, and since inductive loads generate a lot of transients which cannot be easily filtered out, it does surprise me that there aren't more problems reported.

github-actions[bot] commented 4 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.