MarlinFirmware / Marlin

Marlin is an optimized firmware for RepRap 3D printers based on the Arduino platform. Many commercial 3D printers come with Marlin installed. Check with your vendor if you need source code for your specific machine.
https://marlinfw.org
GNU General Public License v3.0

Two observed stepper signal discrepancies during diagonal moves #2370

Closed joshwills closed 9 years ago

joshwills commented 9 years ago

During all of the below testing, unless otherwise noted, the movements that showed an issue were from X0 Y0 to X220 Y245, or back to X0 Y0, at F90000 (faster than the machine or firmware can go, but it produced this issue reliably; I've seen this occur at lower speeds, but less frequently). These tests were performed on a MakerGear M2 using a stock RAMBo (basically an Arduino Mega 2560 - 16MHz clock, etc.), and a copy of the main Release branch of Marlin downloaded the morning of 7/2/2015 (most recent change at that point 5 days prior; "commit b1dc722c83864ad4ed61b1d4b79c51a7fb57c01e").

The following changes were made to Marlin before testing:

configuration.h:
-Baud rate set to 115200
-Motherboard set to BOARD_RAMBO
-T0 and T1 thermistors changed to table 0
-Added custom PID values, commented out defaults
-Changed Z home direction to 1 (MAX)
-Disabled software endstops
-Changed XYZ MAX_POS values to match machine
-Steps/mm values changed to {88.88,88.88,400,471.5}
-Changed Z Home and max feedrates to 20mm/s and 25mm/s respectively

Given these settings, and the default max step frequency of 40,000, the actual max XY motor feedrate is ~450mm/s; the default Marlin value of 500mm/s was left for this testing.
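The ~450mm/s figure follows from dividing the step-frequency cap by the steps/mm value. A minimal sketch of that arithmetic, assuming G-code F words are in mm/min (consistent with F18000 being 300mm/s later in this thread); the function and constant names are illustrative and are not Marlin identifiers:

```cpp
// Sketch only: names are hypothetical, numeric values come from the post.
constexpr double kMaxStepFrequencyHz = 40000.0;  // default step-rate cap
constexpr double kXYStepsPerMm       = 88.88;    // X/Y steps/mm

// Highest per-axis speed (mm/s) reachable before the step-rate cap applies.
constexpr double max_feedrate_mm_s(double max_step_hz, double steps_per_mm) {
    return max_step_hz / steps_per_mm;
}

// Convert a G-code F word (mm/min) to mm/s.
constexpr double f_word_to_mm_s(double f_word) {
    return f_word / 60.0;
}

// About 450 mm/s: the cap implied by 40,000 steps/s at 88.88 steps/mm.
constexpr double kXYCapMmS = max_feedrate_mm_s(kMaxStepFrequencyHz, kXYStepsPerMm);

// 1500 mm/s: what F90000 asks for, well above the cap.
constexpr double kRequestedMmS = f_word_to_mm_s(90000.0);
```

Under these numbers the F90000 test moves request more than three times the speed the step-rate cap allows, so the firmware is pinned at its maximum step rate for most of the move.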

Test setup: DS1074Z-S oscilloscope probing the X- and Y-STEP test points between the Atmega2560 chip and the A4982 stepper driver chips, with and without motors physically connected to the drivers. Manual and macro gcode being sent from Pronterface.

Observed behavior: Issue 1: During a diagonal move (primarily from (0,0) to (220,245) for testing) a glitch occurs in the middle of the move, causing the machine to shudder violently, frequently resulting in a layer shift. This glitch is the result of a period of time where the firmware/Atmega2560 does not send STEP signals to the stepper motor driver; as the axes are in motion at the time, this is the equivalent of trying to immediately go from 200+mm/s to 0mm/s, which the masses of the axes and the strength of the steppers cannot support.

The non-signal time is between 32.7 and 32.8ms long (I suspect 32.768ms [16MHz timer with 8 prescaler, 16bit timer overflow period], but cannot measure a gap that large, accurately enough, easily)[1]. Sometimes this glitch occurs multiple times in a row, separated by some number of short periods of correct-signal-sending[2,3]. There seems to be a slight increase of glitch occurrence during the transition from 2 steps/interrupt to 4 steps/interrupt (or 4>2)[11,12], but it also occurs in the middle of a solid block of 4 steps/interrupt signals[3].
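The suspected 32.768ms value can be checked with a short calculation: a 16-bit timer clocked at 16MHz through a /8 prescaler takes 65536 ticks at 2MHz to wrap. A small sketch, with a hypothetical helper name:

```cpp
#include <cstdint>

// Time for an N-bit timer to count through its full range once, in ms.
// Helper name is illustrative; the inputs are the values named in the post.
constexpr double timer_overflow_ms(double f_cpu_hz, uint32_t prescaler, unsigned bits) {
    return static_cast<double>(uint64_t{1} << bits) * prescaler * 1000.0 / f_cpu_hz;
}

// 16 MHz clock, /8 prescaler, 16-bit counter: 32.768 ms per full wrap.
constexpr double kGapMs = timer_overflow_ms(16000000.0, 8, 16);
```

That value falls inside the measured 32.7 to 32.8ms window, which is consistent with the gap lasting about one full timer period.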

Issue 2: While I've investigated this one less, during my testing/investigation of Issue 1 above, I found that there is occasionally a set of max speed commands sent at the end of a single move, after the expected deceleration; generally this is a set of 7 blocks of 4 steps/interrupt, with a 1 step/interrupt block before and after the set of 7[8,9,10].

Investigation for Issue 1: Have tested many different things, including:
-Arduino version: tested 0023, 1.0.5, 1.6.3, and my primary version of 1.5.5, with no apparent difference.
-Disabled all other functions from the main Marlin loop, one by one and all at the same time (so no manage_heater, manage_inactivity, checkHitEndstops or lcd_update), with no apparent difference.
-Tested changing the MAX_STEP_FREQUENCY from 40000 to 35000; rate of glitch occurrence did definitely drop, though the glitch did still occur (went from ~1 glitch in 3 moves, to ~1 in 20).
-Glitch also observed when XY feedrate limited in compiled firmware to 300mm/s, though rate of occurrence did decrease.
-Compared X- and Y-STEP signals against CLKO output, which show timing changes between when steps are being sent, and step idle time; during the glitch period, the CLKO signal continues, while X- and Y-STEP signals do not. I believe that this shows that the timer responsible for step timing/signaling is still active, though my AVR/Arduino/Marlin knowledge is not deep enough to confirm that.[4,5,6,7]
-Briefly tested Repetier-Firmware and saw similar behavior.
-Briefly tested our "stock" Marlin firmware from 2012, and saw similar behavior.

The included pictures are screenshots from the 'scope, showing various tests and measurements of these issues. In all pictures, the yellow trace is CH1/X-STEP, the blue trace is CH2/Y-STEP, and the purple trace (if present) is CH3, CLKO; all three are offset vertically for clarity.

1: A shot showing the gap timing of ~32.7ms (measured with cursors placed by hand/eye).
2: A shot showing the general, correct movement profile with full move timing of ~422ms.
3: A shot showing the same move with two gaps, separated by some steps; total move timing of ~488ms.
4: A shot showing the continuation of the CLKO output during the gap in X- and Y-STEP signals, at the beginning of the gap.
5: Another view of 4.
6: The same as 4, but viewing the end of the gap (X- and Y-STEP signals starting again).
7: A zoom of normal movement showing the timing changes of CLKO during idle, X-STEP, [between X- and Y-STEP], Y-STEP, and back to idle.
8: A shot showing the timing increase as the axes decelerate at the end of a move.
9: A zoom of the end of 8, showing the block of high-speed moves at the end of the deceleration ramp (axes should be on their way to 0mm/s speed/full stop at this point).
10: A further zoom of 9, showing seven 4 step/interrupt blocks with 1 step/interrupt blocks before and after, and then X- and Y-STEP signals stopping completely.
11: A shot showing a gap, zoomed in on the left side, showing the 4 step/interrupt blocks.
12: A shot showing the right side of the gap in 11, showing the 2 step/interrupt blocks.

If there are any other tests you guys would like me to run, or if you have any questions about my testing/setup, please let me know - I've been digging through Marlin to try to figure out what could be happening here, but haven't had any luck, so finally decided to come to you all with this information.


paulusjacobus commented 9 years ago

@maicrodrop does this glitch also appear on slow configurations like printing at 40mm/sec?

joshwills commented 9 years ago

That's one I'll need to check; I don't have a good guess either way, as, while I've not seen any issues like this during normal printing, normal printing is usually interspersed with high-speed non-print moves which can exhibit this issue, so can't really isolate the two sections. Also, 40mm/s is slow enough that the axes likely won't have enough inertia to skip, leaving no physical evidence of the glitch.

I'm running a test right now, moving back and forth from (0,0) to (220,245) at F2400, 'scope set to trigger once on a gap between 32.4 and 33ms wide, which I tested at F90000 and does catch the glitch; I'll let it run for an hour or two, just to see if it catches anything - if it does happen at F2400, I suspect it will be very rare...

Upon further reflection, this could be related to an issue I've seen every once in a while - primarily while homing, and usually on a printer that's been running for a while, the axes will stutter while moving, but not in the "coil disconnected/shorted/motor failed in general" way; even with the number of printers I've worked with, I've only seen that happen once every couple months, and then just briefly, so haven't had a chance to look into it too much. I also haven't found a way to cause it to happen reliably, which further hinders studying it....

EDIT: Alright, the test has been running for ~2.5 hours now (ignoring acceleration time [which is negligible on this move/speed] that's over 1,000 cycles), and there hasn't been a glitch yet. It doesn't prove that the glitch cannot occur at 40mm/s, of course, though 0/~1100 is a pretty good indicator that it's at least pretty rare at this speed.

thinkyhead commented 9 years ago

It's entirely possible that the CPU simply can't keep up with so much stepper activity. I'm not sure where the threshold is, but you could try adding delays into the main loop and main interrupt to simulate a lot of processing and see if it gives similar behavior.

joshwills commented 9 years ago

An update on this testing - I returned to one of the first times I isolated this issue (an in-house compiler change from Arduino IDE 1.5.5 to 1.6.3 started causing skipping in prints), and have found that a similar test as above, but at F18000 instead of F90000 (so 300mm/s, perfectly reasonable for our machine) can show the same glitch. In 1.5.6 and below, the glitch is very rare (less than 1 in 50), though does occur, at F18000; in 1.5.7 and above, the glitch is much more common (1 in 4 or higher) at F18000. It's hard for me to isolate the IDE changes, as there were quite a few, but one major one was a change of avr-gcc versions.

So to summarize - compiler/IDE version change makes this issue significantly worse.

Wackerbarth commented 9 years ago

I really appreciate your detective work -- "I isolated this issue (an in-house compiler change from Arduino IDE 1.5.5 to 1.6.3)"

In going from 1.5.5 to 1.6.3, there are actually 3 areas that change.

1) As you note, the compiler

2) The compiler options

3) The underlying Arduino core

The 1.6.3 IDE allows you to override the compiler path (in platform.local.txt). Could you try extracting the compiler from 1.5.5 to some local location and then using it under the 1.6.3 IDE to see if we can further isolate the problem?

joshwills commented 9 years ago

I did actually isolate it further, as above - in 1.5.6 the firmware compiles and runs fine, with minimal glitching (though extant); in 1.5.7 the glitching is much more pronounced. I'll see about the compiler modification test shortly.

thinkyhead commented 9 years ago

The 1.5.7 changelog, for reference: ARDUINO 1.5.7 BETA - 2014.07.07

joshwills commented 9 years ago

Alright, sorry for the delay here - it took me some time to get to this test, and then some time to actually figure out how to test it... Preliminary results with the following test:

Force Arduino 1.6.3 to use the avr-gcc bin from Arduino 1.5.5 (so using avr-gcc 4.3.2 instead of 4.8.1).

Compiling and uploading our "stock" older Marlin this way shows the same glitch behavior as compiling and uploading from 1.5.5 directly - at F90000 fairly reliable glitching, but at F18000 rare glitching (the latter being the primary difference between stock 1.6.3 and this test configuration).

This seems to point to the compiler change being the factor that made the glitching worse. My next test/investigation will be trying to figure out what has changed between avr-gcc 4.3.2 and 4.8.1, in the hopes that that can point me towards what causes this in the first place.

Wackerbarth commented 9 years ago

"This seems to point to the compiler change being the factor that made the glitching worse." -- First, I suggest that you look at the optimization flags, etc. From some of the things that I have seen on another project, they had to make some changes to get equivalently fast code. (I'm guessing that the newer compiler handles more complex C++ constructs and that the default code generated is slower. But, with appropriate compiler flags, the simplified case can be made to compile into the more efficient form.)

dcnewman commented 9 years ago

Putting the compiler change aside, could this be related to a known issue with Marlin, Grbl, Sailfish, etc. when acceleration_time is very small in the stepper interrupt? The cumulative acceleration_time is very small when moving very fast at the start of a trapezoid's acceleration or deceleration. And, when that value is very small relative to acceleration_rate there can be underflows in

MultiU24X24toH16(acc_step_rate, acceleration_time, current_block->acceleration_rate);

That op does the multiply of acceleration_time by acceleration_rate by escaping to higher integer bit-width and then shifts off the low 24bits of the result. (Integer arithmetic is used to avoid slowing the interrupt down too much.) When acceleration_time is sufficiently small relative to acceleration_rate, that computation produces zero causing the step rate to not change, to neither increase nor decrease. It will continue to produce zero until sufficiently far along the accel or decel ramp (trapezoid leg) that enough time increments have accumulated to make acceleration_time sufficiently large.
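The multiply-and-shift described above can be sketched in portable C++ to show where the zero result comes from. This is not Marlin's implementation (which is written for the AVR); the function names and the sample acceleration_rate of 2,000,000 are assumptions chosen only for illustration:

```cpp
#include <cstdint>

// Portable stand-in for the operation described above: multiply in a wider
// integer type, then discard the low 24 bits of the product.
constexpr uint16_t mul_u24x24_to_h16(uint32_t a, uint32_t b) {
    return static_cast<uint16_t>((static_cast<uint64_t>(a) * b) >> 24);
}

// Smallest acceleration_time for which the result becomes non-zero,
// i.e. the first value where time * rate reaches 2^24. Requires rate > 0.
constexpr uint32_t first_nonzero_time(uint32_t acceleration_rate) {
    return static_cast<uint32_t>(
        ((uint64_t{1} << 24) + acceleration_rate - 1) / acceleration_rate);
}

// Hypothetical rate value, for illustration only.
constexpr uint32_t kExampleRate = 2000000;

// 8 * 2,000,000 = 16,000,000, which is below 2^24 (16,777,216): result 0.
constexpr uint16_t kAtEightTicks = mul_u24x24_to_h16(8, kExampleRate);
// 9 * 2,000,000 = 18,000,000, which is above 2^24: result 1.
constexpr uint16_t kAtNineTicks = mul_u24x24_to_h16(9, kExampleRate);
```

With these sample numbers the computed step-rate change stays at zero until acceleration_time reaches 9; a larger acceleration_rate shortens that dead zone, which is the reasoning behind trying a higher acceleration.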

You could test this theory out by increasing the acceleration for that trapezoid by, say, a factor of 4 or 8 and then see if the duration of the glitch goes down.

On compiler changes, note that since some point in 2012, Sailfish was locked down to building on 4.6.2 of the gcc toolchain owing to problems which arose from changing compilers. In 2011, gcc 4.3 was used with no issues. When attempting to upgrade to either gcc 4.4 or 4.5 some issues were encountered. The issues were identified as a known bug in one of the avr-gcc libraries. Advice was to move to 4.6 which did then resolve that issue. Later, when an attempt was made to use gcc 4.7, a couple of unusual problems arose, one related to the stepper interrupt and which produced, IIRC, some stuttering and unexpected clunks in motion or some sort of motion irregularities. Several solid weeks of investigation were spent to no effect. At one point, another group then in the process of adopting the core of Sailfish spent a week or two with no results either. It was eventually decided to stick with gcc 4.6. It's possible that the cause was/is a bug in Sailfish exacerbated by gcc. The matter was never resolved satisfactorily. Point being, at least one similar firmware with roots in Marlin has seen issues arise when moving to gcc toolchains later than 4.6.

alexborro commented 9 years ago

This issue depends on the optimization of the compiler. It is caused by the time the stepper ISR takes to complete. There are two things you can do to solve it:

1) disable the endstop checking during regular movements. This will save some cycles on ISR.

2) There is a variable in stepper.cpp defining when to use double or quad step. It is set to 10000 by default. Lower it to 9000 or 8000 and try. This will also give some room to ISR execution time limit.
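The effect of lowering that threshold can be sketched as follows. This is a simplified model, not the code in stepper.cpp: the names are hypothetical, and it assumes the quad-step threshold is twice the double-step threshold.

```cpp
#include <cstdint>

// How many steps each ISR call emits, and how often the ISR must then fire.
struct StepPlan {
    int      steps_per_isr;
    uint32_t isr_rate_hz;
};

// Simplified model (assumption): above the double-step threshold emit 2 steps
// per interrupt, above twice that threshold emit 4.
constexpr StepPlan plan_steps(uint32_t step_rate_hz, uint32_t double_threshold_hz) {
    return step_rate_hz > 2 * double_threshold_hz
               ? StepPlan{4, step_rate_hz / 4}
           : step_rate_hz > double_threshold_hz
               ? StepPlan{2, step_rate_hz / 2}
               : StepPlan{1, step_rate_hz};
}

// Time available to each ISR call before the next one is due, in microseconds.
constexpr double isr_budget_us(StepPlan plan) {
    return 1000000.0 / plan.isr_rate_hz;
}
```

In this model a 9,500 steps/s move runs single-stepped at 9,500 interrupts/s with a threshold of 10000, but double-stepped at 4,750 interrupts/s with a threshold of 8000, roughly doubling the time each ISR call has to finish.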

Cheers.

Alex.

joshwills commented 9 years ago

alexborro, I believe you're correct that this is caused by the stepper ISR execution time; I posted a request for input on our forum (http://forum.makergear.com/viewtopic.php?f=10&t=2568) and one user (jsc) almost immediately pointed out that, if the ISR takes longer to execute than the timer interval that triggers it, the interrupt timer will have to overflow again before it can actually trigger the ISR. His simple fix (a single line of code at the end of the ISR, which tests if the interrupt timer is greater than the trigger time, and if so, resets the timer to 0) has tested fairly well so far - I'm going to be implementing it in a production firmware and printing with it shortly. jsc also had some other versions of the fix, which I'll test as well.
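The mechanism jsc describes can be modelled off-target. The sketch below is not his code (which is not reproduced in this thread); it is a simplified model of a 16-bit timer ticking at 2MHz (16MHz with a /8 prescaler) that fires when the counter reaches the compare value, with hypothetical names throughout. The case where the counter exactly equals the compare value at ISR exit is not modelled in detail.

```cpp
#include <cstdint>

constexpr uint32_t kTimerHz = 2000000;  // 16 MHz / 8 prescaler

// Ticks until the counter next equals the compare value. If the counter has
// already passed it, the counter must run to 0xFFFF and wrap first.
constexpr uint32_t ticks_until_match(uint16_t counter, uint16_t compare) {
    return counter <= compare
               ? static_cast<uint32_t>(compare - counter)
               : (65536u - counter) + compare;
}

// Model of the described fix: at ISR exit, if the counter is already past
// the compare value, restart it from 0 so the match is not missed.
constexpr uint16_t apply_reset_fix(uint16_t counter, uint16_t compare) {
    return counter > compare ? static_cast<uint16_t>(0) : counter;
}

constexpr double ticks_to_ms(uint32_t ticks) {
    return ticks * 1000.0 / kTimerHz;
}

// Hypothetical case: next step due at 200 ticks (100 us), but the ISR
// exits with the counter at 230 ticks (115 us).
constexpr uint32_t kWithoutFix = ticks_until_match(230, 200);
constexpr uint32_t kWithFix    = ticks_until_match(apply_reset_fix(230, 200), 200);
```

In this example the unfixed timer waits 65,506 ticks (about 32.75ms) for its next interrupt, which lines up with the measured gap, while the fixed one waits 200 ticks (0.1ms). The cost of the reset is that the late step is delayed slightly rather than dropped into a full timer wrap.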

dcnewman - thank you for the information on Sailfish and that MultiU... issue. I don't think that's valid for the primary issue here, though - I don't think I've ever seen the glitch during the "leg" sections of the trapezoidal accel/decel profile, which from what I understand of your post, is the only time that specific issue would occur. It may explain "issue 2" though, so I'll look into testing it; issue 2 in general was harder to replicate, so that may take some time.

For the Sailfish/compiler information - that is very interesting. I wish there was a better repository for this information in general. I personally started recommending using Arduino IDE 1.5.5 as soon as I started characterizing these issues, so I'm not against simply saying "use this, it works, don't worry about it", even if it does rankle a bit as a (general, semi-) engineer.

As always, thank you all for the input and assistance here - it's been a fun trip so far...

dcnewman commented 9 years ago

@maicrodrop FWIW, in Sailfish we've also had to be careful about missed ISR deliveries. You definitely get a "clunk" sound when issue N+1 of the stepper interrupt cannot be delivered because issue N is still executing. Prevention in Sailfish hasn't been any different than Marlin (until now): limit the max step frequency and take pains to make the stepper interrupt code as efficient as possible. Sailfish, however, has been more conservative, using a lower maximum step rate. From timings back in 2012, the average Sailfish stepper interrupt execution was around 60 - 65 uS and Sailfish has a slight advantage: the extruder steps are done in a different interrupt (so as to allow pressure to be built up or retarded at rates different from the XYZ-space motion). Moreover, in the (smaller) pond that Sailfish swims in, USB is used LESS frequently than SD cards for printing. Consequently, it has been acceptable to make coding decisions which reduce USB I/O interrupt performance in favor of other interrupt-level activities.

What I've been curious about (and this is actually on topic), is how the dc42 branch of the Duet RepRapFirmware is dealing with lost deliveries. That branch doesn't use Bresenham's line algorithm and instead attempts (with a single interrupt) to run each axis at its required speed. I believe that David Crocker (dc42) has taken the same approach that you are looking at: reset the timer to 0 when necessary to prevent a lost delivery. (However, I believe that he has also had to reduce the overall maximum step rate from that used by other branches of the firmware.) [Now, offtopic] I actually have a Duet board in hand and will be installing it in a Core-XY printer soon. I'm curious to see if dropping Bresenham will change in any way some of the very faint printing artifacts we all see from time to time and occasionally people attribute to "aliasing" (beat patterns) in Bresenham.
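For readers unfamiliar with the Bresenham-style stepping mentioned here, the idea is that one interrupt stream is paced by the axis with the most steps, and every other axis steps whenever its error accumulator crosses zero. A minimal sketch under that description, with hypothetical names; it is not the code of Marlin, Sailfish, or RepRapFirmware, and it needs C++14 or later:

```cpp
#include <cstdint>

struct StepCounts {
    uint32_t x;
    uint32_t y;
};

// Walk a two-axis move Bresenham-style and count the steps emitted per axis.
// One loop iteration stands for one step event of the dominant axis.
constexpr StepCounts bresenham_step_counts(uint32_t steps_x, uint32_t steps_y) {
    const uint32_t events = steps_x > steps_y ? steps_x : steps_y;
    int64_t err_x = -static_cast<int64_t>(events / 2);  // start half an event behind
    int64_t err_y = err_x;
    StepCounts emitted{0, 0};
    for (uint32_t i = 0; i < events; ++i) {
        err_x += steps_x;
        if (err_x > 0) { ++emitted.x; err_x -= events; }  // X steps on this event
        err_y += steps_y;
        if (err_y > 0) { ++emitted.y; err_y -= events; }  // Y steps on this event
    }
    return emitted;
}
```

On a 220-by-245 move, Y steps on every event while X skips an event now and then to stay on the line. That uneven spacing on the minor axis is the kind of pattern sometimes blamed for the "aliasing" artifacts mentioned above.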

Wackerbarth commented 9 years ago

@boelle -- This thread is important. Please port it over to MarlinDev and let's continue it there.

boelle commented 9 years ago

moved to: https://github.com/MarlinFirmware/MarlinDev/issues/35

fiveangle commented 6 years ago

What was the outcome of this discussion ?

github-actions[bot] commented 2 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.