Speed up (~1.5x) and fix planner, including judder due to a planner-stepper race

mh-dm commented 2 weeks ago

Description

Significant speed up of planner by reducing SQRTs calls from Planner::recalculate() by ~73%.

Fix potential for significant judder due to a planner-stepper race / planner issue that's likely when there are many fast short segments.

(Edit: this part was split off into #27089) Ensure the existing minimum_planner_speed_sqr limit is correctly enforced and propagated to calculate_trapezoid_for_block(), which:

allows removing MINIMAL_STEP_RATE magic number / undocumented limit.
fixes judder between a segment with (very) low acceleration and a following segment with high acceleration

(Edit: this part to be fully completed as part of #26881) Removing MINIMAL_STEP_RATE allows for the correct handling of moves with low acceleration.

Removes block_flags_t nominal_length and Planner::block_buffer_planned.

Removes Planner::forward_pass() and instead the forward pass is done as part of Planner::recalculate_trapezoids().

Requirements

Applies to all motion.

Benefits

The main benefits are most obvious by looking at the steps with a logic analyzer similar to how it was done in #27013

Before

before_optimal~2

This is a capture with #27013 applied which already fixed other many acceleration spikes.

1) The first acceleration spike/speed discontinuity is due to a planner-stepper race / planner issue within recalculate(). The reverse pass grossly maximizes a block->entry_speed then the forward pass is supposed to come in and fix it so that block->entry_speed can be reached within the previous block assuming full acceleration. However, if the previous block meanwhile becomes busy (the stepper takes it on) then block->entry_speed is not fixed. block is then recomputed and gets a really fast start. The condition for this to happen is that all the segments in the buffer are not enough to get to full speed, i.e. they're still actively planned. Which also means that this issue can happen even if the buffer is more than half full so it can happen even with SLOWDOWN enabled. It was tricky to reliably reproduce this issue and capture it with a logic analyzer so see Configurations.

2) The marked low speed move is supposed to have a very low acceleration (10 steps/s^2) but instead directly starts of at a rate of MINIMAL_STEP_RATE then does a weird accel and decel. It's supposed to be a simple full acceleration.

3) The second acceleration spike/speed discontinuity is due to minimum_planner_speed_sqr limit being lost on the way to calculate_trapezoid_for_block(). Specifically the forward pass limits block->entry_speed based on the previous block's acceleration and if that's low it will fall under minimum_planner_speed_sqr. And if it does we will end up with the second step of a segment having an acceleration way past limits. The first step of a segment is always at intial_rate. If it's too slow, it will result in too much accumulated acceleration_time (see stepper.cpp) and that means a very high calculated second step rate, and the difference is the acceleration spike.

After

after_optimal

1) Planner-stepper race judder is fixed in recalculate_trapezoids(). If the previous block is busy we revert the forward pass entry_speed changes to (the newly added) block->min_entry_speed_sqr. And yes I've ran it 10+ times to make sure.

2) MINIMAL_STEP_RATE is gone and now low acceleration moves are handled correctly.

3) minimum_planner_speed_sqr is correctly passed through and enforced through (the newly added) block->min_entry_speed_sqr.

Speeeeed

recalculate_trapezoids() started at block_buffer_tail and for every single move it calculated SQRT(next->entry_speed_sqr).

So don't do that and instead only do SQRT for the entry speeds of blocks that actually need to be recalculated (there's some complexity to actually achieve this). Some before / after numbers for my test gcode and a more realistic gcode (modified section of a vase mode/spiralize print, 174 moves at 100mm/s):

before
227sqrts 52speed_checks 64calculate_trapezoids
1359sqrts 227speed_checks 370calculate_trapezoids

after
62sqrts 35speed_checks 61calculate_trapezoids
369sqrts 233speed_checks 368calculate_trapezoids

I manually tracked calls instead of just tracking the cycles or timer ticks because the stepper ISR jumps in often and resets the timers. With some math and handwaiving, the 73% reduction in SQRT calls (the most expensive single operation) means about ~1.5x faster planner, at least for the timings I get on my test platform LPC1769.

Configurations

How I got the planner-stepper race issue to semi-reliably reproduce (about 40% of runs):

--- a/Marlin/src/module/planner.cpp
+++ b/Marlin/src/module/planner.cpp
@@ -117,5 +117,5 @@
 // fewer movements. The delay is measured in milliseconds, and must be less than 250ms
 #define BLOCK_DELAY_NONE         0U
-#define BLOCK_DELAY_FOR_1ST_MOVE 100U
+#define BLOCK_DELAY_FOR_1ST_MOVE 0U

 Planner planner;
@@ -1000,4 +1000,5 @@ bool Planner::reverse_pass_kernel(block_t * const current, const block_t * const
       NOMORE(new_entry_speed_sqr, current->max_entry_speed_sqr);
       if (current->entry_speed_sqr != new_entry_speed_sqr) {
+        delayMicroseconds(200);

         // Need to recalculate the block speed - Mark it now, so the stepper

and the gcode used in the capture

G4 P750 ; Delay for capture
G92 X10 Y10
M92 X80 Y80 ; 80 steps per mm
M204 S500
M205 X0.3 Y0.3
G0 F1800 X10.1
G92 Y10 ; sync test
G0 X10.2
G0 X10.4
G0 Y10.1
M204 S10
G0 F1800 X10.5
M204 S1000
G0 F1800 X10.6
M204 S500
G0 F1500 X11.5
G0 F1200 X12
G0 F1800 X13.5
G92 Y10 ; sync test
G0 F1400 X14.1
G0 F1600 X15.2
G92 Y10 ; sync test
G0 F1800 Y10.1
M204 S250
G0 F2000 X15.33
M204 S350
G0 X15.55
M204 S500
G0 X15.85
M204 S750
G0 X16.5
M204 S420
G0 X16.77
M204 S250
G0 X17
G92 Y10 ; sync test
M204 S500
G0 Y10.1
G0 F600 X17.4
G0 F1800 X18.95
G0 F600 X19.3
G0 F1800 X20.9
G92 Y10 ; sync test
G0 F600 X21.3
G0 F1800 X23
G0 F600 X23.4
G92 Y10 ; sync test
G0 F1800 X25.2
G0 F1500 X26
G0 F1200 X27
G0 F1800 X24.5
G0 F6000 X10

Related Issues

Depends on #27013 (edit) Needs to be reconciled with #26881 It seems like this change should fix #26274

tombrazier commented 2 weeks ago

1.5x - impressive.

tombrazier commented 2 weeks ago

As a general comment, having three PRs operating on the same code at the same time is a bit scary. And there is a lot of change to the complex trapezoidal calculation code in this PR which is also a bit scary.

What would help me in a lot in reviewing this is... I think I'd like to see the MINIMAL_STEP_RATE stuff removed from this PR and from #27013 and handled solely in #26881. Then I'd like to see #27013 completed. Then I would like to see this PR broken down into smaller commits each of which makes one logical change that I can get my head around.

I'm quite keen to see this happen because from what I can see so far the idea of massively reducing the number of sqrts is excellent. I wish I'd spotted it myself!

mh-dm commented 2 weeks ago

I agree it's getting complicated. That said, this change ensures that minimum_planner_speed_sqr is correctly passed through and followed such that the resulting initial/final_rate are not degenerate in edge cases, and that's what allows removing NOLESS(initial_rate, uint32_t(MINIMAL_STEP_RATE));.

If you want to see this PR broken down into smaller commits.. it kinda is already. There are 3 commits: 1) 549a4f4 which is exactly #27013. 2) 5717a39 which is ensuring that minimum_planner_speed_sqr is correctly passed through by adding min_entry_speed_sqr. 3) 50a38bb which rewrites the forward pass to be done as part of recalculate(), removes the planner-stepper race issue and removes the sqrt calls.

I'm quite keen to see this happen because from what I can see so far the idea of massively reducing the number of sqrts is excellent. I wish I'd spotted it myself!

I appreciate it.

thinkyhead commented 1 week ago

Another quickie optimization… For values that don't exceed +- 2^30 (i.e., long time periods? counting clock cycles?) it's safe to square before coercing to float (instead of coercing to float before squaring) and this saves some small number of cycles. In case this might differ per platform I've applied a FLOAT_SQ macro to call it out.

mh-dm commented 1 week ago

Another quickie optimization… For values that don't exceed +- 2^30 (i.e., long time periods? counting clock cycles?) it's safe to square before coercing to float (instead of coercing to float before squaring) and this saves some small number of cycles. In case this might differ per platform I've applied a FLOAT_SQ macro to call it out.

For nominal/initial/final_rate I don't think that's generally safe to do. We don't want to overflow int32_t so rates should be under sqrt(2^31) and with a very common (ender3) 80 steps/mm that's ~580mm/s. Uncommon but not unrealistic. With x32 or x64 instead of x16 microstepping that's ~290mm/s or ~150mm/s, and 300mm/s is common for travel moves.

I'd suggest making a check against overflow before integer squaring but that would probably defeat the optimization for some architectures since multiplication is quite fast on most microcontrollers. Regardless, seems like this square optimization is best left for a separate PR.

InsanityAutomation commented 1 week ago

Its actually not that uncommon to see rates well over that nowadays. Im more than happy to test changes against a 1.2m machine with nema23's and 5160s capable of moving over 1m/s. Gives us room to actually see it hit speed even if we widen up the accel curve.

Machine is a little off base with an LPC4078 MCU and compiled with O3 instead of default Os typical... @p3p helped alot getting it up and running, but will eventually get to where that platform gets merged up.

tombrazier commented 1 week ago

For nominal/initial/final_rate I don't think that's generally safe to do. We don't want to overflow int32_t so rates should be under sqrt(2^31) and with a very common (ender3) 80 steps/mm that's ~580mm/s. Uncommon but not unrealistic. With x32 or x64 instead of x16 microstepping that's ~290mm/s or ~150mm/s, and 300mm/s is common for travel moves.

I agree. With and AVR Marlin simply cannot do more than about 8k steps/s but many 32 bit mainboards could easily do over 50k steps/s, at which point the square will overflow an int32.

Regardless, seems like this square optimization is best left for a separate PR.

Also agree. This PR is complex enough already.

sjasonsmith commented 1 week ago

I edited the description to make it easier to know this depends on the other PR, and labeled this as "Don't Merge" for now.

thinkyhead commented 2 days ago

I went ahead and did a merge, but if you want to squash and rebase the simplest way is to fetch the latest bugfix-2.1.x and use git reset like so …

git checkout recalculate
git reset bugfix-2.1.x
git add .
git commit -m "Speed up planner, fix judder, etc."
git push -f

mh-dm commented 2 days ago

@thinkyhead I think you've missed the overflow concerns about FLOAT_SQ expressed here, including by @tombrazier. This is a heads up since I noticed you've already submitted "🧑‍💻 FLOAT_SQ macro" commit 5f96dffb9b50baf1e730c21405a573ad45b45e7d to bugfix-2.1.x branch.

thinkyhead commented 2 days ago

Thanks for the heads up, @mh-dm. I sometimes forget the quirks of float / int maths and the way a modest 32-bit float magically deals with large squared numbers. I'm tempted to go with a cast to 64-bit integer before squaring, but then the compiler probably gets annoyed about the coercion to 32-bit float, and some of the point of optimization gets lost. So for now, just casting to float before squaring, with the macro as a formal reminder of the OoO.

mh-dm commented 1 day ago

Rebased/resynced and force pushed. As both #27013 and #27089 were merged - btw, thank you all for comments, reviews, contributions! - this PR now only includes the last commit which does this part:

Significant speed up of planner by reducing SQRTs calls from Planner::recalculate() by ~73%.
Fix potential for significant judder due to a planner-stepper race / planner issue that's likely when there are many fast short segments.

I've slightly changed the way initial_rate is updated (as per my previous comment) and did some testing to confirm it still works (and I've also compared rebased vs merged code versions as a sanity check).

Note that I've left out removing the MINIMAL_STEP_RATE of 120 as that's to be fully completed as part of #26881.

MarlinFirmware / Marlin