miklschmidt commented 1 month ago

Microstep Shift in Z-Axis Probing

Issue Description

During Z-axis probing operations, there's a consistent microstep difference between the trigger position and the halt position that should not be present, depending on driver microstep settings and probing speed. This difference is added to the commanded stepper position which causes it to accumulate over multiple probe moves, resulting in inaccurate bed meshes (with probes that require movement in Z) and potentially affecting first layer consistency and printed geometry.

Reproduction Steps

Configure Z steppers to use 64 microsteps
Set probing speed to 3 mm/s
Apply diff in the relevant code section to log the microstep difference.
Perform Z probing operations (e.g loop PROBE and optionally GET_POSITION at the same x/y position).
grep "Probe move" ~/printer_data/logs/klippy.log to show the step difference, and grep "Setting toolhead position" ~/printer_data/logs/klippy.log, to show the continuous drift.

Note: this may require a sufficiently high resolution z-axis or faster probe speeds to reproduce, this bug is consistently reproduced with TR8*4 leadscrews (a rotation distance of 4mm), 1.8 degree steppers, 64 microsteps and 3 mm/s of probing speed. Anything resulting in more steps generated pr sec will show the issue. The microstep shift scales with steps per second.

Observed Behavior

A consistent microstep shift occurs between trigger and halt positions during probing
The shift accumulates with each probe move, moving the nozzle closer to the bed
Bed mesh results are shifted, with error scaling with point count and probe samples

Expected Behavior

Trigger and halt positions should align without a consistent shift during probing operations. Unsure about multi-mcu setups, but the issue would be present there as well.

Additional Observations

The issue is specific to probing; homing exhibits the same shift but doesn't have major real-world impact as the position is reset, so no accumulation.
Reducing microsteps to 16 eliminates the issue when probing at 3 mm/s.
Driver interpolation does not affect the outcome.
Setting ENDSTOP_SAMPLE_COUNT to 1 eliminates the issue at the cost of disabling noise filtering (which seems unnecessary on modern printer hardware)
Increasing ENDSTOP_SAMPLE_COUNT exacerbates the problem (e.g., 32 samples increases error to 4-5 microsteps per probe move in our case)
The issue has been reproduced on several machines with different stepper drivers (TMC2209, TMC5160)
Some setups exhibit larger errors than others despite using the same stepper and driver configurations
Increased probing speeds may increase the error.
There are possible edge cases where X and Y axes may drift as well.

Relevant Code

The microstep difference was logged by applying the following patch:

diff --git a/klippy/extras/homing.py b/klippy/extras/homing.py
index 06b52f1ec..a4a3051b4 100644
--- a/klippy/extras/homing.py
+++ b/klippy/extras/homing.py
@@ -121,6 +121,7 @@ class HomingMove:
             haltpos = trigpos = self.calc_toolhead_pos(kin_spos, trig_steps)
             if trig_steps != halt_steps:
                 haltpos = self.calc_toolhead_pos(kin_spos, halt_steps)
+            logging.info("Probe move: trigger at %s -> halt at %s", trig_steps, halt_steps)
         else:
             haltpos = trigpos = movepos
             over_steps = {sp.stepper_name: sp.halt_pos - sp.trig_pos
@@ -130,6 +131,8 @@ class HomingMove:
                 halt_kin_spos = {s.get_name(): s.get_commanded_position()
                                  for s in kin.get_steppers()}
                 haltpos = self.calc_toolhead_pos(halt_kin_spos, over_steps)
+            logging.info("Homing move: trigger at %s -> halt at %s (%s over steps)", trigpos, haltpos, over_steps)
+        logging.info("Setting toolhead position to %s (was: %s)", haltpos, self.toolhead.get_position())
         self.toolhead.set_position(haltpos)
         # Signal homing/probing move complete
         try:

Sample Data

With 64 microsteps and 3 mm/s probe speed, the following log output demonstrates the issue:

Probe move: trigger at {'stepper_z': -9578, 'stepper_z1': -9578, 'stepper_z2': -9578} -> halt at {'stepper_z': -9579, 'stepper_z1': -9579, 'stepper_z2': -9579}
Probe move: trigger at {'stepper_z': -9602, 'stepper_z1': -9602, 'stepper_z2': -9602} -> halt at {'stepper_z': -9603, 'stepper_z1': -9603, 'stepper_z2': -9603}
Probe move: trigger at {'stepper_z': -9601, 'stepper_z1': -9601, 'stepper_z2': -9601} -> halt at {'stepper_z': -9602, 'stepper_z1': -9602, 'stepper_z2': -9602}
Probe move: trigger at {'stepper_z': -9593, 'stepper_z1': -9593, 'stepper_z2': -9593} -> halt at {'stepper_z': -9594, 'stepper_z1': -9594, 'stepper_z2': -9594}
Probe move: trigger at {'stepper_z': -9595, 'stepper_z1': -9595, 'stepper_z2': -9595} -> halt at {'stepper_z': -9596, 'stepper_z1': -9596, 'stepper_z2': -9596}

The difference between trigger and halt is consistently 1 microstep and is added to the commanded position after each probe move. We accumulated 300 microns of drift by repeated probing and realized something was wrong.

64 microsteps, 3 mm/s probe speed and ENDSTOP_SAMPLE_COUNT = 32.

Probe move: trigger at {'stepper_z': -9594, 'stepper_z1': -9594, 'stepper_z2': -9594} -> halt at {'stepper_z': -9599, 'stepper_z1': -9599, 'stepper_z2': -9599}
Probe move: trigger at {'stepper_z': -9585, 'stepper_z1': -9585, 'stepper_z2': -9585} -> halt at {'stepper_z': -9590, 'stepper_z1': -9590, 'stepper_z2': -9590}
Probe move: trigger at {'stepper_z': -9595, 'stepper_z1': -9595, 'stepper_z2': -9595} -> halt at {'stepper_z': -9599, 'stepper_z1': -9599, 'stepper_z2': -9599}
Probe move: trigger at {'stepper_z': -9593, 'stepper_z1': -9593, 'stepper_z2': -9593} -> halt at {'stepper_z': -9598, 'stepper_z1': -9598, 'stepper_z2': -9598}
Probe move: trigger at {'stepper_z': -9595, 'stepper_z1': -9595, 'stepper_z2': -9595} -> halt at {'stepper_z': -9600, 'stepper_z1': -9600, 'stepper_z2': -9600}
Probe move: trigger at {'stepper_z': -9599, 'stepper_z1': -9599, 'stepper_z2': -9599} -> halt at {'stepper_z': -9604, 'stepper_z1': -9604, 'stepper_z2': -9604}
Probe move: trigger at {'stepper_z': -9599, 'stepper_z1': -9599, 'stepper_z2': -9599} -> halt at {'stepper_z': -9603, 'stepper_z1': -9603, 'stepper_z2': -9603}

The difference between trigger and halt is consistently 4-5 microsteps and is added to the commanded position after each probe move.

Possible Causes

The issue seems related to the ENDSTOP_SAMPLE_TIME * (ENDSTOP_SAMPLE_COUNT - 1) delay in position sampling.
halt_pos and trig_pos is sourced from 2 different places in the pipeline, trig_pos is from the compressed steps, halt_pos is from itersolve (using stepper.get_mcu_position()) and any difference between the two is added to stepper.get_commanded_position(). I haven't dug into the stack enough to know if this is relevant, but maybe any difference between trigger pos and halt pos is already reflected in the system state, so it may not need to be added to the toolhead position at all? In
The endstop oversampling was implemented to address noise in endstop signals on older machines (approximately 7 years ago), and may be better suited as an optional feature.

System Information

Klipper version: master
Stepper motors: 1.8 degree LDO-42STH48-2504AC
Stepper drivers: TMC2209 (also reproduced with TMC5160)
Microstepping: 64
Axis rotation distance: 4 mm
Probing speed: 3 mm/s

Possible Solutions

Investigate whether the code currently applies the difference between halt_pos and trigger_pos to stepper state that already reflects halt_pos and not trigger_pos.
Evaluate the necessity of the current endstop oversampling approach for modern hardware and perhaps make it an opt-in with the caveat that it may produce drift in sequential probing on high resolution axes and/or fast probe speeds.

miklschmidt commented 1 month ago

@KevinOConnor i'm taking the opportunity to tag you as this could potentially have caused many reports about ineffective bed meshing over the years. This issue would produce a virtual twist in the mesh that does not represent the physical world, and the problem would scale with mesh point count.

In our case it's affecting a bunch of our z calibration and compensation routines including our IDEX Z calibration. It is quite severe in some cases.

github-actions[bot] commented 1 month ago

It looks like this ticket is a request for help (or similar). Many helpful people will not see your message here and you are unlikely to get a useful response. Instead, see the contact directions at: https://www.klipper3d.org/Contact.html

We use github to share the results of work done to improve Klipper. We don't use github for requests. (In particular, we don't use github for feature requests, to answer questions, nor to help diagnose problems with a printer.)

Please follow the directions at: https://www.klipper3d.org/Contact.html

This ticket will be automatically closed.

Best regards, ~ Your friendly GitIssueBot

PS: I'm just an automated script, not a human being.

miklschmidt commented 1 month ago

@Sineos I'm gonna assume that was a mistake on your part. Please remove this label, this one is important, affecting a vast number of users. It has been a problem for years, but has always been dismissed because nobody did the needed investigation. I have provided a comprehensive description of the issue here and steps to reliably reproduce it. This is real, i'm not asking for help, i'm reporting a serious fundamental bug in klipper.

Sineos commented 1 month ago

@miklschmidt, I'm pretty sure you know the rules of the Klipper project regarding GH issues, since you've been hanging around here long enough. While I fully agree that this finding, if validated, could indeed be significant, and you certainly have my appreciation for this report, it ultimately does not change anything in terms of general policy.

You are well aware that Kevin actively follows the Discourse, for example, and you are also aware of the Developer category there. So what is the rational/justification for actively disregarding the known policy, and how are you going to tell the next person who raises an issue that their issue is less important than yours?

miklschmidt commented 1 month ago

@Sineos please direct me to this rule that says how a legitimate, substantiated, well researched and well documented bug report with clear reproduction steps does not belong in the project's Github Issues section.

I am aware of how a section of the klipper docs contact page describes bug reporting (https://www.klipper3d.org/Contact.html#i-found-a-bug-in-the-klipper-software), however the intention seems to be to filter out reports that are not actually bugs, which i understand. This is not such as situation, it also does not explicitly state that issues should not be posted on Github, it's just made purposefully inconvenient. As a project maintainer myself, I understand why you'd want to filter out noise, you have my utmost sympathy, but this is not noise.

The developer category itself on Discourse has the following description:

Topics for developers working on the Klipper code, a Klipper “extras” module, a program using the Klipper “api server”, and similar.

It does not mention bug reporting, it doesn't even allude to it. Rightfully so, it's the wrong tool for the job.

Discourse is not for issue tracking, it has no such features, it's a discussion forum. A developers workflow and project management and the software that facilitates it is deeply integrated with Github issues. Blocking that for legitimate bugs is not only harmful to the project, but to the community as a whole.

I have already discussed this with core contributors, who graciously helped me in debugging and locating this bug. However since the endstop and motion timing logic is quite intricate, we agreed that this needs Kevin's attention.

I did not want or expect to have to get into this with you here, it's noise - the exact thing you're trying to avoid. This is pointless policing.

So what is the rational/justification for actively disregarding the known policy, and how are you going to tell the next person who raises an issue that their issue is less important than yours?

First of all, I think this issue sets a good example of how bugs should be reported and dealt with. Second of all, i would very much like to distance myself from this divisive language of "my issue" and who's more important. This bug affects all of us, all klipper users, stop insinuating it's something i'm doing for personal gain. I didn't need to be here, i didn't need to invest the time in reporting this, i did that to contribute to klipper in the spirit of open software.

Sineos commented 1 month ago

I honestly don't understand what you're getting at. The link you quote here contains exactly this directive, once at the beginning of the text and again at the end.

In the end, you should have the honesty to say: I knew the rule, I don't like it and I don't agree with today's process because I think the concern / topic is important enough. Arguing in an agitated manner that you have decided to ignore it seems a bit pathetic.

And to emphasise it again: This is completely independent of the content and potential gravity of the issue. And also again, if it is confirmed, then it is a brilliant contribution to Klipper and my full respect for it.

miklschmidt commented 1 month ago

@Sineos Did you just call me pathetic? Seriously?

My position remains that directing to one place does not prohibit the other place, if you want that to be the case, state clearly and explicitly that creating Github issues is prohibited and turn off issues entirely since they serve no purpose. Consider a proper CONTRIBUTION file explicitly detailing your "policies". None of that one section in the docs currently constitutes a rule which somehow makes it wrong to use the correct tool for the job.

I will not argue with you any further, you're clearly unable to have a civil discussion and you're completely derailing this issue.

Arksine commented 1 month ago

Kevin knows this section of code far better than I, but I may have located the cause for the discrepancy.
It appears that MCU_endstop.home_wait() returns the trigger time, however it does not seem to account for oversampling. Perhaps a patch similar to the following would correct the issue:

diff --git a/klippy/mcu.py b/klippy/mcu.py
index 1122ff865..54aace22a 100644
--- a/klippy/mcu.py
+++ b/klippy/mcu.py
@@ -294,6 +294,7 @@ class MCU_endstop:
         self._home_cmd = self._query_cmd = None
         self._mcu.register_config_callback(self._build_config)
         self._rest_ticks = 0
+        self._oversample_ticks = 0
         self._dispatch = TriggerDispatch(mcu)
     def get_mcu(self):
         return self._mcu
@@ -324,9 +325,11 @@ class MCU_endstop:
         clock = self._mcu.print_time_to_clock(print_time)
         rest_ticks = self._mcu.print_time_to_clock(print_time+rest_time) - clock
         self._rest_ticks = rest_ticks
+        sample_ticks = self._mcu.seconds_to_clock(sample_time)
+        self._oversample_ticks = sample_ticks * (sample_count - 1)
         trigger_completion = self._dispatch.start(print_time)
         self._home_cmd.send(
-            [self._oid, clock, self._mcu.seconds_to_clock(sample_time),
+            [self._oid, clock, sample_ticks,
              sample_count, rest_ticks, triggered ^ self._invert,
              self._dispatch.get_oid(), MCU_trsync.REASON_ENDSTOP_HIT],
             reqclock=clock)
@@ -344,7 +347,9 @@ class MCU_endstop:
             return home_end_time
         params = self._query_cmd.send([self._oid])
         next_clock = self._mcu.clock32_to_clock64(params['next_clock'])
-        return self._mcu.clock_to_print_time(next_clock - self._rest_ticks)
+        return self._mcu.clock_to_print_time(
+            next_clock - self._rest_ticks + self._oversample_ticks
+        )
     def query_endstop(self, print_time):
         clock = self._mcu.print_time_to_clock(print_time)

The above is completely untested, just posted to illustrate what I believe may be causing the issue.

miklschmidt commented 1 month ago

@Arksine Was away for quite a while, dealing with a fever, but i just read through the mcu.py code and this seems like it should do the trick. If i'm not missing something, it would make difference between trig_pos and halt_pos equal on single-mcu probe moves, while still reflecting the "oversteps" caused by the up to 25ms delay in communication on multi-mcu probe moves, right?

Good catch, man! 👏

YanceyA commented 1 month ago

@Sineos Just to round this issue out I have created a klipper discourse topic. Hopefully this bug can get some attention and spur an additional look at the z probing code and logic. In the beacon support forums we see a lot of really odd z probe and offset behavior that doesn't seems to make a lot of sense.

https://klipper.discourse.group/t/bug-accumulating-microstep-shift-during-probe-moves-related-to-endstop-oversampling/19429

Zeanon commented 1 month ago

@miklschmidt, I'm pretty sure you know the rules of the Klipper project regarding GH issues, since you've been hanging around here long enough. While I fully agree that this finding, if validated, could indeed be significant, and you certainly have my appreciation for this report, it ultimately does not change anything in terms of general policy.

You are well aware that Kevin actively follows the Discourse, for example, and you are also aware of the Developer category there. So what is the rational/justification for actively disregarding the known policy, and how are you going to tell the next person who raises an issue that their issue is less important than yours?

Would you be able to create a bug-reports section in the discourse so bug reports can be created where you want to have them and not where they actually belong?

nelsongraca commented 1 month ago

@Sineos so where are we supposed to report bugs? where should PRs be discussed? because all I see is 111 PRs right now and I'm sure a lot of them are stale

miklschmidt commented 1 month ago

This issue seems to have received a lot of attention (for good reason), but there's some confusion around how to tell if you're affected and how to work around it. I have written a reply here that should clarify this: https://github.com/OpenNeptune3D/OpenNept4une/issues/224

Hope it helps people work around this bug until a fix has been pushed!

EllaFoxo commented 1 month ago

Gotta admit, I am so happy to see this getting getting attention. I feel like every time this has come up in discussions on the Discords, I've been gaslit into thinking there is something mechanically wrong with my machine. It's been super hard to pinpoint what exactly is happening, so thank you @miklschmidt for your reporting in such brilliant detail.

May I also suggest that the Klipper authors re-evaluate where bugs are to be reported. Having a searchable and reliable location such as the GitHub issue tracker is invaluable. There is no good reason to hide issue tracking behind a Discord server, or the Discourse forum where there isn't even a category for reporting bugs.

miklschmidt commented 1 month ago

I am no angel when it comes to that gaslighting, i myself have been using the argument "if it was a bug in klipper, everyone would have bed meshing issues" for years. I'm not proud of that.

It took substantial effort and testing to find this bug, it was only because we were developing a z calibration hardware module that kept showing drift over time no matter how many variables we isolated, that i finally realized that something had to be wrong. What i'm trying to say is that i hope people aren't too hard on the people who offer their free time to help and support people in their klipper journey, they were not intentionally gaslighting anyone, there was no obvious indication or reasonable way to tell that the bug was deeply embedded in klipper, and without the help of smarter people, i would probably have given up before it got to this point.

So be nice to your fellow klipper supporter, they're doing the best they can!

Arksine commented 1 month ago

If i'm not missing something, it would make difference between trig_pos and halt_pos equal on single-mcu probe moves, while still reflecting the "oversteps" caused by the up to 25ms delay in communication on multi-mcu probe moves, right?

I believe so. Today I was able to reproduce the issue locally and confirm that my suggested patch seems to resolve it the deviation. I created a PR (linked above) with the fix if anyone else would like to give it a test and confirm that it works.

Zeanon commented 1 month ago

I tested it Deviation seems to be gone

KevinOConnor commented 1 month ago

The github issue tracker is no longer used for any purpose. It's certainly possible there is a defect in the homing code. I'm happy to discuss that code and/or to apply fixes to it. However, that conversation will need to move to one of the supported forums (as described at https://www.klipper3d.org/Contact.html ).

The Discourse thread or PR mentioned above seem like good candidates.

-Kevin

Klipper3d / klipper

[BUG] Accumulating microstep shift during probe moves related to endstop oversampling. #6711

Microstep Shift in Z-Axis Probing

Issue Description

Reproduction Steps

Observed Behavior

Expected Behavior

Additional Observations

Relevant Code

Sample Data

Possible Causes

System Information

Possible Solutions