Closed miklschmidt closed 1 month ago
@KevinOConnor I'm taking the opportunity to tag you, as this could potentially have caused many reports of ineffective bed meshing over the years. This issue would produce a virtual twist in the mesh that does not represent the physical world, and the problem would scale with mesh point count.
In our case it's affecting a bunch of our z calibration and compensation routines including our IDEX Z calibration. It is quite severe in some cases.
It looks like this ticket is a request for help (or similar). Many helpful people will not see your message here and you are unlikely to get a useful response. Instead, see the contact directions at: https://www.klipper3d.org/Contact.html
We use github to share the results of work done to improve Klipper. We don't use github for requests. (In particular, we don't use github for feature requests, to answer questions, nor to help diagnose problems with a printer.)
Please follow the directions at: https://www.klipper3d.org/Contact.html
This ticket will be automatically closed.
Best regards, ~ Your friendly GitIssueBot
PS: I'm just an automated script, not a human being.
@Sineos I'm going to assume that was a mistake on your part. Please remove this label; this one is important and affects a vast number of users. It has been a problem for years but has always been dismissed because nobody did the needed investigation. I have provided a comprehensive description of the issue here and steps to reliably reproduce it. This is real. I'm not asking for help, I'm reporting a serious fundamental bug in Klipper.
@miklschmidt, I'm pretty sure you know the rules of the Klipper project regarding GH issues, since you've been hanging around here long enough. While I fully agree that this finding, if validated, could indeed be significant, and you certainly have my appreciation for this report, it ultimately does not change anything in terms of general policy.
You are well aware that Kevin actively follows the Discourse, for example, and you are also aware of the Developer category there. So what is the rationale/justification for actively disregarding the known policy, and how are you going to tell the next person who raises an issue that their issue is less important than yours?
@Sineos Please direct me to the rule that says a legitimate, substantiated, well-researched and well-documented bug report with clear reproduction steps does not belong in the project's GitHub Issues section.
I am aware of how a section of the Klipper docs contact page describes bug reporting (https://www.klipper3d.org/Contact.html#i-found-a-bug-in-the-klipper-software); however, the intention seems to be to filter out reports that are not actually bugs, which I understand. This is not such a situation. It also does not explicitly state that issues should not be posted on GitHub; it's just made purposefully inconvenient. As a project maintainer myself, I understand why you'd want to filter out noise, and you have my utmost sympathy, but this is not noise.
The developer category itself on Discourse has the following description:
Topics for developers working on the Klipper code, a Klipper “extras” module, a program using the Klipper “api server”, and similar.
It does not mention bug reporting, it doesn't even allude to it. Rightfully so, it's the wrong tool for the job.
Discourse is not for issue tracking; it has no such features, it's a discussion forum. A developer's workflow, project management, and the software that facilitates them are deeply integrated with GitHub Issues. Blocking that for legitimate bugs is harmful not only to the project, but to the community as a whole.
I have already discussed this with core contributors, who graciously helped me in debugging and locating this bug. However since the endstop and motion timing logic is quite intricate, we agreed that this needs Kevin's attention.
I did not want or expect to have to get into this with you here; it's noise, the exact thing you're trying to avoid. This is pointless policing.
So what is the rationale/justification for actively disregarding the known policy, and how are you going to tell the next person who raises an issue that their issue is less important than yours?
First of all, I think this issue sets a good example of how bugs should be reported and dealt with. Second, I would very much like to distance myself from this divisive language of "my issue" and who's more important. This bug affects all of us, all Klipper users; stop insinuating it's something I'm doing for personal gain. I didn't need to be here, and I didn't need to invest the time in reporting this. I did it to contribute to Klipper in the spirit of open software.
I honestly don't understand what you're getting at. The link you quote here contains exactly this directive, once at the beginning of the text and again at the end.
In the end, you should have the honesty to say: I knew the rule, I don't like it and I don't agree with today's process because I think the concern / topic is important enough. Arguing in an agitated manner that you have decided to ignore it seems a bit pathetic.
And to emphasise it again: This is completely independent of the content and potential gravity of the issue. And also again, if it is confirmed, then it is a brilliant contribution to Klipper and my full respect for it.
@Sineos Did you just call me pathetic? Seriously?
My position remains that directing people to one place does not prohibit the other place. If you want that to be the case, state clearly and explicitly that creating GitHub issues is prohibited and turn off issues entirely, since they serve no purpose. Consider a proper CONTRIBUTION file explicitly detailing your "policies". Nothing in that one section of the docs currently constitutes a rule that somehow makes it wrong to use the correct tool for the job.
I will not argue with you any further, you're clearly unable to have a civil discussion and you're completely derailing this issue.
Kevin knows this section of code far better than I, but I may have located the cause for the discrepancy.
It appears that MCU_endstop.home_wait() returns the trigger time; however, it does not seem to account for oversampling. Perhaps a patch similar to the following would correct the issue:
diff --git a/klippy/mcu.py b/klippy/mcu.py
index 1122ff865..54aace22a 100644
--- a/klippy/mcu.py
+++ b/klippy/mcu.py
@@ -294,6 +294,7 @@ class MCU_endstop:
         self._home_cmd = self._query_cmd = None
         self._mcu.register_config_callback(self._build_config)
         self._rest_ticks = 0
+        self._oversample_ticks = 0
         self._dispatch = TriggerDispatch(mcu)
     def get_mcu(self):
         return self._mcu
@@ -324,9 +325,11 @@ class MCU_endstop:
         clock = self._mcu.print_time_to_clock(print_time)
         rest_ticks = self._mcu.print_time_to_clock(print_time+rest_time) - clock
         self._rest_ticks = rest_ticks
+        sample_ticks = self._mcu.seconds_to_clock(sample_time)
+        self._oversample_ticks = sample_ticks * (sample_count - 1)
         trigger_completion = self._dispatch.start(print_time)
         self._home_cmd.send(
-            [self._oid, clock, self._mcu.seconds_to_clock(sample_time),
+            [self._oid, clock, sample_ticks,
              sample_count, rest_ticks, triggered ^ self._invert,
              self._dispatch.get_oid(), MCU_trsync.REASON_ENDSTOP_HIT],
             reqclock=clock)
@@ -344,7 +347,9 @@ class MCU_endstop:
             return home_end_time
         params = self._query_cmd.send([self._oid])
         next_clock = self._mcu.clock32_to_clock64(params['next_clock'])
-        return self._mcu.clock_to_print_time(next_clock - self._rest_ticks)
+        return self._mcu.clock_to_print_time(
+            next_clock - self._rest_ticks + self._oversample_ticks
+        )
     def query_endstop(self, print_time):
         clock = self._mcu.print_time_to_clock(print_time)
The above is completely untested, just posted to illustrate what I believe may be causing the issue.
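As a rough illustration of the arithmetic in the patch, the following standalone model (not Klipper code) computes the reported trigger clock with and without the oversampling term. It assumes, as the patch implies, that the MCU's next_clock equals the clock of the first triggered sample plus rest_ticks, and that the trigger is only reported after sample_count - 1 further samples spaced sample_ticks apart. The function name trigger_clock and all clock values below are hypothetical.

```python
def trigger_clock(next_clock, rest_ticks, sample_ticks, sample_count,
                  account_for_oversample=True):
    """Hypothetical model of the trigger clock computed in home_wait().

    Assumes next_clock = first_detection_clock + rest_ticks, and that the
    trigger only fires after (sample_count - 1) additional samples.
    """
    first_detection = next_clock - rest_ticks
    if not account_for_oversample:
        # Pre-patch behaviour: the oversampling window is ignored.
        return first_detection
    # Patched behaviour: add the time spent confirming the trigger.
    return first_detection + sample_ticks * (sample_count - 1)


# Hypothetical numbers: a 100 MHz MCU clock, 15 us samples, 4 samples.
MCU_FREQ = 100_000_000
sample_ticks = round(0.000015 * MCU_FREQ)  # 1500 ticks
rest_ticks = 2_000
next_clock = 1_000_000 + rest_ticks        # first detection at clock 1,000,000

old = trigger_clock(next_clock, rest_ticks, sample_ticks, 4, False)
new = trigger_clock(next_clock, rest_ticks, sample_ticks, 4, True)
print(old, new, new - old)  # 1000000 1004500 4500
```

In this model the two results differ by exactly sample_ticks * (sample_count - 1), which is the gap the patch is meant to close; with sample_count = 1 the correction vanishes.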
@Arksine I was away for quite a while dealing with a fever, but I just read through the mcu.py code and this seems like it should do the trick. If I'm not missing something, it would make trig_pos and halt_pos equal on single-MCU probe moves, while still reflecting the "oversteps" caused by the up to 25 ms communication delay on multi-MCU probe moves, right?
Good catch, man! 👏
@Sineos Just to round this issue out, I have created a Klipper Discourse topic. Hopefully this bug can get some attention and spur an additional look at the Z probing code and logic. In the Beacon support forums we see a lot of really odd Z probe and offset behavior that doesn't seem to make a lot of sense.
@miklschmidt, I'm pretty sure you know the rules of the Klipper project regarding GH issues, since you've been hanging around here long enough. While I fully agree that this finding, if validated, could indeed be significant, and you certainly have my appreciation for this report, it ultimately does not change anything in terms of general policy.
You are well aware that Kevin actively follows the Discourse, for example, and you are also aware of the Developer category there. So what is the rationale/justification for actively disregarding the known policy, and how are you going to tell the next person who raises an issue that their issue is less important than yours?
Would you be able to create a bug-reports section on the Discourse, so bug reports can be created where you want them and not where they actually belong?
@Sineos So where are we supposed to report bugs? Where should PRs be discussed? All I see right now is 111 PRs, and I'm sure a lot of them are stale.
This issue seems to have received a lot of attention (for good reason), but there's some confusion around how to tell if you're affected and how to work around it. I have written a reply here that should clarify this: https://github.com/OpenNeptune3D/OpenNept4une/issues/224
Hope it helps people work around this bug until a fix has been pushed!
Gotta admit, I am so happy to see this getting attention. I feel like every time this has come up in discussions on the Discords, I've been gaslit into thinking there is something mechanically wrong with my machine. It's been super hard to pinpoint what exactly is happening, so thank you @miklschmidt for reporting this in such brilliant detail.
May I also suggest that the Klipper authors re-evaluate where bugs are to be reported. Having a searchable and reliable location such as the GitHub issue tracker is invaluable. There is no good reason to hide issue tracking behind a Discord server, or the Discourse forum where there isn't even a category for reporting bugs.
I am no angel when it comes to that gaslighting; I myself have been using the argument "if it was a bug in Klipper, everyone would have bed meshing issues" for years. I'm not proud of that.
It took substantial effort and testing to find this bug. It was only because we were developing a Z calibration hardware module that kept showing drift over time, no matter how many variables we isolated, that I finally realized something had to be wrong. What I'm trying to say is that I hope people aren't too hard on those who offer their free time to help and support others in their Klipper journey. They were not intentionally gaslighting anyone: there was no obvious indication or reasonable way to tell that the bug was deeply embedded in Klipper, and without the help of smarter people, I would probably have given up before it got to this point.
So be nice to your fellow Klipper supporter, they're doing the best they can!
If i'm not missing something, it would make difference between trig_pos and halt_pos equal on single-mcu probe moves, while still reflecting the "oversteps" caused by the up to 25ms delay in communication on multi-mcu probe moves, right?
I believe so. Today I was able to reproduce the issue locally and confirm that my suggested patch seems to resolve the deviation. I created a PR (linked above) with the fix if anyone else would like to give it a test and confirm that it works.
I tested it. The deviation seems to be gone.
The github issue tracker is no longer used for any purpose. It's certainly possible there is a defect in the homing code. I'm happy to discuss that code and/or to apply fixes to it. However, that conversation will need to move to one of the supported forums (as described at https://www.klipper3d.org/Contact.html ).
The Discourse thread or PR mentioned above seem like good candidates.
-Kevin
Microstep Shift in Z-Axis Probing
Issue Description
During Z-axis probing operations, there is a consistent microstep difference between the trigger position and the halt position that should not be present; its size depends on the driver microstep setting and the probing speed. This difference is added to the commanded stepper position, which causes it to accumulate over multiple probe moves, resulting in inaccurate bed meshes (with probes that require movement in Z) and potentially affecting first layer consistency and printed geometry.
Reproduction Steps
1. Apply the patch from the Relevant Code section to log the microstep difference.
2. Run PROBE (and optionally GET_POSITION) repeatedly at the same x/y position.
3. Run grep "Probe move" ~/printer_data/logs/klippy.log to show the step difference, and grep "Setting toolhead position" ~/printer_data/logs/klippy.log to show the continuous drift.
Note: this may require a sufficiently high-resolution Z axis or faster probe speeds to reproduce. The bug is consistently reproduced with TR8x4 leadscrews (a rotation distance of 4 mm), 1.8 degree steppers, 64 microsteps and a 3 mm/s probing speed. Anything that generates more steps per second will show the issue; the microstep shift scales with steps per second.
Observed Behavior
Expected Behavior
Trigger and halt positions should align without a consistent shift during probing operations. I'm unsure about multi-MCU setups, but the issue is likely present there as well.
Additional Observations
Relevant Code
The microstep difference was logged by applying the following patch:
Sample Data
With 64 microsteps and 3 mm/s probe speed, the following log output demonstrates the issue:
The difference between trigger and halt is consistently 1 microstep and is added to the commanded position after each probe move. We accumulated 300 microns of drift by repeated probing and realized something was wrong.
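To put the 300 micron figure in perspective, here is a simplified model that assumes every probe move adds a fixed shift to the commanded position and that the shift never cancels out. It reuses the microstep size of the configuration from the reproduction note (200 full steps, 64 microsteps, 4 mm rotation distance). The helper names are hypothetical.

```python
import math

# Configuration from the reproduction note: 200 full steps/rev (1.8 degree),
# 64 microsteps, 4 mm rotation distance.
MICROSTEPS_PER_MM = 200 * 64 / 4.0   # 3200 microsteps per mm


def accumulated_drift_um(probe_count, shift_microsteps=1):
    """Drift in microns if every probe move adds shift_microsteps to the
    commanded position (simplified model: the shift never cancels out)."""
    return probe_count * shift_microsteps * 1000 / MICROSTEPS_PER_MM


def probes_for_drift(drift_um, shift_microsteps=1):
    """Smallest number of probe moves that accumulates at least drift_um."""
    total_microsteps = drift_um * MICROSTEPS_PER_MM / 1000
    return math.ceil(total_microsteps / shift_microsteps)


print(accumulated_drift_um(100))  # 31.25
print(probes_for_drift(300))      # 960
```

Under this model, 100 probe moves (for example a 10x10 mesh probed once per point) already add about 31 microns, and roughly 960 probes at 1 microstep each would account for 300 microns. This is consistent with the observation that the problem scales with mesh point count.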
With 64 microsteps, 3 mm/s probe speed and ENDSTOP_SAMPLE_COUNT = 32, the difference between trigger and halt is consistently 4-5 microsteps and is added to the commanded position after each probe move.
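One way to sanity-check the 4-5 microstep figure: if the shift comes from the time the MCU spends oversampling the endstop, it should be roughly steps_per_second * ENDSTOP_SAMPLE_TIME * (ENDSTOP_SAMPLE_COUNT - 1). This sketch assumes ENDSTOP_SAMPLE_TIME is 15 us (the value I believe klippy/extras/homing.py uses; please verify against your tree) and the 9600 steps/s of the reported setup. The function name is hypothetical.

```python
STEPS_PER_SECOND = 9600          # 3200 microsteps/mm * 3 mm/s probe speed
ENDSTOP_SAMPLE_TIME = 0.000015   # assumed 15 us; verify in klippy/extras/homing.py


def oversample_shift_microsteps(sample_count,
                                sample_time=ENDSTOP_SAMPLE_TIME,
                                steps_per_second=STEPS_PER_SECOND):
    """Estimated microsteps travelled while the MCU confirms a trigger,
    i.e. during (sample_count - 1) extra samples of sample_time each."""
    return steps_per_second * sample_time * (sample_count - 1)


print(round(oversample_shift_microsteps(32), 2))  # 4.46
```

An estimate of about 4.5 microsteps for ENDSTOP_SAMPLE_COUNT = 32 falls within the observed 4-5 microstep range, which supports the oversampling window as a likely contributor.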
Possible Causes
ENDSTOP_SAMPLE_TIME * (ENDSTOP_SAMPLE_COUNT - 1) delay in position sampling.
stepper.get_mcu_position()) and any difference between the two is added to stepper.get_commanded_position(). I haven't dug into the stack enough to know if this is relevant, but maybe any difference between trigger pos and halt pos is already reflected in the system state, so it may not need to be added to the toolhead position at all?
System Information
master
Possible Solutions