Coordinate system discrepancy in preview after abort

zultron commented 8 years ago

This bug, discussed on emc-developers, hasn't yet been resolved.

Here are the steps I follow to reproduce the issue:

(See the emc-developers thread for details)

With machine in G54, load and run nc_files/systems.ngc
Stop the program while on the second label, where G55 is active
Preview DROs and origin will be in G55, but mode line and motion will be in G56
This is what I expected to happen:

Mode line and motion should all be in G55, matching preview

This is what happened instead:

Mode line and motion were in G56

It worked properly before this:

This affects the 2.6 branch

Other info:

@SebKuzminsky wrote a unit test demonstrating the bug
@jepler confirmed the bug in 2.6 and 2.7 branches
@SebKuzminsky wrote a nice analysis of the problem (although IIUC his patch in #67 isn't yet ready)
This seems to be related to #49
This may be a duplicate of #126, "coordinate system changed to G54 on Abort and program end"

zultron commented 8 years ago

Actually, the behavior is a little bit more complex than this. Load up nc_files/systems.ngc, and start it. At the moment the G55 label starts cutting, the mode line will read G55, but will update to G56 just a moment later. If the program is stopped at this moment, the status is even stranger than we thought:

Preview reads in G55: DROs, origin
Other parts in G56: Motion (G0 X0Y0), mode line

It seems that there's actually almost no cleanup on abort other than a block-level reset in interp. The mystery is how these parts get out of sync.

SebKuzminsky commented 8 years ago

Hi @zultron, my patches in #67 were merged into 2.6 (and merged up from there into 2.7 and master), but as you say they were not ready - we discovered some fairly serious problems and I reverted them in 2.6 and 2.7 (though they're still in master).

I just returned from a vacation and I too intend to look back into the problems caused by the partial fix.

I think that branch (as amended by fixup commits by @jepler and me; as represented by the current state of the master branch) is on the right path: improve Interp's handling of Abort to do more thorough cleanup, and improve Task's handling of Abort to not discard the Canon cleanup that Interp needs to do.

zultron commented 8 years ago

@SebKuzminsky, welcome back!

I found that the state tags branch takes care of this issue, since the problem stems from interp running ahead of motion.

The original authors intended a pretty light touch on machine state after an abort: between Interp::on_abort() and Interp::reset(), only some block-level cleanup is done in interp. By comparison, simulating M02 on abort is pretty heavy-handed. Hopefully, if that stays in master, if/when the state tags work is merged, there won't be objections to enabling/disabling the M02 behavior from a .ini file setting.

SebKuzminsky commented 8 years ago

Yeah, that "Abort implies M2" behavior is way too heavy handed, and is the cause of some of the bugs that cropped up as a result of that work.

My intent was to have Interp clean up Canon state on Abort, since that's where I perceive the problem to be (since that's where the active coordinate system is tracked). I'm glad to hear the state tags work resolves the issue, but what do you mean when you say the problem stems from Interp running ahead of Motion? Surely Motion doesn't know or care what coordinate system is active, and the coordinate system to return to on Abort is the one Interp had active before it started running the current gcode program (or mdi).

If I've misunderstood the cause of the problem I'd love to be educated so i can fix it right! :-)

If I'm still on the right track, I obviously (in retrospect) overshot my mark and Interp now cleans up way too much on Abort. I need to rein it in -- the current master behavior will not be present in 2.7+1 (with or without an ini setting).

zultron commented 8 years ago

@SebKuzminsky, you're right, motion doesn't know about coordinate systems, but real-time program execution is limited by how fast motion plays out trajectories, whereas interp can queue up commands as fast as it can until hitting either EOF or a queue buster. Running the example from the first comment, nc_files/systems.ngc, you can see interp and canon getting out of synch: the modeline coordinate system, following interp state, runs faster than preview, which follows canon state.

@robEllenberg's state tags work makes the assumption that canon state, which matches the real-time program execution on the physical machine, is the correct state, and after an abort, it is rather interp's state, likely stopped at an arbitrary future point in the program, needs to be reset back to the state of the program where execution was stopped.

That assumption makes sense to me, since it's more likely that a program can be debugged or even restarted if abort-time machine state is preserved. Others have said they expect that behavior from their experience with other controllers, but I don't have that kind of experience myself.

robEllenberg commented 8 years ago

@zultron I agree, with a slight correction that the correct state (i.e. active from the user's perspective) is the one currently executing in motion. Canon state matches the interpreter state, because the last step in interpreting most program lines is to call one or more canon commands. Canon does have some internal state, but conceptually, it's just a translation layer for the interpreter.

Think of state tags as a generalization of line number. Motion knows the line number of a particular command. It doesn't do much with this information, other than report it to emcStatus (and check it when single-stepping). So, all we do with state tags is stuff more information into this field. Now, instead of just reporting the current line, motion presents the entire tag to emcStatus. It doesn't need to know what's inside. All it's saying is "my current motion matches this tag". It is task's responsibility to actually understand the tag's contents when it unpacks the state into active G codes.

Keep in mind that all of what I've said so far is independent of the abort problem. When you actually abort, motion no longer has an active segment, so Task has to get its state from the interpreter. If the program runs to completion, there's no discontinuity (since motion has caught up with the interpreter). However, if we abort early on, then the active state reported will suddenly jump from the aborted line, to the read-ahead line. A user who doesn't know about lookahead will at best be surprised, and at worst do damage.

Here is where the state-tag-restore idea comes in. We want to make the active G code state match the motion line where the program was aborted. Poking directly into the interpreter / canon is asking for trouble. Instead, the restore_from_tag function imitates what restore_settings() already does. That is, construct a sequence of G code that, when executed, will put the interpreter in the state we want. The interpreter is actually interpreting G code commands here, so there's less danger of inconsistent state between interp and canon.

LinuxCNC / linuxcnc