return from remap does not bust queue

rene-dev commented 5 years ago

Here are the steps I follow to reproduce the issue:

1. after restore in line 62 in configs/sim/axis/remap/manual-toolchange-with-tool-length-switch/nc_subroutines/manual_change.ngc in add any gcode, for example g0 x0 2. run configs/sim/axis/remap/manual-toolchange-with-tool-length-switch/manualtoolchange.ini 3. enable, home, and run program 4. click change complete when it waits for the toolchange. Do not press simulate contact, wait for it to fail with -2 5. g0 x0 is not executed. inside the if, only the return is executed. 6. code is executed, if you add m66 p0 l0 a queue buster before the return.

without this, it is not possible to recover from toolchanger failures, or put the machine in a safe state after a toolchange failure.

This is what I expected to happen:

code to be run before return.

This is what happened instead:

code not executed, return returns during readahead.

It worked properly before this:

I dont think this ever worked. I tried 2.7 and 2.8-pre

Information about my hardware and software:

2.8-pre

andypugh commented 5 years ago

I think this also causes #579

rene-dev commented 5 years ago

I dont think its related, it also happens without ON_ABORT_COMMAND.

rene-dev commented 5 years ago

https://github.com/LinuxCNC/linuxcnc/blob/master/src/emc/rs274ngc/interp_python.cc#L21

zultron commented 5 years ago

I ran your procedure to reproduce this error. Thanks for providing the clear steps: with the 'click change complete' and other details, I was able to run it with DEBUG=2147483647 and see what was happening pretty quickly.

Interp::read:|    G0 X10|
[...]
emcTaskPlanCommand(    G0 X10) called. (line_number=63)
execute:auto line='    G0 X10' mdi_int=0 o_type=O_none o_name=(null) cl=1 rl=1 type=*unset* state=CS_NORMAL
NML_INTERP_LIST(0x5624cf2c7a00)::append(nml_msg_ptr{size=128,type=EMC_TRAJ_LINEAR_MOVE}) : list_size=1, line_number=63

The G0 X10 command went onto the queue. So far, so good.

[...]
Interp::read:|O300    return [-2] ; indicate probe contact failure to epilog|
In: read_o line:-1 |o300return[-2]; indicate probe contact failure to epilog| subroutine=|300|
global case:|300|
o_type:O_return o_name: 300  line:-1 o300return[-2]; indicate probe contact failure to epilog
return 300 value -2.000000
[...]
emcTaskPlanCommand(O300    return [-2] ; indicate probe contact failure to epilog) called. (line_number=64)
execute:auto line='O300    return [-2] ; indicate probe contact failure to epilog' mdi_int=0 o_type=O_return o_name=300 cl=1 rl=1 type=*unset* state=CS_NORMAL
convert_control_functions O_return
execute_return manual_change type=CT_REMAP state=CS_NORMAL
pycall(remap.change_epilog) 

unwind_call: call_level=1 status=INTERP_ERROR - error: 5 from execute emc/rs274ngc/rs274ngc_pre.cc:523
unwind_call leaving sub ''
unwind_call: reopening './nc_files/tcdemo.ngc' at 10
unwind_call: setting sequence number=0 from frame 1
unwind_call: exiting current sub 'manual_change'

emc/task/emctask.cc 405: interp_error: M6 aborted (return code -2.0)
M6 aborted (return code -2.0)
Interpreter stack:   - Python  - int Interp::handler_returned(setup_pointer, context_pointer, const char*, bool)  - int Interp::execute_return(setup_pointer, context_pointer, int)  - int Interp::convert_control_functions(block_pointer, setup_pointer)  - int Interp::_execute(const char*) 
[...]

The O300 return [-2] line (should really be o<manual_change> return [-2]) returns from the subroutine signaling an error (negative return code), and interp aborts.

NML_INTERP_LIST(0x5624cf2c7a00)::clear(): discarding 1 items
on_abort reason=9 message='interpreter error'
unwind_call: call_level=0 status=INTERP_OK - 0 from reset emc/rs274ngc/rs274ngc_pre.cc:1758
Interp::read:[cmd]:|O <on_abort> call [9]|
In: read_o line:-1 |o<on_abort>call[9]| subroutine=|on_abort|
global case:|on_abort|

As part of the abort, the interp list is cleared, explaining why the G0 X10 command disappeared.

This actually all looks exactly as we would expect when a normal, non-remapped block returns an error: the error triggers an abort, which for safety stops motion and clears the interp queue (including any queued motion commands).

I think this issue is about the counter-intuitive behavior resulting from the M6 NGC subroutine returning a negative result to signal an error in the remapped block. We expect the subroutine to execute everything queued up before the end of the sub, and only abort anything following the Q300 return. Instead, the abort happens only after the last queue buster, which can be many blocks before the return [-2] block.

As you show, adding the M66 block before the return [-2] is a good workaround.

A proper fix in the remap code would catch the negative return code coming out of a failed remap and bust the queue before returning the error. I think this shouldn't happen after a successful M6 remap, but there could be an argument for that.

Look at nc_files/remap_lib/python-stdglue/stdglue.py where change_epilog() reports the error and yields INTERP_ERROR around line 198. It might be possible to add a yield INTERP_EXECUTE_FINISH before that. A few years ago I fixed a problem where python remaps barfed after yielding too many times, but I'm not sure if it has been very rigorously exercised, so YMMV.

It would be great to have a PR with an accompanying test.

zultron commented 4 years ago

Looking at this again, it seems the target for the queue buster shouldn't be ALL remap commands, just the M6. Hack in special logic here?

https://github.com/LinuxCNC/linuxcnc/blob/2.8/src/emc/rs274ngc/interp_convert.cc#L3079-L3091

LinuxCNC / linuxcnc