jogging issues with UGS and some possible fixes

treib commented 3 years ago

Hello everyone -

On the latest versions of Grbl_Esp32 (Ver 1.3a Date 20201212) and UGS (2.0.7), I'm seeing a jogging issue where releasing from a continuous jog sometimes results in the jog continuing and hitting the endstop. Anyone else see this behavior? Mainly it happens when the jog feed rate in UGS is set much higher than the axis max velocity.

I’m new to the codebase, but I looked into it and made a couple changes that fixed it for me. I am not comfortable the changes I made are the best solution, though, so rather than doing a pull request I wanted to explain what I found in case it would be helpful to anyone.

You are welcome to use parts or all of it, or I can refine it if you have suggestions on how it might better fit into the project.

My changes are in the fork: https://github.com/treib/Grbl_Esp32

I believe I’ve tracked it down to a couple things:

Race condition setting sys_rt_exec_state

A race condition between the Serial.cpp task changing the sys_rt_exec_state global and the main program. I suspect the main hit point is in protocol_auto_cycle_start() where it sets the cycleStart bit, but likely any place where a bit is being set in the main program is risky since I suspect the underlying bitwise OR and store is non-atomic.

This is a rare hit, maybe 1% failure rate, but I observed it in the real world. Strategically placed debug strings confirmed that the command was received, but the bit ultimately was lost.

I tested a fix where Serial.cpp stores a task-local version of sys_rt_exec_state, and it is fetched at the start of protocol_exec_rt_system() using a getter function protected with vTaskEnterCritical(). This works, but could require modification of custom modules to call this explicit getter rather than assume these values will change on their own. (I’m thinking of modules like CoreXY.cpp).

Cancel/hold misses move that is in-flight in mc_line()

Another unrelated cause of the jogging problem appears to be more of a grbl issue: When a jogCancel or feedHold arrives, the planner is emptied and gcode parser state resets to current position, which is good. But a next jog command could already be in-flight at this point, waiting in mc_line() as the machine decelerates, and with target position determined based on gcode that has been dropped due to the hold/cancel.

UGS causes this easily by sending a continuing series of incremental jog commands while holding down a jog direction button. These accumulate the target position to a large number, and since a jogCancel or feedHold during State::Jog results in going back to State::Idle, that stuck in-flight move gets added to the planner with the large target value intact. The machine then moves and hits the endstop. This is much easier to reproduce (happens maybe 30% of the time for me), and is especially obvious and problematic when UGS is set to a jog feedrate much larger than the axis can support (since that in-flight move gets a much larger and easily out-of-bounds target position).

My simple fix for this is to create a global variable that tracks an in-flight command in mc_line(). In protocol_exec_rt_suspend() I look for sys.suspend.bit.jogCancel being true, and if the in-flight command is a jog, I set a flag that mc_line() will notice and not add it to the planner. I added an is_jog flag to plan_line_data_t to track whether it’s a jog, and a return bool from mc_line() to signal it dropped the command so that jog_execute() can signal back to gc_execute_line() that it was dropped so it can skip the updating of gc_state.position.

I did not touch the kinematic path to mc_line(), so I am not sure if those could still have issues.

My main concern is this will only catch these in-flight moves if protocol_execute_realtime() is called when the move has entered mc_line(). Perhaps some scenarios or custom code could call protocol_execute_realtime() after a gcode line was parsed but before getting to mc_line(). But I guess in that case it is no different than how it behaves without the fix.

UGS is a little unreliable too

I should mention that I’ve also seen UGS not send the feedHold when releasing a jog button. My solution to this problem is to stop using UGS. :)

Anyway, I hope some of this will be helpful. Thanks for all your hard work on this great project!

Phil

bdring commented 3 years ago

Thanks, can you make a P.R.? it will be much easier to evaluate in that form.

MitchBradley commented 3 years ago

I think it will help to test and review if we break these down in to separate PRs. Let's start with the race condition first, because I have recently done a deep dive into protocol_exec_rt_system, so it is reasonably fresh in my mind.

You may have noticed that the variable cycle_stop is separate from the bitfields in sys_rt_exec_state. That was not always the case. We had a race problem with .bit.CycleStop . We fixed it by making cycle_stop into a separate boolean, which can be set atomically with a single byte store, not requiring a read/or/write sequence. A similar change is likely to work for .bit.CycleStart => cycle_start . The one disadvantage of such a change is that you can no longer test all the bits in one go with rt_exec_state.value != 0; instead you would have to say if (rt_exec_state.value != 0 || cycle_stop || cycle_stop) { . That is probably not a problem, considering the advantages you get by being able to set and clear the flag with a single store compared to the explicitly-locked read/modify/write sequences.

treib commented 3 years ago

Hi Mitch, thanks - I didn't know the history of cycle_stop. That's helpful to know.

I was just just about to submit a PR for the getter method I used, but I'm thinking that breaking out into separate bools might be the better approach. It involves touching a lot more code, but what concerns me is whether there might ever be a need for a separate task (not Serial.cpp or the main program) to monitor or change those flags. It doesn't seem any existing code falls into that scenario (CoreXY.cpp uses the flags in its own loop for homing, but is called on the main program). Still, it seems like it could maybe happen down the road.

I'll put together a separate PR for breaking out into separate bools and we can look at whether that's the best approach.

MitchBradley commented 3 years ago

Ideally the flags are only tested/cleared in one place, namely protocol_exec_rt_system. Some of them are set from several places - 6 places for cycle_start . I don't think that setting a flag that is already set is a problem.

The "only one test/clear place" rule is violated by the homing code in Limits.cpp and the copy of that code in CoreXY.cpp. That is troublesome - and indeed it stalled my attempt to convert protocol_exec_rt_system into a proper event-queue-driven state machine. But it doesn't involve cycle_start.

treib commented 3 years ago

Thanks! - I have broken these into two pull requests: #716 and #718

terjeio commented 3 years ago

My fix was to flush the serial buffer and the line input buffer on receipt of a jog cancel command. Code can be found in this patch.

bdring / Grbl_Esp32