Race condition between kinematics setup and emcMotionInit

snowgoer540 commented 1 year ago

I am creating this issue log at the request of Petter here: https://github.com/LinuxCNC/linuxcnc/commit/9d79c2aa23e6e6762dfc32b4ad7c4b7d219e9ea4#commitcomment-129891775

A race condition exists upon startup of LinuxCNC. It was made more noticeable by the recent taskio commits (7ebe7d9 / 9d79c2a), which I posit is due to the faster loading times. When this happens, the GUI/preview components are populated before the kinematics setup has been processed. For some screens, this precludes loading into "world view" (X Y Z DRO instead of 0 1 2 DRO), for others it causes crashes.

Here are the steps I follow to reproduce the issue:

This can be seen by starting a simple config of QtPlasmaC. The screen should load showing the material parameters (via the modified world view DRO), and instead loads with Joints DRO.

Correct:

Incorrect:

For other GUIs, such as QtDragon (not QtDragon HD), the GUI just crashes.

For Axis, there is a GUI-level work around, but it does not always work (more on this later)

I think the reason is because emcMotionInit is being called before the kinematics setup processing is completed. Below are two dumps from the terminal, one from a good run (commit 8e8fce8 checked out - this is the last commit where this worked) and the other a bad run from current master. Take note of the location of ">>>taskintf.cc called emcMotionInit<<<". I placed that print command at the beginning of the emcMotionTaskInit function directly above the line here:

https://github.com/LinuxCNC/linuxcnc/blob/c123203ab1d2da5ac9115892ca757635c6fd91a6/src/emc/task/taskintf.cc#L1675

good run (worth noting that self.no_joint_display is always True in this case):

((HEAD detached at 8e8fce8135)) mx:src $ ~/linuxcnc-dev/scripts/linuxcnc
LINUXCNC - 2.9.0~pre1
Machine configuration directory is '/home/plasma/linuxcnc-dev/configs/sim/qtplasmac'
Machine configuration file is 'qtplasmac_l_imperial.ini.expanded'
Starting LinuxCNC...
linuxcnc TPMOD=tpmod HOMEMOD=homemod EMCMOT=motmod
Note: Using POSIX non-realtime
twopass:invoked with <> options
twopass:found ./sim_no_stepgen.tcl
twopass:found /home/plasma/linuxcnc-dev/lib/hallib/qtplasmac_comp.hal
twopass:found ./custom.hal

identityKinematicsSetup: coordinates:xyyz
   Joint 0 ==> Axis X
   Joint 1 ==> Axis Y
   Joint 2 ==> Axis Y
   Joint 3 ==> Axis Z
identityKinematicsSetup: Recommend: kinstype=both

>>>taskintf.cc called emcMotionInit<<<
QTVCP MAIN FILE __init__
QtPlasmaC handler __init__
glcanon self.no_joint_display True: self.stat.kinematics_type 1 == linuxcnc.KINEMATICS_IDENTITY 1
glcanon self.no_joint_display True: self.stat.kinematics_type 1 == linuxcnc.KINEMATICS_IDENTITY 1
QTVCP MAIN FILE __init__

bad run (worth noting that self.no_joint_display is always False in this case):

(master) mx:src $ ~/linuxcnc-dev/scripts/linuxcnc
LINUXCNC - 2.10.0~pre0
Machine configuration directory is '/home/plasma/linuxcnc-dev/configs/sim/qtplasmac'
Machine configuration file is 'qtplasmac_l_imperial.ini.expanded'
Starting LinuxCNC...
linuxcnc TPMOD=tpmod HOMEMOD=homemod EMCMOT=motmod
Note: Using POSIX non-realtime
>>>taskintf.cc called emcMotionInit<<<
twopass:invoked with <> options
twopass:found ./sim_no_stepgen.tcl
twopass:found /home/plasma/linuxcnc-dev/lib/hallib/qtplasmac_comp.hal
twopass:found ./custom.hal

identityKinematicsSetup: coordinates:xyyz
   Joint 0 ==> Axis X
   Joint 1 ==> Axis Y
   Joint 2 ==> Axis Y
   Joint 3 ==> Axis Z
identityKinematicsSetup: Recommend: kinstype=both

QTVCP MAIN FILE __init__
QtPlasmaC handler __init__
glcanon self.no_joint_display False: self.stat.kinematics_type 0 == linuxcnc.KINEMATICS_IDENTITY 1
glcanon self.no_joint_display False: self.stat.kinematics_type 0 == linuxcnc.KINEMATICS_IDENTITY 1
USRMOT: ERROR: command 30 timeout
emcMotionInit: emcTrajInit failed
>>>taskintf.cc called emcMotionInit<<<
QTVCP MAIN FILE __init__

For the good run, notice that ">>>taskintf.cc called emcMotionInit<<<" happens after the kinematics are set up (the specific print is L319 of src/emc/kins_util.c), however for the bad run, ">>>taskintf.cc called emcMotionInit<<<" happens in the middle of the kinematics setup. It's interesting to note that the number of times emcMotionInit gets called seems to differ.

The last thing is that this happens much less frequently in Axis and the reason, it seems is that Axis specifically checks to see if the kinematics are set up before proceeding here:

https://github.com/LinuxCNC/linuxcnc/blob/c123203ab1d2da5ac9115892ca757635c6fd91a6/src/emc/usr_intf/axis/scripts/axis.py#L3560

Note that for some users that have been helping me test this, Axis does still sometimes fail after the prescribed number of loops expire. Similar checking could be done to QtVCP screens, and testing shows it does work. But this seems like a Band-Aid to me, and it does not eliminate the root cause.

It seems like the kinematics setup should be done before the screen building can be allowed to begin (or before emcMotionInit is called (or maybe this is part of the screen build). I am a little fuzzy as to the order of of load operations here and what need to wait for what. But it doesn't seem to be quite ironed out.

This is what I expected to happen:

GUIs should load after kinematics processing is completed to ensure joints and axes, etc. are all populated correctly as the GUI and the preview screen makes decisions based on these populations.

This is what happened instead:

It rarely completes the kinematic setup in time, so best case there are DRO type issues (with perhaps other unforeseen consequences), and worst case GUIs fail to load.

It worked properly before this:

Commit 8e8fce8 is the last commit where this worked much more reliably. It's important to note that this was still an issue, just much less likely to be encountered.

Information about my hardware and software:

I am using this Linux distribution and version (often, shown by lsb_release -a):

No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:    10
Codename:   buster

This has been tested Buster to Bookworm, all perform the same.

I am using this kernel version (shown by uname -a):

Linux mx 4.19.0-25-rt-amd64 #1 SMP PREEMPT RT Debian 4.19.289-2 (2023-08-08) x86_64 GNU/Linux

I am running ...
- [ ] A binary version from linuxcnc.org (including buildbot.linuxcnc.org)
- [ ] A binary I built myself
- [ ] A binary version from some other source besides linuxcnc.org
- [x] RIP from master
I am using this LinuxCNC version (shown in package manager or, for git versions, scripts/get-version-from-git): master
I am using this user interface (GUI) (e.g., AXIS, Touchy, gmoccapy, etc): AXIS, plasmac2, all QtVCP screens - all screens technically will see this issue at some point.
I am using this interface hardware vendor and chipset (e.g., parallel port, ethernet port, FPGA card): Issue is seen on both VM environments and actual hardware with Mesa board.

rene-dev commented 1 year ago

Thanks for investigating this. I will look into this. Ini hal pins had a similar race condition, which luckily was an easy fix.

rene-dev commented 1 year ago

from seb on emc-developers:

When the `linuxcnc` script launches linuxcnc it starts Task, then loads motmod (as one of the commands in the [HAL]HALFILE list).

I believe this is racey, leading to at least some of the intermittent failures on the buildbot.

Task does this at startup:

1. Runs `usrmotInit()` which connects to the emcmotStruct shm, including emcmotCommand and emcmotStatus.

2. Runs `usrmotWriteEmcmotCommand()` to copy the first initialization command into emcmotCommand, then polls
 emcmotStatus waiting for motmod to echo the first command's commandNum.

3. If the commandNum does not echo in emcmotStatus within a timeout, Task gives up on the command.

Motion/motmod does this at startup:

1. Runs `init_comm_buffers()` to connect to the emcmotStruct shm, including emcmotCommand and emcmotStatus.  Calls `memset()` on emcmotStruct to initialize it to all-bytes-zero.

2. Start polling emcmotCommand for new commands (this happens in a hal thread via the `motion-command-handler` function).

The race happens when Task does its 1 and 2, then Motion does *its* 1, discarding Task's first command.

I think the recent move of iocontrol into Task changed system timing and made this race more likely to bite.

Any thoughts?

-- 
Sebastian Kuzminsky

snowgoer540 commented 1 year ago

Thanks Rene, at least I know you’re on the dev email group now :)

rene-dev commented 1 year ago

task used to be started in the background: https://github.com/LinuxCNC/linuxcnc/commit/9fbfb90db93cbbd0ff0fee4293185c0962672c41 I believe this was unintentional, as the & makes the -Wn parameter useless. This was also another race condition, as the halfiles were be parsed while pins are still being added. this leads to the error "emcMotionInit: emcTrajInit failed" as motion is started much later in the script, and task now has to wait. https://github.com/LinuxCNC/linuxcnc/blob/master/scripts/linuxcnc.in#L949

Its a complete coincidence that this ever worked. if you delay the UI startup, and do stat.poll() the content of stat.kinematics_type is 1, but the UI still behaves incorrectly, and is broken in a diffetent way. the dro is missing altogether, and the preview goes black if you zoom.

Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16

rene-dev commented 1 year ago

ah. I just checked out https://github.com/LinuxCNC/linuxcnc/commit/8e8fce81354f4890bd340e7159c0ec3e888e4aa3 and I get the same broken behaviour as described above. no material display, and broken zoom.

snowgoer540 commented 1 year ago

task used to be started in the background: 9fbfb90 I believe this was unintentional, as the & makes the -Wn parameter useless. This was also another race condition, as the halfiles were be parsed while pins are still being added. this leads to the error "emcMotionInit: emcTrajInit failed" as motion is started much later in the script, and task now has to wait. https://github.com/LinuxCNC/linuxcnc/blob/master/scripts/linuxcnc.in#L949

I guess that makes sense of the "error", but if it's not an error, and it's expected behavior at this point then why not remove it?

Its a complete coincidence that this ever worked. if you delay the UI startup, and do stat.poll() the content of stat.kinematics_type is 1, but the UI still behaves incorrectly, and is broken in a diffetent way. the dro is missing altogether, and the preview goes black if you zoom.

I've added the check that axis uses to the qtvcp script, and I dont see this behavior. Everything does seem to work as intended, near as I can tell. If you're saying you've made similar changes and it doesnt work on your end, then there is probably more going on.

Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16

I'm not sure I agree. Why cant the GUI startup be delayed until all of the necessary things that the GUI relies on for success are met? Why should every GUI created in the past and moving forward need to be aware of all this racing?

ah. I just checked out 8e8fce8 and I get the same broken behaviour as described above. no material display, and broken zoom.

This I definitely dont see. I can check that commit out and everything works as it used to.

Perhaps I'm really confused as to what you're saying but I still think the works should be this "fragile" and there must be a way to make it more robust. One thing waits on the other, etc. Even if the load time is 5 seconds longer, who cares if it's always robust and all the pieces are in place?

snowgoer540 commented 1 year ago

For "fun", I was playing with the suggested test-ui.py code. Not to derail this topic, but I think this line:

https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/tests/tool-info/non-random/test-ui.py#L36

Should really be this:

or (status.state != linuxcnc.RCS_DONE) \

According to this: https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/docs/src/config/python-interface.adoc?plain=1#L349

This shows up a few different places, I wanted concurrence before I change it.

rene-dev commented 1 year ago

no matter what I try, I cannot get this screen: https://user-images.githubusercontent.com/26096779/274908725-9a4344ee-0fb1-4275-a7ac-dc7710693a2f.png

try adding a sleep 5 before the UI is started in linuxnc.in

rene-dev commented 1 year ago

For "fun", I was playing with the suggested test-ui.py code. Not to derail this topic, but I think this line:

https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/tests/tool-info/non-random/test-ui.py#L36

Should really be this:
or (status.state != linuxcnc.RCS_DONE) \
According to this:

https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/docs/src/config/python-interface.adoc?plain=1#L349

This shows up a few different places, I wanted concurrence before I change it.

yes, good catch, although it doesnt change the behaviour.

rene-dev commented 1 year ago

task used to be started in the background: 9fbfb90 I believe this was unintentional, as the & makes the -Wn parameter useless. This was also another race condition, as the halfiles were be parsed while pins are still being added. this leads to the error "emcMotionInit: emcTrajInit failed" as motion is started much later in the script, and task now has to wait. https://github.com/LinuxCNC/linuxcnc/blob/master/scripts/linuxcnc.in#L949

I guess that makes sense of the "error", but if it's not an error, and it's expected behavior at this point then why not remove it?

yes, the error should really only be displayed if it cant connect after a few attempts.

Its a complete coincidence that this ever worked. if you delay the UI startup, and do stat.poll() the content of stat.kinematics_type is 1, but the UI still behaves incorrectly, and is broken in a diffetent way. the dro is missing altogether, and the preview goes black if you zoom.

I've added the check that axis uses to the qtvcp script, and I dont see this behavior. Everything does seem to work as intended, near as I can tell. If you're saying you've made similar changes and it doesnt work on your end, then there is probably more going on.

see above, I cannot get the above screen with any linuxcnc version.

Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16

I'm not sure I agree. Why cant the GUI startup be delayed until all of the necessary things that the GUI relies on for success are met? Why should every GUI created in the past and moving forward need to be aware of all this racing?

yes, that is what I wanted to say. There needs to be some proper way of knowing if its ok to start the UI and task and motion is ready without checking a million things.

ah. I just checked out 8e8fce8 and I get the same broken behaviour as described above. no material display, and broken zoom.

This I definitely dont see. I can check that commit out and everything works as it used to.

Perhaps I'm really confused as to what you're saying but I still think the works should be this "fragile" and there must be a way to make it more robust. One thing waits on the other, etc. Even if the load time is 5 seconds longer, who cares if it's always robust and all the pieces are in place?

rene-dev commented 1 year ago

this is what it looks like on 2.9.1. this is what I also get on master if I delay the UI startup.

snowgoer540 commented 1 year ago

startup_delay_count_2.txt

What if you apply this .diff to master? Then can you get to the screen?

Or are you saying no matter what you try with manipulation to the linuxcnc.in script?

If you want me to try adding the 5 second delay, could you point me to the line number to add it?

snowgoer540 commented 1 year ago

This is what I see on 2.9.1 (I'm running Buster at the moment):

Also, should note that this is without any changes applied.

rene-dev commented 1 year ago

neither with the patch on master, nor on 2.9.1 I can see the material view.

identityKinematicsSetup: coordinates:xyyz
   Joint 0 ==> Axis X
   Joint 1 ==> Axis Y
   Joint 2 ==> Axis Y
   Joint 3 ==> Axis Z
identityKinematicsSetup: Recommend: kinstype=both

[QTvcp.QTVCP.LIB.AUDIO_PLAYER][WARNING]  audio alerts - Is python3-espeak installed? (sudo apt install python3-espeak) (audio_player.py:62)
USRMOT: ERROR: command 30 timeout
emcMotionInit: emcTrajInit failed

*** It took linuxcnc 1.30 seconds to become ready ***

No option 'Offset Feed Rate' in section: 'PLASMA_PARAMETERS'
[QTvcp.QTVCP.LIB.AUDIO_PLAYER][WARNING]  audio alerts - Is python3-espeak installed? (sudo apt install python3-espeak) (audio_player.py:62)

*** It took 5.43 seconds for this QTVCP screen to load ***

snowgoer540 commented 1 year ago

I think then that you have an issue related to the preview itself. I am guessing you're on Bookworm / latest and greatest? Must be an issue with the opengl stuff.

rene-dev commented 1 year ago

on master without the patch I get the dro, with the patch nothing. adding sleep 5 after # 4.3.11. Run display in foreground in linuxcnc.in also gets me the black screen, and no dro. so same effect as your patch. there is more going on here, its not related to the taskio merge, as its also broken in 2.9.

rene-dev commented 1 year ago

no, the 3d preview itself is working. but instead of this: https://user-images.githubusercontent.com/26096779/274908725-9a4344ee-0fb1-4275-a7ac-dc7710693a2f.png I just get nothing.

rene-dev commented 1 year ago

see the screenshot. https://github.com/LinuxCNC/linuxcnc/issues/2700#issuecomment-1773892896

snowgoer540 commented 1 year ago

no, the 3d preview itself is working.

I'm not so sure; you're also missing the red border indicating the soft limits.

snowgoer540 commented 1 year ago

This is what I get on master with a 5 second sleep after 4.3.11:

rene-dev commented 1 year ago

Red border is there, I just moved it a bit. The crosshair is also there. Try more delay.

snowgoer540 commented 1 year ago

Sorry, I misunderstood where to put the sleep. I realize now you want it after the comment.

If I put a sleep 5 after L954, then I get this, which on my end is promising lol:

if I add a 30 second sleep, I still get this:

I really think the DRO being missing is a opengl issue for you. Can you test on a system that's not running the latest and greatest?

rene-dev commented 1 year ago

its not an opengl issue. everything else works. I get the same broken behaviour on 2 different machines. running ubuntu 22.04 and 23.10. broken on 2.9.1

snowgoer540 commented 1 year ago

Geez, this is embarrassing.

Go to the settings tab, and click "View Material" (I thought this was the default).

I'm sorry Rene, idk why that didnt occur to me first thing.

Now we can move on to being productive again :)

rene-dev commented 1 year ago

doesnt change a thing. trying this on 2.9.1.

snowgoer540 commented 1 year ago

I'm not sure, as it works here.

Does the DRO show up in Axis?

rene-dev commented 1 year ago

yes, in axis everything works.

snowgoer540 commented 1 year ago

Unsure, unless it's a dependency issue as I know qtvcp screens use qt5_graphics with glcanon

I believe you'd see an error if they weren't, but at a cursory look, the two it checks for are: python3-pyqt5.qtopengl python3-openGL

rene-dev commented 1 year ago

its 100% a race condition. it works on master without patches on a much much slower computer. you are only seeing it now because of the improved startup speed. I can never get it to work, even on 2.9 probably because my main development machine is fast.

snowgoer540 commented 1 year ago

Interesting, I wonder why the .diff I posted doesnt work for you and it does for me.

Similarly adding a sleep to the linuxcnc.in script works for me too.

rene-dev commented 1 year ago

this should really be fixed in 2.9 then. do you know where the code is that shows the material view?

snowgoer540 commented 1 year ago

Yes, qtplasmac patches this method: https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/src/emc/usr_intf/gremlin/qt5_graphics.py#L632

the qtplasmac handler does that here: https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/share/qtvcp/screens/qtplasmac/qtplasmac_handler.py#L658

But that's not really the root cause, at least not as far as what I see over here. When it doesnt work it's because it's showing the Joints DRO, and it shows the joints DRO because it didnt have enough information at the time the GUI/preview, etc. was populated.

That's why I'm shocked that it doesnt work for you with neither the sleep in linuxcnc.in, nor the diff I posted here. Both of those "fix" this for me.

c-morley commented 1 year ago

Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16

Wouldn't the linuxcnc script be the appropriate place to be checking this? The linuxcnc script is the startup sequencer after all. (I'd prefer it written in python of course)

snowgoer540 commented 1 year ago

Some more qtvcp complaints about this:

https://forum.linuxcnc.org/qtvcp/44747-error-in-tool-offsetview-py?start=220#283517

https://forum.linuxcnc.org/qtvcp/50477-qtdragon-hd-with-mdi-issue#283844

rene-dev commented 1 year ago

I just tried your latest commit, does not fix it for me.

snowgoer540 commented 1 year ago

Not sure, it fixes the issue for QtVCP for me, Chris M., and Phill. In addition it fixes issues Phill has been seeing with Axis.

It may not be the end-all-be-all solution, but at this point it seems to be the >95% solution, which will work/buy time until we get the rest of the way there.

phillc54 commented 1 year ago

It works for me on my Debian 12 VM. Previously all GUI's were failing quite often but randomly. This is trying to run the axis_mm sim immediately prior to the latest commit:

rene-dev commented 1 year ago

https://github.com/LinuxCNC/linuxcnc/commit/69207fffd5a3c6d178aa738a24db442c56d79a4a seems to cause the CI to timeout while running tests after 5 hours.

c-morley commented 1 year ago

g0 test runs on forever...

means little to me but in the end looks like this:

Issuing EMC_TRAJ_SET_TERM_COND -- ( +222,+112, +0, +2,0.001000,) (time=1698893894.528025,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528041,pid=410370): rcs_sem_post(262204) called. NML_INTERP_LIST(0x55cdfd25dbe0)::get(): {size=176, type=EMC_TRAJ_SET_G5X}, list_size=2 emcTaskPlanLevel() returned 0 (time=1698893894.528064,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528077,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528091,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528103,pid=410370): rcs_sem_post(262204) called. Issuing EMC_TRAJ_SET_G5X -- ( +224,+176, +0, +1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,) (time=1698893894.528130,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528143,pid=410370): rcs_sem_post(262204) called. NML_INTERP_LIST(0x55cdfd25dbe0)::get(): {size=168, type=EMC_TRAJ_SET_G92}, list_size=1 emcTaskPlanLevel() returned 0 (time=1698893894.528164,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528177,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528191,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528204,pid=410370): rcs_sem_post(262204) called. Issuing EMC_TRAJ_SET_G92 -- ( +227,+168, +0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,) (time=1698893894.528228,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528241,pid=410370): rcs_sem_post(262204) called. NML_INTERP_LIST(0x55cdfd25dbe0)::get(): {size=104, type=EMC_TRAJ_SET_ROTATION}, list_size=0 emcTaskPlanLevel() returned 0 (time=1698893894.528263,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528275,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528289,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528302,pid=410370): rcs_sem_post(262204) called. Issuing EMC_TRAJ_SET_ROTATION -- ( +226,+104, +0,0.000000,) (time=1698893894.528323,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528336,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528350,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528628,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528644,pid=410370): rcs_sem_post(262205) called. (time=1698893894.529683,pid=410370): rcs_sem_post(262204) called. (time=1698893894.529700,pid=410370): rcs_sem_post(262205) called. (time=1698893894.530739,pid=410370): rcs_sem_post(262204) called. (time=1698893894.530761,pid=410370): rcs_sem_post(262205) called.

forever....

c-morley commented 1 year ago

hmm well I guess it does end...

(time=1698894120.211988,pid=410370): rcs_sem_post(262204) called. (time=1698894120.212019,pid=410370): rcs_sem_post(262205) called. (time=1698894120.212331,pid=413809): rcs_sem_post(262205) called. (time=1698894120.212368,pid=413809): deleting NML (4) (time=1698894120.212384,pid=413809): delete (CMS ) 0x55f9014256f0; task: 0 cycles, min=0.000026, max=0.004505, avg=-1.000000, 0 latency excursions (> 10x expected cycle time of 0.001000s) (time=1698894120.212402,pid=413809): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0x7fd03fede000) (time=1698894120.212442,pid=413809): deleting CMS (emcError) (time=1698894120.212460,pid=413809): free( data = 0x55f901426390); (time=1698894120.212475,pid=413809): Leaving ~CMS() (time=1698894120.212489,pid=413809): CMS::delete(0x55f9014256f0) (time=1698894120.212505,pid=413809): CMS::delete successful. (time=1698894120.212522,pid=413809): Leaving ~NML() (time=1698894120.212536,pid=413809): NML::operator delete(0x55f9014254c0) (time=1698894120.212551,pid=413809): deleting NML (3) (time=1698894120.212566,pid=413809): delete (CMS ) 0x55f90141f4f0; (time=1698894120.212582,pid=413809): rcs_shm_close(shm->key=1002(0x3EA),shm->size=20480(0x5000),shm->addr=0x7fd03fee0000) (time=1698894120.212610,pid=413809): deleting CMS (emcStatus) (time=1698894120.212630,pid=413809): free( data = 0x55f901420200); (time=1698894120.212647,pid=413809): Leaving ~CMS() (time=1698894120.212664,pid=413809): CMS::delete(0x55f90141f4f0) (time=1698894120.212681,pid=413809): CMS::delete successful. (time=1698894120.212697,pid=413809): Leaving ~NML() (time=1698894120.212712,pid=413809): NML::operator delete(0x55f90141f320) (time=1698894120.212731,pid=413809): deleting NML (1) (time=1698894120.212747,pid=413809): delete (CMS ) 0x55f90141b300; (time=1698894120.212763,pid=413809): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0x7fd03fee5000) (time=1698894120.212794,pid=413809): deleting CMS (emcCommand) (time=1698894120.212814,pid=413809): free( data = 0x55f90141d0b0); (time=1698894120.212830,pid=413809): Leaving ~CMS() (time=1698894120.212846,pid=413809): CMS::delete(0x55f90141b300) (time=1698894120.212861,pid=413809): CMS::delete successful. (time=1698894120.212877,pid=413809): Leaving ~NML() (time=1698894120.212894,pid=413809): NML::operator delete(0x55f90141b090) Shutting down and cleaning up LinuxCNC... (time=1698894120.216126,pid=410370): deleting NML (3) (time=1698894120.216173,pid=410370): delete (CMS ) 0x55cdfe58f210; (time=1698894120.216203,pid=410370): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0x7fd188094000) (time=1698894120.216271,pid=410370): deleting CMS (emcError) (time=1698894120.216292,pid=410370): free( data = 0x55cdfe58ff30); (time=1698894120.216313,pid=410370): Leaving ~CMS() (time=1698894120.216332,pid=410370): CMS::delete(0x55cdfe58f210) (time=1698894120.216350,pid=410370): CMS::delete successful. (time=1698894120.216372,pid=410370): Leaving ~NML() (time=1698894120.216390,pid=410370): NML::operator delete(0x55cdfe583900) (time=1698894120.216412,pid=410370): deleting NML (2) (time=1698894120.216430,pid=410370): delete (CMS ) 0x55cdfe589570; (time=1698894120.216450,pid=410370): rcs_shm_close(shm->key=1002(0x3EA),shm->size=20480(0x5000),shm->addr=0x7fd188096000) (time=1698894120.216483,pid=410370): deleting CMS (emcStatus) (time=1698894120.216504,pid=410370): free( data = 0x55cdfe58a1a0); (time=1698894120.216522,pid=410370): Leaving ~CMS() (time=1698894120.216538,pid=410370): CMS::delete(0x55cdfe589570) (time=1698894120.216557,pid=410370): CMS::delete successful. (time=1698894120.216575,pid=410370): Leaving ~NML() (time=1698894120.216592,pid=410370): NML::operator delete(0x55cdfe583420) (time=1698894120.216612,pid=410370): deleting NML (1) (time=1698894120.216630,pid=410370): delete (CMS ) 0x55cdfe586930; (time=1698894120.216650,pid=410370): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0x7fd18809b000) (time=1698894120.216679,pid=410370): deleting CMS (emcCommand) (time=1698894120.216700,pid=410370): free( data = 0x55cdfe587560); (time=1698894120.216718,pid=410370): Leaving ~CMS() (time=1698894120.216734,pid=410370): CMS::delete(0x55cdfe586930) (time=1698894120.216753,pid=410370): CMS::delete successful. (time=1698894120.216770,pid=410370): Leaving ~NML() (time=1698894120.216788,pid=410370): NML::operator delete(0x55cdfe582e30) (time=1698894120.371538,pid=410359): Deleting 3 channels from the NML_Main_Channel_List. (time=1698894120.371600,pid=410359): Deleting emcCommand NML channel from NML_Main_Channel_List. (time=1698894120.371620,pid=410359): deleting NML (1) (time=1698894120.371640,pid=410359): delete (CMS ) 0x564a70049300; (time=1698894120.371668,pid=410359): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0x7fb85ef8f000) (time=1698894120.371726,pid=410359): deleting CMS (emcCommand) (time=1698894120.371754,pid=410359): free( data = 0x564a7004b0b0); (time=1698894120.371779,pid=410359): Leaving ~CMS() (time=1698894120.371799,pid=410359): CMS::delete(0x564a70049300) (time=1698894120.371818,pid=410359): CMS::delete successful. (time=1698894120.371838,pid=410359): Leaving ~NML() (time=1698894120.371856,pid=410359): NML channel deleted from NML_Main_Channel_List (time=1698894120.371873,pid=410359): Deleting emcStatus NML channel from NML_Main_Channel_List. (time=1698894120.371892,pid=410359): deleting NML (2) (time=1698894120.371911,pid=410359): delete (CMS ) 0x564a70050520; (time=1698894120.371932,pid=410359): rcs_shm_close(shm->key=1002(0x3EA),shm->size=20480(0x5000),shm->addr=0x7fb85ef8a000) (time=1698894120.371977,pid=410359): deleting CMS (emcStatus) (time=1698894120.371998,pid=410359): free( data = 0x564a70051230); (time=1698894120.372017,pid=410359): Leaving ~CMS() (time=1698894120.372037,pid=410359): CMS::delete(0x564a70050520) (time=1698894120.372055,pid=410359): CMS::delete successful. (time=1698894120.372074,pid=410359): Leaving ~NML() (time=1698894120.372093,pid=410359): NML channel deleted from NML_Main_Channel_List (time=1698894120.372112,pid=410359): Deleting emcError NML channel from NML_Main_Channel_List. (time=1698894120.372130,pid=410359): deleting NML (3) (time=1698894120.372148,pid=410359): delete (CMS *) 0x564a70056910; (time=1698894120.372170,pid=410359): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0x7fb85ef88000) (time=1698894120.372210,pid=410359): deleting CMS (emcError) (time=1698894120.372233,pid=410359): free( data = 0x564a70057630); (time=1698894120.372252,pid=410359): Leaving ~CMS() (time=1698894120.372269,pid=410359): CMS::delete(0x564a70056910) (time=1698894120.372306,pid=410359): CMS::delete successful. (time=1698894120.372325,pid=410359): Leaving ~NML() (time=1698894120.372342,pid=410359): NML channel deleted from NML_Main_Channel_List (time=1698894120.372367,pid=410359): deleting NML (1) (time=1698894120.372385,pid=410359): Leaving ~NML() (time=1698894120.372404,pid=410359): NML::operator delete(0x564a70049090) (time=1698894120.372424,pid=410359): NML channel deleted from Dynamically_Allocated_NML_Objects (time=1698894120.372444,pid=410359): deleting NML (2) (time=1698894120.372462,pid=410359): Leaving ~NML() (time=1698894120.372479,pid=410359): NML::operator delete(0x564a70050350) (time=1698894120.372498,pid=410359): NML channel deleted from Dynamically_Allocated_NML_Objects (time=1698894120.372516,pid=410359): deleting NML (3) (time=1698894120.372534,pid=410359): Leaving ~NML() (time=1698894120.372552,pid=410359): NML::operator delete(0x564a700566e0) (time=1698894120.372570,pid=410359): NML channel deleted from Dynamically_Allocated_NML_Objects

snowgoer540 commented 1 year ago

I don't see how commit 60297ff could have broken these tests, the while loop has a max delay of 10 seconds and then it moves on.

All of these messages look like what happens now when you set RCS_DEBUG = -1 in the [EMC] section of a GUI ini file.

Are you sure this commit isnt the culprit, which did recently change all of the RCS prints/default debug levels, etc?

https://github.com/LinuxCNC/linuxcnc/commit/292511db43a4d4a32b41d17e5ce4426dac199927

Since those debugs print to the terminal, I can imagine how things would move much slower.

c-morley commented 1 year ago

I removed the delay code and the tests pass. Sometimes just printing text to the terminal can fail the test - though I didn't see that in the g0 case. I was hoping someone familar with the test could sged light. I also set the timeout to 5 and still got the failure - ran out of time to try more.

rene-dev commented 1 year ago

look at the CI output of any commit after 69207fffd5a3c6d178aa738a24db442c56d79a4a

snowgoer540 commented 1 year ago

Interestingly, this doesn't always pass with the commit before 69207ff checked out (390f635).

Sometimes it just hangs like this:

emcTaskPlanLevel() returned 0
Issuing EMC_TRAJ_SET_ROTATION --     (  +226,+104,    +0,0.000000,)
finished capturing data
task: main loop took 0.013923 seconds
task: main loop took 0.016397 seconds
task: main loop took 0.016099 seconds
task: main loop took 0.010698 seconds
task: main loop took 0.013226 seconds
task: main loop took 0.014624 seconds
task: main loop took 0.014831 seconds
task: main loop took 0.012450 seconds
task: main loop took 0.011650 seconds
task: main loop took 0.015496 seconds

To me, this points to more racing, that is exacerbated by 69207ff.

snowgoer540 commented 1 year ago

Interestingly, this doesn't always pass with the commit before 69207ff checked out (390f635).

Sometimes it just hangs like this:

emcTaskPlanLevel() returned 0
Issuing EMC_TRAJ_SET_ROTATION --   (  +226,+104,    +0,0.000000,)
finished capturing data
task: main loop took 0.013923 seconds
task: main loop took 0.016397 seconds
task: main loop took 0.016099 seconds
task: main loop took 0.010698 seconds
task: main loop took 0.013226 seconds
task: main loop took 0.014624 seconds
task: main loop took 0.014831 seconds
task: main loop took 0.012450 seconds
task: main loop took 0.011650 seconds
task: main loop took 0.015496 seconds

To me, this points to more racing, that is exacerbated by 69207ff.

snowgoer540 commented 1 year ago

(closing was accidental).

If you have better ideas for fixing all of this racing, I'm interested in helping you test them.

c-morley commented 1 year ago

I switched to just a sleep .01 and it breaks the g0 test.

phillc54 commented 1 year ago

I tried by testing if linuxcncrsh is the display and it seems OK here:


# wait for traj to process for up to 10s before screen loading due to race condition
if [ $EMCDISPLAY != linuxcncrsh ] ; then
    RACE_TIMEOUT=$(( $SECONDS + 10 ))
    chk=`halcmd getp ini.traj_max_velocity`
    while (( $(echo "$chk == 0" | bc -l) )) ; do
        if [ $SECONDS -ge $RACE_TIMEOUT ]; then
            echo "ini.traj_max_velocity still 0.0 after $SECONDS seconds" | tee -a $PRINT_FILE $DEBUG_FILE
            break
        fi
        chk=`halcmd getp ini.traj_max_velocity`
    done
fi

c-morley commented 1 year ago

ok for me sleep 0.1 or 0.01 work - I didn't realize that after a failed test it leaves the realtime system running.

I bet Phill's fix would work - though it seems to indicate to me the test is brittle to begin with.

phillc54 commented 1 year ago

I think brittle test is correct.

All the other tests using linuxcncrsh have the below code immediately after loading linuxcnc.

The g0 test is missing this code, if I add it in it works well here.

EDIT: It works without the linuxcncrsh test in my previous post.


# let linuxcnc come up
TOGO=80
while [  $TOGO -gt 0 ]; do
    echo trying to connect to linuxcncrsh TOGO=$TOGO
    if nc -z localhost 5007; then
        break
    fi
    sleep 0.25
    TOGO=$(($TOGO - 1))
done
if [  $TOGO -eq 0 ]; then
    echo connection to linuxcncrsh timed out
    exit 1
fi

LinuxCNC / linuxcnc