Open snowgoer540 opened 1 year ago
Thanks for investigating this. I will look into this. Ini hal pins had a similar race condition, which luckily was an easy fix.
from seb on emc-developers:
When the `linuxcnc` script launches linuxcnc it starts Task, then loads motmod (as one of the commands in the [HAL]HALFILE list).
I believe this is racey, leading to at least some of the intermittent failures on the buildbot.
Task does this at startup:
1. Runs `usrmotInit()` which connects to the emcmotStruct shm, including emcmotCommand and emcmotStatus.
2. Runs `usrmotWriteEmcmotCommand()` to copy the first initialization command into emcmotCommand, then polls
emcmotStatus waiting for motmod to echo the first command's commandNum.
3. If the commandNum does not echo in emcmotStatus within a timeout, Task gives up on the command.
Motion/motmod does this at startup:
1. Runs `init_comm_buffers()` to connect to the emcmotStruct shm, including emcmotCommand and emcmotStatus. Calls `memset()` on emcmotStruct to initialize it to all-bytes-zero.
2. Start polling emcmotCommand for new commands (this happens in a hal thread via the `motion-command-handler` function).
The race happens when Task does its 1 and 2, then Motion does *its* 1, discarding Task's first command.
I think the recent move of iocontrol into Task changed system timing and made this race more likely to bite.
Any thoughts?
--
Sebastian Kuzminsky
Thanks Rene, at least I know you’re on the dev email group now :)
task used to be started in the background: https://github.com/LinuxCNC/linuxcnc/commit/9fbfb90db93cbbd0ff0fee4293185c0962672c41 I believe this was unintentional, as the & makes the -Wn parameter useless. This was also another race condition, as the halfiles were be parsed while pins are still being added. this leads to the error "emcMotionInit: emcTrajInit failed" as motion is started much later in the script, and task now has to wait. https://github.com/LinuxCNC/linuxcnc/blob/master/scripts/linuxcnc.in#L949
Its a complete coincidence that this ever worked. if you delay the UI startup, and do stat.poll() the content of stat.kinematics_type is 1, but the UI still behaves incorrectly, and is broken in a diffetent way. the dro is missing altogether, and the preview goes black if you zoom.
Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16
ah. I just checked out https://github.com/LinuxCNC/linuxcnc/commit/8e8fce81354f4890bd340e7159c0ec3e888e4aa3 and I get the same broken behaviour as described above. no material display, and broken zoom.
task used to be started in the background: 9fbfb90 I believe this was unintentional, as the & makes the -Wn parameter useless. This was also another race condition, as the halfiles were be parsed while pins are still being added. this leads to the error "emcMotionInit: emcTrajInit failed" as motion is started much later in the script, and task now has to wait. https://github.com/LinuxCNC/linuxcnc/blob/master/scripts/linuxcnc.in#L949
I guess that makes sense of the "error", but if it's not an error, and it's expected behavior at this point then why not remove it?
Its a complete coincidence that this ever worked. if you delay the UI startup, and do stat.poll() the content of stat.kinematics_type is 1, but the UI still behaves incorrectly, and is broken in a diffetent way. the dro is missing altogether, and the preview goes black if you zoom.
I've added the check that axis uses to the qtvcp script, and I dont see this behavior. Everything does seem to work as intended, near as I can tell. If you're saying you've made similar changes and it doesnt work on your end, then there is probably more going on.
Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16
I'm not sure I agree. Why cant the GUI startup be delayed until all of the necessary things that the GUI relies on for success are met? Why should every GUI created in the past and moving forward need to be aware of all this racing?
ah. I just checked out 8e8fce8 and I get the same broken behaviour as described above. no material display, and broken zoom.
This I definitely dont see. I can check that commit out and everything works as it used to.
Perhaps I'm really confused as to what you're saying but I still think the works should be this "fragile" and there must be a way to make it more robust. One thing waits on the other, etc. Even if the load time is 5 seconds longer, who cares if it's always robust and all the pieces are in place?
For "fun", I was playing with the suggested test-ui.py code. Not to derail this topic, but I think this line:
Should really be this:
or (status.state != linuxcnc.RCS_DONE) \
According to this: https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/docs/src/config/python-interface.adoc?plain=1#L349
This shows up a few different places, I wanted concurrence before I change it.
no matter what I try, I cannot get this screen: https://user-images.githubusercontent.com/26096779/274908725-9a4344ee-0fb1-4275-a7ac-dc7710693a2f.png
try adding a sleep 5 before the UI is started in linuxnc.in
For "fun", I was playing with the suggested test-ui.py code. Not to derail this topic, but I think this line:
Should really be this:
or (status.state != linuxcnc.RCS_DONE) \
According to this:
This shows up a few different places, I wanted concurrence before I change it.
yes, good catch, although it doesnt change the behaviour.
task used to be started in the background: 9fbfb90 I believe this was unintentional, as the & makes the -Wn parameter useless. This was also another race condition, as the halfiles were be parsed while pins are still being added. this leads to the error "emcMotionInit: emcTrajInit failed" as motion is started much later in the script, and task now has to wait. https://github.com/LinuxCNC/linuxcnc/blob/master/scripts/linuxcnc.in#L949
I guess that makes sense of the "error", but if it's not an error, and it's expected behavior at this point then why not remove it?
yes, the error should really only be displayed if it cant connect after a few attempts.
Its a complete coincidence that this ever worked. if you delay the UI startup, and do stat.poll() the content of stat.kinematics_type is 1, but the UI still behaves incorrectly, and is broken in a diffetent way. the dro is missing altogether, and the preview goes black if you zoom.
I've added the check that axis uses to the qtvcp script, and I dont see this behavior. Everything does seem to work as intended, near as I can tell. If you're saying you've made similar changes and it doesnt work on your end, then there is probably more going on.
see above, I cannot get the above screen with any linuxcnc version.
Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16
I'm not sure I agree. Why cant the GUI startup be delayed until all of the necessary things that the GUI relies on for success are met? Why should every GUI created in the past and moving forward need to be aware of all this racing?
yes, that is what I wanted to say. There needs to be some proper way of knowing if its ok to start the UI and task and motion is ready without checking a million things.
ah. I just checked out 8e8fce8 and I get the same broken behaviour as described above. no material display, and broken zoom.
This I definitely dont see. I can check that commit out and everything works as it used to.
Perhaps I'm really confused as to what you're saying but I still think the works should be this "fragile" and there must be a way to make it more robust. One thing waits on the other, etc. Even if the load time is 5 seconds longer, who cares if it's always robust and all the pieces are in place?
this is what it looks like on 2.9.1. this is what I also get on master if I delay the UI startup.
What if you apply this .diff to master? Then can you get to the screen?
Or are you saying no matter what you try with manipulation to the linuxcnc.in script?
If you want me to try adding the 5 second delay, could you point me to the line number to add it?
This is what I see on 2.9.1 (I'm running Buster at the moment):
Also, should note that this is without any changes applied.
neither with the patch on master, nor on 2.9.1 I can see the material view.
identityKinematicsSetup: coordinates:xyyz
Joint 0 ==> Axis X
Joint 1 ==> Axis Y
Joint 2 ==> Axis Y
Joint 3 ==> Axis Z
identityKinematicsSetup: Recommend: kinstype=both
[QTvcp.QTVCP.LIB.AUDIO_PLAYER][WARNING] audio alerts - Is python3-espeak installed? (sudo apt install python3-espeak) (audio_player.py:62)
USRMOT: ERROR: command 30 timeout
emcMotionInit: emcTrajInit failed
*** It took linuxcnc 1.30 seconds to become ready ***
No option 'Offset Feed Rate' in section: 'PLASMA_PARAMETERS'
[QTvcp.QTVCP.LIB.AUDIO_PLAYER][WARNING] audio alerts - Is python3-espeak installed? (sudo apt install python3-espeak) (audio_player.py:62)
*** It took 5.43 seconds for this QTVCP screen to load ***
I think then that you have an issue related to the preview itself. I am guessing you're on Bookworm / latest and greatest? Must be an issue with the opengl stuff.
on master without the patch I get the dro, with the patch nothing. adding sleep 5 after # 4.3.11. Run display in foreground in linuxcnc.in also gets me the black screen, and no dro. so same effect as your patch. there is more going on here, its not related to the taskio merge, as its also broken in 2.9.
no, the 3d preview itself is working. but instead of this: https://user-images.githubusercontent.com/26096779/274908725-9a4344ee-0fb1-4275-a7ac-dc7710693a2f.png I just get nothing.
see the screenshot. https://github.com/LinuxCNC/linuxcnc/issues/2700#issuecomment-1773892896
no, the 3d preview itself is working.
I'm not so sure; you're also missing the red border indicating the soft limits.
This is what I get on master with a 5 second sleep after 4.3.11:
Red border is there, I just moved it a bit. The crosshair is also there. Try more delay.
Sorry, I misunderstood where to put the sleep. I realize now you want it after the comment.
If I put a sleep 5 after L954, then I get this, which on my end is promising lol:
if I add a 30 second sleep, I still get this:
I really think the DRO being missing is a opengl issue for you. Can you test on a system that's not running the latest and greatest?
its not an opengl issue. everything else works. I get the same broken behaviour on 2 different machines. running ubuntu 22.04 and 23.10. broken on 2.9.1
Geez, this is embarrassing.
Go to the settings tab, and click "View Material" (I thought this was the default).
I'm sorry Rene, idk why that didnt occur to me first thing.
Now we can move on to being productive again :)
doesnt change a thing. trying this on 2.9.1.
I'm not sure, as it works here.
Does the DRO show up in Axis?
yes, in axis everything works.
Unsure, unless it's a dependency issue as I know qtvcp screens use qt5_graphics with glcanon
I believe you'd see an error if they weren't, but at a cursory look, the two it checks for are: python3-pyqt5.qtopengl python3-openGL
its 100% a race condition. it works on master without patches on a much much slower computer. you are only seeing it now because of the improved startup speed. I can never get it to work, even on 2.9 probably because my main development machine is fast.
Interesting, I wonder why the .diff I posted doesnt work for you and it does for me.
Similarly adding a sleep to the linuxcnc.in script works for me too.
this should really be fixed in 2.9 then. do you know where the code is that shows the material view?
Yes, qtplasmac patches this method: https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/src/emc/usr_intf/gremlin/qt5_graphics.py#L632
the qtplasmac handler does that here: https://github.com/LinuxCNC/linuxcnc/blob/cd1e96a74d5b172f41503a01e35fad3d2d9e2948/share/qtvcp/screens/qtplasmac/qtplasmac_handler.py#L658
But that's not really the root cause, at least not as far as what I see over here. When it doesnt work it's because it's showing the Joints DRO, and it shows the joints DRO because it didnt have enough information at the time the GUI/preview, etc. was populated.
That's why I'm shocked that it doesnt work for you with neither the sleep in linuxcnc.in, nor the diff I posted here. Both of those "fix" this for me.
Im guessing that it should somehow be the UIs responsibility to wait until everything else has started up correctly? the tests seem to be doing this: https://github.com/LinuxCNC/linuxcnc/blob/master/tests/tool-info/non-random/test-ui.py#L16
Wouldn't the linuxcnc script be the appropriate place to be checking this? The linuxcnc script is the startup sequencer after all. (I'd prefer it written in python of course)
Some more qtvcp complaints about this:
https://forum.linuxcnc.org/qtvcp/44747-error-in-tool-offsetview-py?start=220#283517
https://forum.linuxcnc.org/qtvcp/50477-qtdragon-hd-with-mdi-issue#283844
I just tried your latest commit, does not fix it for me.
Not sure, it fixes the issue for QtVCP for me, Chris M., and Phill. In addition it fixes issues Phill has been seeing with Axis.
It may not be the end-all-be-all solution, but at this point it seems to be the >95% solution, which will work/buy time until we get the rest of the way there.
It works for me on my Debian 12 VM. Previously all GUI's were failing quite often but randomly. This is trying to run the axis_mm sim immediately prior to the latest commit:
https://github.com/LinuxCNC/linuxcnc/commit/69207fffd5a3c6d178aa738a24db442c56d79a4a seems to cause the CI to timeout while running tests after 5 hours.
g0 test runs on forever...
means little to me but in the end looks like this:
Issuing EMC_TRAJ_SET_TERM_COND -- ( +222,+112, +0, +2,0.001000,) (time=1698893894.528025,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528041,pid=410370): rcs_sem_post(262204) called. NML_INTERP_LIST(0x55cdfd25dbe0)::get(): {size=176, type=EMC_TRAJ_SET_G5X}, list_size=2 emcTaskPlanLevel() returned 0 (time=1698893894.528064,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528077,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528091,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528103,pid=410370): rcs_sem_post(262204) called. Issuing EMC_TRAJ_SET_G5X -- ( +224,+176, +0, +1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,) (time=1698893894.528130,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528143,pid=410370): rcs_sem_post(262204) called. NML_INTERP_LIST(0x55cdfd25dbe0)::get(): {size=168, type=EMC_TRAJ_SET_G92}, list_size=1 emcTaskPlanLevel() returned 0 (time=1698893894.528164,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528177,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528191,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528204,pid=410370): rcs_sem_post(262204) called. Issuing EMC_TRAJ_SET_G92 -- ( +227,+168, +0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,) (time=1698893894.528228,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528241,pid=410370): rcs_sem_post(262204) called. NML_INTERP_LIST(0x55cdfd25dbe0)::get(): {size=104, type=EMC_TRAJ_SET_ROTATION}, list_size=0 emcTaskPlanLevel() returned 0 (time=1698893894.528263,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528275,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528289,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528302,pid=410370): rcs_sem_post(262204) called. Issuing EMC_TRAJ_SET_ROTATION -- ( +226,+104, +0,0.000000,) (time=1698893894.528323,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528336,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528350,pid=410370): rcs_sem_post(262205) called. (time=1698893894.528628,pid=410370): rcs_sem_post(262204) called. (time=1698893894.528644,pid=410370): rcs_sem_post(262205) called. (time=1698893894.529683,pid=410370): rcs_sem_post(262204) called. (time=1698893894.529700,pid=410370): rcs_sem_post(262205) called. (time=1698893894.530739,pid=410370): rcs_sem_post(262204) called. (time=1698893894.530761,pid=410370): rcs_sem_post(262205) called.
forever....
hmm well I guess it does end...
(time=1698894120.211988,pid=410370): rcs_sem_post(262204) called. (time=1698894120.212019,pid=410370): rcs_sem_post(262205) called. (time=1698894120.212331,pid=413809): rcs_sem_post(262205) called. (time=1698894120.212368,pid=413809): deleting NML (4) (time=1698894120.212384,pid=413809): delete (CMS ) 0x55f9014256f0; task: 0 cycles, min=0.000026, max=0.004505, avg=-1.000000, 0 latency excursions (> 10x expected cycle time of 0.001000s) (time=1698894120.212402,pid=413809): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0x7fd03fede000) (time=1698894120.212442,pid=413809): deleting CMS (emcError) (time=1698894120.212460,pid=413809): free( data = 0x55f901426390); (time=1698894120.212475,pid=413809): Leaving ~CMS() (time=1698894120.212489,pid=413809): CMS::delete(0x55f9014256f0) (time=1698894120.212505,pid=413809): CMS::delete successful. (time=1698894120.212522,pid=413809): Leaving ~NML() (time=1698894120.212536,pid=413809): NML::operator delete(0x55f9014254c0) (time=1698894120.212551,pid=413809): deleting NML (3) (time=1698894120.212566,pid=413809): delete (CMS ) 0x55f90141f4f0; (time=1698894120.212582,pid=413809): rcs_shm_close(shm->key=1002(0x3EA),shm->size=20480(0x5000),shm->addr=0x7fd03fee0000) (time=1698894120.212610,pid=413809): deleting CMS (emcStatus) (time=1698894120.212630,pid=413809): free( data = 0x55f901420200); (time=1698894120.212647,pid=413809): Leaving ~CMS() (time=1698894120.212664,pid=413809): CMS::delete(0x55f90141f4f0) (time=1698894120.212681,pid=413809): CMS::delete successful. (time=1698894120.212697,pid=413809): Leaving ~NML() (time=1698894120.212712,pid=413809): NML::operator delete(0x55f90141f320) (time=1698894120.212731,pid=413809): deleting NML (1) (time=1698894120.212747,pid=413809): delete (CMS ) 0x55f90141b300; (time=1698894120.212763,pid=413809): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0x7fd03fee5000) (time=1698894120.212794,pid=413809): deleting CMS (emcCommand) (time=1698894120.212814,pid=413809): free( data = 0x55f90141d0b0); (time=1698894120.212830,pid=413809): Leaving ~CMS() (time=1698894120.212846,pid=413809): CMS::delete(0x55f90141b300) (time=1698894120.212861,pid=413809): CMS::delete successful. (time=1698894120.212877,pid=413809): Leaving ~NML() (time=1698894120.212894,pid=413809): NML::operator delete(0x55f90141b090) Shutting down and cleaning up LinuxCNC... (time=1698894120.216126,pid=410370): deleting NML (3) (time=1698894120.216173,pid=410370): delete (CMS ) 0x55cdfe58f210; (time=1698894120.216203,pid=410370): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0x7fd188094000) (time=1698894120.216271,pid=410370): deleting CMS (emcError) (time=1698894120.216292,pid=410370): free( data = 0x55cdfe58ff30); (time=1698894120.216313,pid=410370): Leaving ~CMS() (time=1698894120.216332,pid=410370): CMS::delete(0x55cdfe58f210) (time=1698894120.216350,pid=410370): CMS::delete successful. (time=1698894120.216372,pid=410370): Leaving ~NML() (time=1698894120.216390,pid=410370): NML::operator delete(0x55cdfe583900) (time=1698894120.216412,pid=410370): deleting NML (2) (time=1698894120.216430,pid=410370): delete (CMS ) 0x55cdfe589570; (time=1698894120.216450,pid=410370): rcs_shm_close(shm->key=1002(0x3EA),shm->size=20480(0x5000),shm->addr=0x7fd188096000) (time=1698894120.216483,pid=410370): deleting CMS (emcStatus) (time=1698894120.216504,pid=410370): free( data = 0x55cdfe58a1a0); (time=1698894120.216522,pid=410370): Leaving ~CMS() (time=1698894120.216538,pid=410370): CMS::delete(0x55cdfe589570) (time=1698894120.216557,pid=410370): CMS::delete successful. (time=1698894120.216575,pid=410370): Leaving ~NML() (time=1698894120.216592,pid=410370): NML::operator delete(0x55cdfe583420) (time=1698894120.216612,pid=410370): deleting NML (1) (time=1698894120.216630,pid=410370): delete (CMS ) 0x55cdfe586930; (time=1698894120.216650,pid=410370): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0x7fd18809b000) (time=1698894120.216679,pid=410370): deleting CMS (emcCommand) (time=1698894120.216700,pid=410370): free( data = 0x55cdfe587560); (time=1698894120.216718,pid=410370): Leaving ~CMS() (time=1698894120.216734,pid=410370): CMS::delete(0x55cdfe586930) (time=1698894120.216753,pid=410370): CMS::delete successful. (time=1698894120.216770,pid=410370): Leaving ~NML() (time=1698894120.216788,pid=410370): NML::operator delete(0x55cdfe582e30) (time=1698894120.371538,pid=410359): Deleting 3 channels from the NML_Main_Channel_List. (time=1698894120.371600,pid=410359): Deleting emcCommand NML channel from NML_Main_Channel_List. (time=1698894120.371620,pid=410359): deleting NML (1) (time=1698894120.371640,pid=410359): delete (CMS ) 0x564a70049300; (time=1698894120.371668,pid=410359): rcs_shm_close(shm->key=1001(0x3E9),shm->size=8192(0x2000),shm->addr=0x7fb85ef8f000) (time=1698894120.371726,pid=410359): deleting CMS (emcCommand) (time=1698894120.371754,pid=410359): free( data = 0x564a7004b0b0); (time=1698894120.371779,pid=410359): Leaving ~CMS() (time=1698894120.371799,pid=410359): CMS::delete(0x564a70049300) (time=1698894120.371818,pid=410359): CMS::delete successful. (time=1698894120.371838,pid=410359): Leaving ~NML() (time=1698894120.371856,pid=410359): NML channel deleted from NML_Main_Channel_List (time=1698894120.371873,pid=410359): Deleting emcStatus NML channel from NML_Main_Channel_List. (time=1698894120.371892,pid=410359): deleting NML (2) (time=1698894120.371911,pid=410359): delete (CMS ) 0x564a70050520; (time=1698894120.371932,pid=410359): rcs_shm_close(shm->key=1002(0x3EA),shm->size=20480(0x5000),shm->addr=0x7fb85ef8a000) (time=1698894120.371977,pid=410359): deleting CMS (emcStatus) (time=1698894120.371998,pid=410359): free( data = 0x564a70051230); (time=1698894120.372017,pid=410359): Leaving ~CMS() (time=1698894120.372037,pid=410359): CMS::delete(0x564a70050520) (time=1698894120.372055,pid=410359): CMS::delete successful. (time=1698894120.372074,pid=410359): Leaving ~NML() (time=1698894120.372093,pid=410359): NML channel deleted from NML_Main_Channel_List (time=1698894120.372112,pid=410359): Deleting emcError NML channel from NML_Main_Channel_List. (time=1698894120.372130,pid=410359): deleting NML (3) (time=1698894120.372148,pid=410359): delete (CMS *) 0x564a70056910; (time=1698894120.372170,pid=410359): rcs_shm_close(shm->key=1003(0x3EB),shm->size=8192(0x2000),shm->addr=0x7fb85ef88000) (time=1698894120.372210,pid=410359): deleting CMS (emcError) (time=1698894120.372233,pid=410359): free( data = 0x564a70057630); (time=1698894120.372252,pid=410359): Leaving ~CMS() (time=1698894120.372269,pid=410359): CMS::delete(0x564a70056910) (time=1698894120.372306,pid=410359): CMS::delete successful. (time=1698894120.372325,pid=410359): Leaving ~NML() (time=1698894120.372342,pid=410359): NML channel deleted from NML_Main_Channel_List (time=1698894120.372367,pid=410359): deleting NML (1) (time=1698894120.372385,pid=410359): Leaving ~NML() (time=1698894120.372404,pid=410359): NML::operator delete(0x564a70049090) (time=1698894120.372424,pid=410359): NML channel deleted from Dynamically_Allocated_NML_Objects (time=1698894120.372444,pid=410359): deleting NML (2) (time=1698894120.372462,pid=410359): Leaving ~NML() (time=1698894120.372479,pid=410359): NML::operator delete(0x564a70050350) (time=1698894120.372498,pid=410359): NML channel deleted from Dynamically_Allocated_NML_Objects (time=1698894120.372516,pid=410359): deleting NML (3) (time=1698894120.372534,pid=410359): Leaving ~NML() (time=1698894120.372552,pid=410359): NML::operator delete(0x564a700566e0) (time=1698894120.372570,pid=410359): NML channel deleted from Dynamically_Allocated_NML_Objects
I don't see how commit 60297ff could have broken these tests, the while loop has a max delay of 10 seconds and then it moves on.
All of these messages look like what happens now when you set RCS_DEBUG = -1 in the [EMC] section of a GUI ini file.
Are you sure this commit isnt the culprit, which did recently change all of the RCS prints/default debug levels, etc?
https://github.com/LinuxCNC/linuxcnc/commit/292511db43a4d4a32b41d17e5ce4426dac199927
Since those debugs print to the terminal, I can imagine how things would move much slower.
I removed the delay code and the tests pass. Sometimes just printing text to the terminal can fail the test - though I didn't see that in the g0 case. I was hoping someone familar with the test could sged light. I also set the timeout to 5 and still got the failure - ran out of time to try more.
look at the CI output of any commit after 69207fffd5a3c6d178aa738a24db442c56d79a4a
Interestingly, this doesn't always pass with the commit before 69207ff checked out (390f635).
Sometimes it just hangs like this:
emcTaskPlanLevel() returned 0
Issuing EMC_TRAJ_SET_ROTATION -- ( +226,+104, +0,0.000000,)
finished capturing data
task: main loop took 0.013923 seconds
task: main loop took 0.016397 seconds
task: main loop took 0.016099 seconds
task: main loop took 0.010698 seconds
task: main loop took 0.013226 seconds
task: main loop took 0.014624 seconds
task: main loop took 0.014831 seconds
task: main loop took 0.012450 seconds
task: main loop took 0.011650 seconds
task: main loop took 0.015496 seconds
To me, this points to more racing, that is exacerbated by 69207ff.
Interestingly, this doesn't always pass with the commit before 69207ff checked out (390f635).
Sometimes it just hangs like this:
emcTaskPlanLevel() returned 0 Issuing EMC_TRAJ_SET_ROTATION -- ( +226,+104, +0,0.000000,) finished capturing data task: main loop took 0.013923 seconds task: main loop took 0.016397 seconds task: main loop took 0.016099 seconds task: main loop took 0.010698 seconds task: main loop took 0.013226 seconds task: main loop took 0.014624 seconds task: main loop took 0.014831 seconds task: main loop took 0.012450 seconds task: main loop took 0.011650 seconds task: main loop took 0.015496 seconds
To me, this points to more racing, that is exacerbated by 69207ff.
(closing was accidental).
If you have better ideas for fixing all of this racing, I'm interested in helping you test them.
I switched to just a sleep .01 and it breaks the g0 test.
I tried by testing if linuxcncrsh is the display and it seems OK here:
# wait for traj to process for up to 10s before screen loading due to race condition
if [ $EMCDISPLAY != linuxcncrsh ] ; then
RACE_TIMEOUT=$(( $SECONDS + 10 ))
chk=`halcmd getp ini.traj_max_velocity`
while (( $(echo "$chk == 0" | bc -l) )) ; do
if [ $SECONDS -ge $RACE_TIMEOUT ]; then
echo "ini.traj_max_velocity still 0.0 after $SECONDS seconds" | tee -a $PRINT_FILE $DEBUG_FILE
break
fi
chk=`halcmd getp ini.traj_max_velocity`
done
fi
ok for me sleep 0.1 or 0.01 work - I didn't realize that after a failed test it leaves the realtime system running.
I bet Phill's fix would work - though it seems to indicate to me the test is brittle to begin with.
I think brittle test is correct.
All the other tests using linuxcncrsh have the below code immediately after loading linuxcnc.
The g0 test is missing this code, if I add it in it works well here.
EDIT: It works without the linuxcncrsh test in my previous post.
# let linuxcnc come up
TOGO=80
while [ $TOGO -gt 0 ]; do
echo trying to connect to linuxcncrsh TOGO=$TOGO
if nc -z localhost 5007; then
break
fi
sleep 0.25
TOGO=$(($TOGO - 1))
done
if [ $TOGO -eq 0 ]; then
echo connection to linuxcncrsh timed out
exit 1
fi
I am creating this issue log at the request of Petter here: https://github.com/LinuxCNC/linuxcnc/commit/9d79c2aa23e6e6762dfc32b4ad7c4b7d219e9ea4#commitcomment-129891775
A race condition exists upon startup of LinuxCNC. It was made more noticeable by the recent taskio commits (7ebe7d9 / 9d79c2a), which I posit is due to the faster loading times. When this happens, the GUI/preview components are populated before the kinematics setup has been processed. For some screens, this precludes loading into "world view" (X Y Z DRO instead of 0 1 2 DRO), for others it causes crashes.
Here are the steps I follow to reproduce the issue:
This can be seen by starting a simple config of QtPlasmaC. The screen should load showing the material parameters (via the modified world view DRO), and instead loads with Joints DRO.
Correct:
Incorrect:
For other GUIs, such as QtDragon (not QtDragon HD), the GUI just crashes.
For Axis, there is a GUI-level work around, but it does not always work (more on this later)
I think the reason is because emcMotionInit is being called before the kinematics setup processing is completed. Below are two dumps from the terminal, one from a good run (commit 8e8fce8 checked out - this is the last commit where this worked) and the other a bad run from current master. Take note of the location of ">>>taskintf.cc called emcMotionInit<<<". I placed that print command at the beginning of the emcMotionTaskInit function directly above the line here:
https://github.com/LinuxCNC/linuxcnc/blob/c123203ab1d2da5ac9115892ca757635c6fd91a6/src/emc/task/taskintf.cc#L1675
good run (worth noting that self.no_joint_display is always True in this case):
bad run (worth noting that self.no_joint_display is always False in this case):
For the good run, notice that ">>>taskintf.cc called emcMotionInit<<<" happens after the kinematics are set up (the specific print is L319 of src/emc/kins_util.c), however for the bad run, ">>>taskintf.cc called emcMotionInit<<<" happens in the middle of the kinematics setup. It's interesting to note that the number of times emcMotionInit gets called seems to differ.
The last thing is that this happens much less frequently in Axis and the reason, it seems is that Axis specifically checks to see if the kinematics are set up before proceeding here:
https://github.com/LinuxCNC/linuxcnc/blob/c123203ab1d2da5ac9115892ca757635c6fd91a6/src/emc/usr_intf/axis/scripts/axis.py#L3560
Note that for some users that have been helping me test this, Axis does still sometimes fail after the prescribed number of loops expire. Similar checking could be done to QtVCP screens, and testing shows it does work. But this seems like a Band-Aid to me, and it does not eliminate the root cause.
It seems like the kinematics setup should be done before the screen building can be allowed to begin (or before emcMotionInit is called (or maybe this is part of the screen build). I am a little fuzzy as to the order of of load operations here and what need to wait for what. But it doesn't seem to be quite ironed out.
This is what I expected to happen:
GUIs should load after kinematics processing is completed to ensure joints and axes, etc. are all populated correctly as the GUI and the preview screen makes decisions based on these populations.
This is what happened instead:
It rarely completes the kinematic setup in time, so best case there are DRO type issues (with perhaps other unforeseen consequences), and worst case GUIs fail to load.
It worked properly before this:
Commit 8e8fce8 is the last commit where this worked much more reliably. It's important to note that this was still an issue, just much less likely to be encountered.
Information about my hardware and software:
lsb_release -a
):This has been tested Buster to Bookworm, all perform the same.
uname -a
):scripts/get-version-from-git
): master