donalm opened this issue 2 years ago
I gave this a try in my local environment. It's not exactly a vanilla setup -- it's `master` plus a bunch of other fixes I have in progress at the moment to get things working on my M1 Mac. But a test job with `sleep 1000` finished as expected.
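For reference, a sleep job like that can be submitted with pyoutline -- this is just a minimal sketch, assuming the standard pyoutline `Shell` module, with placeholder show/shot/user names:

```python
# Minimal sketch of submitting a long-running sleep frame via pyoutline.
# Show/shot/user values are placeholders -- adjust for your environment.
import outline
import outline.cuerun
from outline.modules.shell import Shell

ol = outline.Outline('sleep_test', shot='sh01', show='testing', user='testuser')
ol.add_layer(Shell('sleep_layer', command=['/bin/sleep', '1000']))
outline.cuerun.launch(ol, use_pycuerun=False)
```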
I just tried this again with the dockerized config on another machine and it failed before hitting 6 minutes, so I'm becoming more confident that it's a bug in 0.15.22 (and I suspect it's still in `master`).

Assuming that this has been fixed in your branch -- can you give me any guidance on when that might get merged, Brian? Or, if that branch is public, I could check it out myself.
Just noticed -- I see in that log you posted:

```
killMessage OpenCue could not verify this frame
```
That message means that Cuebot killed the frame because it thought the host shouldn't be working on that frame anymore.
So now the question is, why is Cuebot reassigning that frame?
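Roughly, the verification works like this (a conceptual sketch only -- the real Cuebot logic is Java, and the names below are made up for illustration): when a host reports its running frames, any frame that no longer has a matching proc record tying it to that host gets a kill request with that message.

```python
# Conceptual sketch of the verification idea, NOT the actual Cuebot code.
# All names here are illustrative.

def host_owns_frame(frame_id, host_id, procs):
    """True if Cuebot still has a proc record assigning this frame to this host."""
    return any(p["frame_id"] == frame_id and p["host_id"] == host_id for p in procs)

def handle_host_report(report, procs, send_kill):
    """Kill any reported frame Cuebot no longer believes belongs to this host."""
    for frame in report["running_frames"]:
        if not host_owns_frame(frame["frame_id"], report["host_id"], procs):
            send_kill(frame["frame_id"], "OpenCue could not verify this frame")
```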
There are some scheduler changes in `master` now that have come since that last release, like #1103. I'd be interested to know if Cuebot from `master` shows the same behavior.
@DiegoTavares Any thoughts here? Seen this before?
BTW I'm working to get a new release out ASAP. A few issues have unfortunately slowed this down -- #1192 in particular has been breaking the CI pipelines we use to test and release, and #1204 has also been slowing me down. Hoping to get these all resolved soon.
I noticed in earlier tests that when the rqd node is working on the frame, there is no record in the `host_local` table. Is that expected?
In the test scenario I had only one rqd node, and I'm speculating a bit here, but it looked like cuebot did not realise the frame was already being worked on. It tried to launch the frame on a node, and since there is only one active node, that was the same node the frame was actually already running on. That node returned an error, and cuebot asked it to kill the frame.
I believe I've seen this pattern, as well as the pattern you just outlined. I think I got confused because I was not always seeing the same messages in the rqd log when the frame failed. Sometimes it would receive a launch request which would provoke the error and result in a kill request, and sometimes it would seem to just receive the kill request for no obvious reason.
It sounds as though the 'no obvious reason' could have been prompted by rqd reporting back to cuebot, as you've outlined.
Thanks @bcipriano
I tried another `sleep 1000` job to check out the state of the database. Attaching a Cuebot log here for you to compare against.
While the job was running I had all of the appropriate rows in the `frame`, `host`, and `proc` tables. I did not have a record in the `host_local` table. I believe that table is only used for "local dispatch" jobs, which TBH I'm not too familiar with.
A few notes on that log file:

- The `failed to obtain information for proc running on frame` and `stopping frame` messages at the beginning appear to be because I retried a previous job, rather than launching a new one. I think they can be ignored.
- `starting frame` and `creating proc` messages when each frame starts.
- The `verified testing-sh01-cipriano_sleep1/0001-sleep on 172.17.0.3 by grace period` messages are because Cuebot skips the host verification I outlined above for new frames. This is why your frames aren't getting killed right away.
- `172.17.0.3 doesn't have enough idle cores, 0 needs 10` just means that the host doesn't have space for any more frames other than the running ones, also not an issue.
- `!!!! Inside getPROCS!!!!!` messages looked alarming, but it looks like those are just debugging messages that we missed during some merge. They don't indicate an actual error.

After the job finished my `proc` table was empty.
What is the content of your `proc` table after the job has been running for like a minute? I'm guessing that table is not being populated correctly?
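If it's useful, something like this will dump the `proc` rows while the job is running (a sketch assuming the default dockerized Postgres setup; the connection parameters are placeholders):

```python
# Dump the proc table mid-job to compare against what Cuebot thinks is running.
# Connection parameters are placeholders -- adjust for your deployment.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="cuebot", user="cuebot", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM proc")
    print([col.name for col in cur.description])  # column headers
    for row in cur.fetchall():
        print(row)
conn.close()
```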
Every frame that runs for longer than 5 minutes gets killed by `rqd` at 5-minutes-and-change.

I was trying to track this down in our (barely customized) fork of OpenCue, but I tested with a vanilla dockerized OpenCue and got the same result. Here's a log from that system for a frame running under `rqd`. The command is simply `/bin/sleep 1000`:

I speculated that `rqd` is killing frames that stop producing output on stdout/stderr for some time, so mostly I've been testing with a Python script that prints out progress tokens at 10-second intervals. It fails in the same way at roughly the same time.

I can add more information (rqd and cuebot log output, etc.) to the ticket, but first of all... can someone else please try to reproduce this behaviour on a standard vanilla setup?
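A minimal stand-in for that kind of test script (not the exact one used, just the shape of it) would be:

```python
#!/usr/bin/env python3
# Print a progress token every 10 seconds so the frame never goes quiet on
# stdout, then exit cleanly. Duration/interval values are arbitrary.
import time

TOTAL_SECONDS = 1000
INTERVAL = 10

for elapsed in range(0, TOTAL_SECONDS, INTERVAL):
    print(f"progress {elapsed}/{TOTAL_SECONDS}", flush=True)
    time.sleep(INTERVAL)

print("done", flush=True)
```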