donalm opened this issue 2 years ago
I gave this a try in my local environment. It's not exactly a vanilla setup -- it's `master` plus a bunch of other fixes I have in progress at the moment to get things working on my M1 Mac. But a test job with `sleep 1000` finished as expected.
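For reference, a sleep job like that can be submitted with pyoutline -- this is just a minimal sketch, assuming the standard pyoutline `Shell` module, with placeholder show/shot/user names:

```python
# Minimal sketch of submitting a long-running sleep frame via pyoutline.
# Show/shot/user values are placeholders -- adjust for your environment.
import outline
import outline.cuerun
from outline.modules.shell import Shell

ol = outline.Outline('sleep_test', shot='sh01', show='testing', user='testuser')
ol.add_layer(Shell('sleep_layer', command=['/bin/sleep', '1000']))
outline.cuerun.launch(ol, use_pycuerun=False)
```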
I just tried this again with the dockerized config on another machine and it failed before hitting 6 minutes, so I'm becoming more confident that it's a bug in 0.15.22 (and I suspect it's still in `master`).

Assuming that this has been fixed in your branch -- can you give me any guidance on when that might get merged, Brian? Or, if that branch is public, I could check it out myself.
Just noticed -- I see in that log you posted:

```
killMessage OpenCue could not verify this frame
```
That message means that Cuebot killed the frame because it thought the host shouldn't be working on that frame anymore.
So now the question is, why is Cuebot reassigning that frame?
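Roughly, the verification works like this (a conceptual sketch only -- the real Cuebot logic is Java, and the names below are made up for illustration): when a host reports its running frames, any frame that no longer has a matching proc record tying it to that host gets a kill request with that message.

```python
# Conceptual sketch of the verification idea, NOT the actual Cuebot code.
# All names here are illustrative.

def host_owns_frame(frame_id, host_id, procs):
    """True if Cuebot still has a proc record assigning this frame to this host."""
    return any(p["frame_id"] == frame_id and p["host_id"] == host_id for p in procs)

def handle_host_report(report, procs, send_kill):
    """Kill any reported frame Cuebot no longer believes belongs to this host."""
    for frame in report["running_frames"]:
        if not host_owns_frame(frame["frame_id"], report["host_id"], procs):
            send_kill(frame["frame_id"], "OpenCue could not verify this frame")
```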
There are some scheduler changes in `master` now that have come since that last release, like #1103. I'd be interested to know if Cuebot from `master` shows the same behavior.
@DiegoTavares Any thoughts here? Seen this before?
BTW I'm working to get a new release out ASAP. A few issues have unfortunately slowed this down -- #1192 in particular has been breaking the CI pipelines we use to test and release, and #1204 has also been slowing me down. Hoping to get these all resolved soon.
I noticed in earlier tests that when the rqd node is working on the frame, there is no record in the `host_local` table. Is that expected?
In the test scenario I had only one rqd node, and I'm speculating a bit here, but it looked like cuebot did not realise the frame was already being worked on. It tried to launch the frame on a node, and since there is only one active node, that was the same node the frame was actually already running on. That node returned an error, and cuebot asked it to kill the frame.
I believe I've seen this pattern, as well as the pattern you just outlined. I think I got confused because I was not always seeing the same messages in the rqd log when the frame failed. Sometimes it would receive a launch request which would provoke the error and result in a kill request, and sometimes it would seem to just receive the kill request for no obvious reason.
It sounds as though the 'no obvious reason' could have been prompted by rqd reporting back to cuebot, as you've outlined.
Thanks @bcipriano
I tried another `sleep 1000` job to check out the state of the database. Attaching a Cuebot log here for you to compare against.
While the job was running I had all of the appropriate rows in the `frame`, `host`, and `proc` tables. I did not have a record in the `host_local` table. I believe that table is only used for "local dispatch" jobs, which TBH I'm not too familiar with.
A few notes on that log file:

- The `failed to obtain information for proc running on frame` and `stopping frame` messages at the beginning appear to be because I retried a previous job, rather than launching a new one. I think they can be ignored.
- `starting frame` and `creating proc` messages when each frame starts.
- The `verified testing-sh01-cipriano_sleep1/0001-sleep on 172.17.0.3 by grace period` messages are because Cuebot skips the host verification I outlined above for new frames. This is why your frames aren't getting killed right away.
- `172.17.0.3 doesn't have enough idle cores, 0 needs 10` just means that the host doesn't have space for any more frames other than the running ones, also not an issue.
- `!!!! Inside getPROCS!!!!!` messages looked alarming, but it looks like those are just debugging messages that we missed during some merge. They don't indicate an actual error.

After the job finished my `proc` table was empty.
What is the content of your `proc` table after the job has been running for like a minute? I'm guessing that table is not being populated correctly?
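If it's useful, something like this will dump the `proc` rows while the job is running (a sketch assuming the default dockerized Postgres setup; the connection parameters are placeholders):

```python
# Dump the proc table mid-job to compare against what Cuebot thinks is running.
# Connection parameters are placeholders -- adjust for your deployment.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="cuebot", user="cuebot", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM proc")
    print([col.name for col in cur.description])  # column headers
    for row in cur.fetchall():
        print(row)
conn.close()
```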
Every frame that runs for longer than 5 minutes gets killed by `rqd` at 5-minutes-and-change.

I was trying to track this down in our (barely customized) fork of OpenCue, but I tested with a vanilla dockerized OpenCue and got the same result. Here's a log from that system for a frame running under `rqd`. The command is simply `/bin/sleep 1000`:

I speculated that `rqd` is killing frames that stop producing output on stdout/stderr for some time, so mostly I've been testing with a Python script that prints out progress tokens at 10-second intervals. It fails in the same way at roughly the same time.

I can add more information (rqd and cuebot log output, etc.) to the ticket, but first of all... can someone else please try to reproduce this behaviour on a standard vanilla setup?
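A minimal stand-in for that kind of test script (not the exact one used, just the shape of it) would be:

```python
#!/usr/bin/env python3
# Print a progress token every 10 seconds so the frame never goes quiet on
# stdout, then exit cleanly. Duration/interval values are arbitrary.
import time

TOTAL_SECONDS = 1000
INTERVAL = 10

for elapsed in range(0, TOTAL_SECONDS, INTERVAL):
    print(f"progress {elapsed}/{TOTAL_SECONDS}", flush=True)
    time.sleep(INTERVAL)

print("done", flush=True)
```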