AcademySoftwareFoundation / OpenCue

A render management system you can deploy for visual effects and animation productions.
https://www.opencue.io
Apache License 2.0

[cuebot] Jobs without `os` set, will not dispatch #1591

Open lithorus opened 3 days ago

lithorus commented 3 days ago

Describe the bug: If the `os` parameter is not set, cuebot will not dispatch frames from the job.

Setting the str_os field in the database to a non-null value makes it dispatch frames to rqd.

DiegoTavares commented 2 days ago

Fixed by https://github.com/AcademySoftwareFoundation/OpenCue/pull/1590

lithorus commented 1 day ago

I'll have to disagree: this is not fixed by #1590. I already tested with that fix in place.

lithorus commented 1 day ago

This is what I get when it tries to dispatch a job:

2024-11-20 22:11:47.945  INFO 16748 --- [pool-1-thread-1] c.i.spcue.dispatcher.CoreUnitDispatcher  : Frames found: 1 for host 192.168.31.160 652/10801152 on job testing-test-jimmy_samurai
2024-11-20 22:11:47.961  INFO 16748 --- [pool-1-thread-1] c.i.s.dispatcher.DispatchSupportService  : creating proc 192.168.31.160 for 0001-layer1
2024-11-20 22:11:47.978  INFO 16748 --- [pool-1-thread-1] c.i.spcue.dispatcher.CoreUnitDispatcher  : dispatchProcToJob failed booking proc 192.168.31.160/39c75ff3-df93-4e25-9203-03b3f91e392f on job testing-test-jimmy_samurai/94baa341-401a-4aaf-bce1-7dab31258b8c

com.imageworks.spcue.dispatcher.DispatcherException: 192.168.31.160 could not be booked on 0001-layer1, java.lang.NullPointerException
    at com.imageworks.spcue.dispatcher.DispatchSupportService.runFrame(DispatchSupportService.java:214) ~[main/:na]
    at com.imageworks.spcue.dispatcher.DispatchSupportService$$FastClassBySpringCGLIB$$39539eb5.invoke(<generated>) ~[main/:na]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:769) ~[spring-aop-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:747) ~[spring-aop-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:366) ~[spring-tx-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:99) ~[spring-tx-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:747) ~[spring-aop-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:689) ~[spring-aop-5.2.1.RELEASE.jar:5.2.1.RELEASE]
    at com.imageworks.spcue.dispatcher.DispatchSupportService$$EnhancerBySpringCGLIB$$c48bb835.runFrame(<generated>) ~[main/:na]
    at com.imageworks.spcue.dispatcher.CoreUnitDispatcher.dispatch(CoreUnitDispatcher.java:392) ~[main/:na]
    at com.imageworks.spcue.dispatcher.CoreUnitDispatcher$1.wrapDispatchFrame(CoreUnitDispatcher.java:310) ~[main/:na]
    at com.imageworks.spcue.dispatcher.CoreUnitDispatcher$DispatchFrameTemplate.execute(CoreUnitDispatcher.java:483) ~[main/:na]
    at com.imageworks.spcue.dispatcher.CoreUnitDispatcher.dispatchHost(CoreUnitDispatcher.java:314) ~[main/:na]
    at com.imageworks.spcue.dispatcher.CoreUnitDispatcher.dispatchJobs(CoreUnitDispatcher.java:176) ~[main/:na]
    at com.imageworks.spcue.dispatcher.CoreUnitDispatcher.dispatchHost(CoreUnitDispatcher.java:235) ~[main/:na]
    at com.imageworks.spcue.dispatcher.commands.DispatchBookHost$1.wrapDispatchCommand(DispatchBookHost.java:106) ~[main/:na]
    at com.imageworks.spcue.dispatcher.commands.DispatchCommandTemplate.execute(DispatchCommandTemplate.java:36) ~[main/:na]
    at com.imageworks.spcue.dispatcher.commands.DispatchBookHost.run(DispatchBookHost.java:117) ~[main/:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]

lithorus commented 1 day ago

I did a trace and this is what I get in cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupportService.java, at DispatchSupportService > runFrame > rqdClient.launchFrame(prepareRqdRunFrame(proc, frame), proc);

param_1 = {VirtualProc@10068} "192.168.31.160/7c133ad0-91bb-4a96-992e-90f4709bcdfb"
 hostId = "fcc88160-7cad-49de-997d-445dda14f1a3"
 allocationId = "00000000-0000-0000-0000-000000000000"
 frameId = "c84b22e3-bf1b-4ce9-af2f-0a3a205e26a9"
 hostName = "192.168.31.160"
 os = null
 childProcesses = null
 canHandleNegativeCoresRequest = true
 coresReserved = 100
 memoryReserved = 3354624
 memoryUsed = 0
 memoryMax = 0
 virtualMemoryUsed = 0
 virtualMemoryMax = 0
 gpusReserved = 0
 gpuMemoryReserved = 0
 gpuMemoryUsed = 0
 gpuMemoryMax = 0
 unbooked = false
 usageRecorded = false
 isLocalDispatch = false
 layerId = "86bf147f-3709-4398-80ef-d1c0f604a430"
 version = 0
 showId = "00000000-0000-0000-0000-000000000000"
 facilityId = "AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAAAAA1"
 jobId = "94baa341-401a-4aaf-bce1-7dab31258b8c"
 id = "7c133ad0-91bb-4a96-992e-90f4709bcdfb"
 name = "unknown"
param_2 = {DispatchFrame@10069} "0001-layer1/c84b22e3-bf1b-4ce9-af2f-0a3a205e26a9"
 retries = 0
 state = {FrameState@10086} "WAITING"
 show = "testing"
 shot = "test"
 owner = "jimmy"
 uid = {Optional@10090} "Optional[1000]"
 logDir = "/var/tmp//testing/test/logs/testing-test-jimmy_samurai--94baa341-401a-4aaf-bce1-7dab31258b8c"
 command = "python3 -c "import os;print(os.path.expanduser('~/test'))""
 range = "1-1"
 chunkSize = 1
 layerName = "layer1"
 jobName = "testing-test-jimmy_samurai"
 minCores = 100
 maxCores = 100
 threadable = false
 minGpus = 0
 maxGpus = 0
 minGpuMemory = 0
 services = "blender"
 os = null
 minMemory = 3354624
 softMemoryLimit = 3690086
 hardMemoryLimit = 4696473
 layerId = "86bf147f-3709-4398-80ef-d1c0f604a430"
 version = 8
 showId = "00000000-0000-0000-0000-000000000000"
 facilityId = "AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAAAAA1"
 jobId = "94baa341-401a-4aaf-bce1-7dab31258b8c"
 id = "c84b22e3-bf1b-4ce9-af2f-0a3a205e26a9"
 name = "0001-layer1"

Notice that `os` is null in both cases, and later on the code expects it not to be null.

lithorus commented 1 day ago

It fails in cuebot/src/compiled_protobuf/main/java/com/imageworks/spcue/grpc/rqd/RunFrame.java : RunFrame > Builder :

    /**
     * <code>string os = 25;</code>
     * @param value The os to set.
     * @return This builder for chaining.
     */
    public Builder setOs(
        java.lang.String value) {
      if (value == null) {
        throw new NullPointerException();
      }

This setter expects a non-null string and otherwise fails with a NullPointerException.
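To illustrate the failure mode outside of cuebot, here is a minimal sketch. `RunFrameBuilderSketch` is a hypothetical stand-in that only reproduces the null check quoted above, and `osOrDefault` is a hypothetical helper, not something that exists in OpenCue. Falling back to an empty string (the proto3 default for string fields) is an assumption; whether rqd handles an empty `os` correctly is not verified here, and this is not necessarily how the project will fix it.

```java
import java.util.Objects;

// Hypothetical stand-in for the generated RunFrame.Builder.
// Only the null check shown in the issue is reproduced.
final class RunFrameBuilderSketch {
    private String os = ""; // proto3 string fields default to the empty string

    RunFrameBuilderSketch setOs(String value) {
        if (value == null) {
            throw new NullPointerException();
        }
        this.os = value;
        return this;
    }

    String getOs() {
        return os;
    }
}

final class OsGuardSketch {
    // Hypothetical helper: use the job's os when present, otherwise a fallback.
    static String osOrDefault(String jobOs, String fallback) {
        return jobOs != null ? jobOs : Objects.requireNonNull(fallback, "fallback");
    }

    public static void main(String[] args) {
        String jobOs = null; // mirrors "os = null" in the trace above

        try {
            new RunFrameBuilderSketch().setOs(jobOs);
        } catch (NullPointerException e) {
            System.out.println("unguarded setOs(null) threw NullPointerException");
        }

        // Guarded call: the null never reaches the builder.
        RunFrameBuilderSketch builder =
                new RunFrameBuilderSketch().setOs(osOrDefault(jobOs, ""));
        System.out.println("guarded os is empty: " + builder.getOs().isEmpty());
    }
}
```

An alternative under the same assumptions would be to skip the `setOs` call entirely when the job has no `os`, which leaves the field at its default.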

DiegoTavares commented 1 day ago

(face palm) I'm sorry, I confused this issue with another one that was fixed by the mentioned PR. I'm reopening this.