Closed Andrew-McNab-UK closed 1 month ago
This has been traced to the fact that there is no "dunegpu" group in the global pool. It should be straightforward to add one. A ticket (RITM2053253) has been filed:
https://fermi.servicenowservices.com/sc_req_item.do?sys_id=ed94e3e987914650ee0a86e7cebb35a4&sysparm_view=ess&sysparm_record_target=sc_req_item&sysparm_record_row=1&sysparm_record_rows=17&sysparm_record_list=active%3Dtrue%5Erequest.requested_forDYNAMIC107a23f36f9c394032544d1fde3ee43b%5Erequest.requested_for%3D32b4c7270a0a3c590054cb9fd2a1c689%5EORrequest.requested_for.manager%3D32b4c7270a0a3c590054cb9fd2a1c689%5EORcat_item%21%3D0366e9a11b3ae01084150e9ee54bcb86%5Eu_security_related%3Dfalse%5EORwatch_listCONTAINS32b4c7270a0a3c590054cb9fd2a1c689%5EORDERBYDESCopened_at
There is now a dunegpu group in dunegpfrontend01, but it is not returning any factory entries yet; more work is needed.
It is now understood why things haven't been matching: a global entry requires all groups to match stringListIMember("dune", GLIDEIN_Supported_VOs), but the dunegpu entries don't have "DUNE" in GLIDEIN_Supported_VOs; they have "DUNEGPU" instead. So the global expression has to be adjusted. Nick will do this at his convenience.
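To illustrate the mismatch, here is a minimal Python sketch that emulates the case-insensitive, whole-item semantics of the ClassAd function stringListIMember. It is not the frontend configuration itself: the helper name, the example VO lists, and the "adjusted" form at the end are assumptions for illustration only, since the thread does not say how the expression was actually changed.

```python
import re


def string_list_imember(needle, string_list, delimiters=" ,"):
    """Emulate ClassAd stringListIMember: True if needle equals one
    item of the delimited list, ignoring case (no substring matching)."""
    pattern = "[" + re.escape(delimiters) + "]+"
    items = [item for item in re.split(pattern, string_list) if item]
    return needle.lower() in (item.lower() for item in items)


# Hypothetical values of GLIDEIN_Supported_VOs for two factory entries.
cpu_entry_vos = "DUNE,Fermilab"
gpu_entry_vos = "DUNEGPU"

# The global requirement as described in the thread.
print(string_list_imember("dune", cpu_entry_vos))  # True
print(string_list_imember("dune", gpu_entry_vos))  # False: "DUNEGPU" is not "DUNE"

# One possible adjustment (an assumption): accept either VO name.
adjusted = (string_list_imember("dune", gpu_entry_vos)
            or string_list_imember("dunegpu", gpu_entry_vos))
print(adjusted)  # True
```

The key point is that list membership compares whole items, so "dune" never matches "DUNEGPU" even though it is a prefix of it.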
dunegpu glideins are now being delivered to the DUNE global pool. Andrew's test jobs matched and were then held immediately due to lack of credentials.
So I can successfully submit a job by hand with condor_submit to the RAL schedds as dunejustin, have it run on a GPU machine at Manchester, and see that CUDA_VISIBLE_DEVICES is set. Thanks!
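For anyone repeating this check, a test payload could confirm the GPU assignment along the following lines. This is only a sketch: the function name is hypothetical, and it is not the content of any script used in this thread.

```python
import os


def visible_gpus(env=None):
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES,
    or an empty list if the variable is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [dev.strip() for dev in raw.split(",") if dev.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("CUDA_VISIBLE_DEVICES lists:", ", ".join(gpus))
    else:
        print("CUDA_VISIBLE_DEVICES is not set or is empty")
```

An empty result on a worker node would suggest the job landed in a slot with no GPU assigned.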
jobsub_submit -G dune --global-pool dune -N 10 --lines='+RequestGPUs=1' file://gpuquick.sh
produces jobs that stay idle and never run. Without --global-pool dune, the jobs start within minutes and run successfully (at QMUL or Manchester).