DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

GPU matching does not seem to work for jobs in the DUNE global pool #154

Closed Andrew-McNab-UK closed 1 month ago

Andrew-McNab-UK commented 3 months ago

jobsub_submit -G dune --global-pool dune -N 10 --lines='+RequestGPUs=1' file://gpuquick.sh

produces jobs that stay idle and never run. Without --global-pool dune, the jobs start within minutes and run successfully (at QMUL or Manchester).

StevenCTimm commented 3 months ago

This has been traced to the fact that there is no "dunegpu" group in the global pool. Should be straightforward to add it. https://fermi.servicenowservices.com/sc_req_item.do?sys_id=ed94e3e987914650ee0a86e7cebb35a4&sysparm_view=ess&sysparm_record_target=sc_req_item&sysparm_record_row=1&sysparm_record_rows=17&sysparm_record_list=active%3Dtrue%5Erequest.requested_forDYNAMIC107a23f36f9c394032544d1fde3ee43b%5Erequest.requested_for%3D32b4c7270a0a3c590054cb9fd2a1c689%5EORrequest.requested_for.manager%3D32b4c7270a0a3c590054cb9fd2a1c689%5EORcat_item%21%3D0366e9a11b3ae01084150e9ee54bcb86%5Eu_security_related%3Dfalse%5EORwatch_listCONTAINS32b4c7270a0a3c590054cb9fd2a1c689%5EORDERBYDESCopened_at

The ticket linked above (RITM2053253) has been filed.

StevenCTimm commented 3 months ago

There is now a dunegpu group in dunegpfrontend01, but it's not actually returning any factory entries yet. More work is needed.

StevenCTimm commented 2 months ago

It is now understood why things haven't been matching: a global expression requires all groups to match stringlistimember("dune", GLIDEIN_Supported_VOs), but the dunegpu entries don't have "DUNE" in GLIDEIN_Supported_VOs; they have "DUNEGPU" instead. So the global expression has to be adjusted. Nick will do this at his convenience.
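To illustrate the mismatch described above, here is a minimal shell sketch that imitates the case-insensitive list-membership test performed by the ClassAd stringListIMember function. The helper name string_list_imember and the sample list values (including "NOVA") are hypothetical and only for illustration; this is not the frontend configuration itself, and it does not show how the global expression was eventually adjusted.

```shell
#!/bin/sh
# Hypothetical helper: imitates ClassAd stringListIMember(), i.e. a
# case-insensitive test of whether $1 is an item of the comma- or
# space-separated list in $2. Illustration only, not glideinWMS code.
string_list_imember() {
    needle=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # Lower-case the list and turn commas into spaces so the shell splits it.
    for item in $(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | tr ',' ' '); do
        if [ "$item" = "$needle" ]; then
            return 0
        fi
    done
    return 1
}

# Sample values are assumptions made for this example.
# An entry that lists "DUNE" matches the test for "dune":
string_list_imember "dune" "DUNE,NOVA" && echo "match"
# An entry that lists only "DUNEGPU" does not, because whole items are compared:
string_list_imember "dune" "DUNEGPU" || echo "no match"
```

The second call is the situation reported here: "DUNEGPU" contains the substring "dune", but the membership test compares whole list items, so the entry is rejected.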

StevenCTimm commented 2 months ago

dunegpu glideins are now being delivered to the DUNE global pool. Andrew's test jobs matched and were then held immediately due to lack of credentials.

Andrew-McNab-UK commented 1 month ago

So I can successfully submit a job by hand with condor_submit to the RAL schedds as dunejustin, have it run on a GPU machine at Manchester, and see that CUDA_VISIBLE_DEVICES is set. Thanks!
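For anyone repeating this check, a minimal sketch of a GPU smoke test that a job script could run is shown below. It is an assumption about what such a test might look like, not the gpuquick.sh used in this issue, and the function name report_gpu is hypothetical. It only inspects the CUDA_VISIBLE_DEVICES environment variable that the final comment mentions.

```shell
#!/bin/sh
# Hypothetical GPU smoke test; NOT the gpuquick.sh referenced in this issue.
# Reports whether the batch system has exposed a GPU to the job through
# the CUDA_VISIBLE_DEVICES environment variable.
report_gpu() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "GPU assigned: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
        return 0
    fi
    echo "No GPU assigned: CUDA_VISIBLE_DEVICES is unset or empty"
    return 1
}
```

A job script could end with `report_gpu || exit 1` so that a job landing on a slot without a GPU fails visibly rather than appearing to succeed.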