Open haozturk opened 1 year ago
Saqib let me know that the testbed will not work, because the testbed agent isn't connected to the global pool. I submitted the same workflow in prod which should work: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221207_154734_8291
The request is picked up by cmsgwms-submit8.fnal.gov but I don't see jobs created or injected. Not sure why. Can this agent work w/ T2_CH_CERN_P5 in principle?
Hi @amaltaro @todor-ivanov can you please check why this workflow doesn't move in submit8?
Hi @haozturk, as discussed during the meeting, this workflow has already landed on submit8, which is having problems. It still managed to materialize 350 jobs from the WMCore queue into condor jobs:
[cmsdataops@cmsgwms-submit8 current]$ condor_q -const 'WMAgent_RequestName == "haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221207_154734_8291"'
Total for query: 350 jobs; 0 completed, 0 removed, 350 idle, 0 running, 0 held, 0 suspended
Taking one of them and analyzing it:
[cmsdataops@cmsgwms-submit8 current]$ condor_q -better 366382.56
...
Job 366382.056 defines the following attributes:
ExtraMemory = 500
JobCpus = ((JobStatus =!= 1) && (JobStatus =!= 5) && !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)) ? int(MATCH_EXP_JOB_GLIDEIN_Cpus) : OriginalCpus
JobStatus = 1
MaxCores = 4
MinCores = 2.0
OriginalCpus = 4
OriginalMemory = 7900
RequestCpus = WMCore_ResizeJob ? ( !isUndefined(Cpus) ? RequestResizedCpus : JobCpus) : OriginalCpus
RequestDisk = 5000000
RequestMemory = OriginalMemory + ExtraMemory * (WMCore_ResizeJob ? (RequestCpus - OriginalCpus) : 0)
RequestResizedCpus = (Cpus > MaxCores) ? MaxCores : ((Cpus < MinCores) ? MinCores : Cpus)
REQUIRED_ARCH = "X86_64"
WMCore_ResizeJob = false
The Requirements expression for job 366382.056 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 142139 stringListMember(TARGET.Arch,REQUIRED_ARCH)
[1] 142139 TARGET.OpSys == "LINUX"
[3] 70839 TARGET.Disk >= RequestDisk
[5] 69910 TARGET.Memory >= RequestMemory
[6] 34009 [3] && [5]
[7] 69517 TARGET.Cpus >= RequestCpus
[8] 26381 [6] && [7]
366382.056: Run analysis summary ignoring user priority. Of 33408 machines,
343 are rejected by your job's requirements
Just a guess here: those 343 sound like slots at the correct destination that fail to match the job requirements. Maybe SI needs to check how those are built for this workflow.
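For anyone who wants to check what the job actually requests, here is a rough Python sketch of the resource expressions from the job ad above. It is not WMCore or HTCondor code, the function names are made up for illustration, and it assumes that `Cpus` in those expressions stands for the CPU count of the slot seen at match time (passed in here as `slot_cpus`).

```python
def job_cpus(ad):
    """Mirror of JobCpus: use the glidein CPU count only once the job has matched."""
    glidein = ad.get("MATCH_EXP_JOB_GLIDEIN_Cpus")
    # JobStatus 1 = idle, 5 = held: fall back to the original request.
    if ad["JobStatus"] in (1, 5) or glidein is None:
        return ad["OriginalCpus"]
    try:
        return int(glidein)
    except (TypeError, ValueError):
        return ad["OriginalCpus"]


def request_resized_cpus(ad, slot_cpus):
    """Mirror of RequestResizedCpus: clamp the slot CPUs into [MinCores, MaxCores]."""
    if slot_cpus > ad["MaxCores"]:
        return ad["MaxCores"]
    if slot_cpus < ad["MinCores"]:
        return ad["MinCores"]
    return slot_cpus


def request_cpus(ad, slot_cpus=None):
    """Mirror of RequestCpus; slot_cpus=None plays the role of an undefined Cpus."""
    if not ad["WMCore_ResizeJob"]:
        return ad["OriginalCpus"]
    if slot_cpus is None:
        return job_cpus(ad)
    return request_resized_cpus(ad, slot_cpus)


def request_memory(ad, slot_cpus=None):
    """Mirror of RequestMemory: extra memory is added only for resizable jobs."""
    extra_cores = 0
    if ad["WMCore_ResizeJob"]:
        extra_cores = request_cpus(ad, slot_cpus) - ad["OriginalCpus"]
    return ad["OriginalMemory"] + ad["ExtraMemory"] * extra_cores


# Values copied from the condor_q output for job 366382.056.
job = {
    "ExtraMemory": 500,
    "JobStatus": 1,
    "MaxCores": 4,
    "MinCores": 2.0,
    "OriginalCpus": 4,
    "OriginalMemory": 7900,
    "WMCore_ResizeJob": False,
}

print(request_cpus(job), request_memory(job))
```

Since `WMCore_ResizeJob` is false for this job, the sketch gives RequestCpus = 4 and RequestMemory = 7900, so a slot has to offer at least those values (plus RequestDisk = 5000000) to pass conditions [3], [5] and [7] in the analysis.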
hi @haozturk
MoniT is now showing successful completion of production jobs on this new site name. Workflows were queued on cmsgwms-submit8.fnal.gov. Can you please verify and also resume any further necessary tests (if needed)? Thanks.
Hi @saqibhaleem thanks for the update. Things look good from our side. I don't see a reason for further testing. Feel free to scale up.
HLT resources are being shifted under this site name and SI informed us about readiness of the site. I submitted the following workflow as a test before enabling it for prod:
https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221201_141613_9953
Let's see how it goes.