CMSCompOps / WmAgentScripts

CMS Workflow Team Scripts

Test T2_CH_CERN_P5 for production #1101

Open haozturk opened 1 year ago

haozturk commented 1 year ago

HLT resources are being moved under this site name, and SI informed us that the site is ready. I submitted the following workflow as a test before enabling the site for prod:

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221201_141613_9953

let's see how it goes.

haozturk commented 1 year ago

Saqib let me know that the testbed will not work, because the testbed agent isn't connected to the global pool. I submitted the same workflow in prod which should work: https://cmsweb.cern.ch/reqmgr2/fetch?rid=haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221207_154734_8291

haozturk commented 1 year ago

The request is picked up by cmsgwms-submit8.fnal.gov but I don't see jobs created or injected. Not sure why. Can this agent work w/ T2_CH_CERN_P5 in principle?

haozturk commented 1 year ago

Hi @amaltaro @todor-ivanov, can you please check why this workflow isn't moving on submit8?

todor-ivanov commented 1 year ago

Hi @haozturk, as discussed during the meeting, this workflow has landed on submit8, which is having problems. It still managed to materialize 350 jobs from the WMCore queue into condor jobs:

[cmsdataops@cmsgwms-submit8 current]$ condor_q  -const 'WMAgent_RequestName == "haozturk_task_HIG-RunIISummer20UL17wmLHEGEN-Backfill-04142__v1_T_221207_154734_8291"'
Total for query: 350 jobs; 0 completed, 0 removed, 350 idle, 0 running, 0 held, 0 suspended 
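
For anyone who wants to script this check, here is a minimal sketch that builds the same `condor_q` argument list and parses the summary line shown above. The helper names `build_condor_q_args` and `parse_summary` are hypothetical, and the regular expression assumes the summary keeps exactly the format printed here; it does not call condor itself.

```python
import re

# Hypothetical helper: builds the argument list for the query used above.
# It only assembles the command; running it requires condor_q on the agent.
def build_condor_q_args(request_name):
    constraint = f'WMAgent_RequestName == "{request_name}"'
    return ["condor_q", "-const", constraint]

# Assumes the summary line has exactly the format shown in the output above.
SUMMARY_RE = re.compile(
    r"Total for query: (\d+) jobs; (\d+) completed, (\d+) removed, "
    r"(\d+) idle, (\d+) running, (\d+) held, (\d+) suspended"
)
SUMMARY_KEYS = ("total", "completed", "removed", "idle", "running", "held", "suspended")

# Hypothetical helper: turns the summary line into a dict of integer counts,
# or returns None when the line does not match.
def parse_summary(line):
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return dict(zip(SUMMARY_KEYS, (int(g) for g in match.groups())))

line = ("Total for query: 350 jobs; 0 completed, 0 removed, 350 idle, "
        "0 running, 0 held, 0 suspended")
counts = parse_summary(line)
# All jobs idle and none running is the "landed but not matching" situation
# described in this comment.
stuck = counts["idle"] == counts["total"] and counts["running"] == 0
```

The argument list can be passed to `subprocess.run` on a machine where `condor_q` is available; that step is left out because it depends on the agent environment.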

Taking one of them and analyzing it:

[cmsdataops@cmsgwms-submit8 current]$ condor_q -better 366382.56
...
Job 366382.056 defines the following attributes:
    ExtraMemory = 500
    JobCpus = ((JobStatus =!= 1) && (JobStatus =!= 5) &&  !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) =!= error)) ? int(MATCH_EXP_JOB_GLIDEIN_Cpus) : OriginalCpus
    JobStatus = 1
    MaxCores = 4
    MinCores = 2.0
    OriginalCpus = 4
    OriginalMemory = 7900
    RequestCpus = WMCore_ResizeJob ? ( !isUndefined(Cpus) ? RequestResizedCpus : JobCpus) : OriginalCpus
    RequestDisk = 5000000
    RequestMemory = OriginalMemory + ExtraMemory * (WMCore_ResizeJob ? (RequestCpus - OriginalCpus) : 0)
    RequestResizedCpus = (Cpus > MaxCores) ? MaxCores : ((Cpus < MinCores) ? MinCores : Cpus)
    REQUIRED_ARCH = "X86_64"
    WMCore_ResizeJob = false

The Requirements expression for job 366382.056 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]      142139  stringListMember(TARGET.Arch,REQUIRED_ARCH)
[1]      142139  TARGET.OpSys == "LINUX"
[3]       70839  TARGET.Disk >= RequestDisk
[5]       69910  TARGET.Memory >= RequestMemory
[6]       34009  [3] && [5]
[7]       69517  TARGET.Cpus >= RequestCpus
[8]       26381  [6] && [7]

366382.056:  Run analysis summary ignoring user priority.  Of 33408 machines,
    343 are rejected by your job's requirements

Just a guess here: those 343 sound like slots at the correct destination that fail to match the job requirements. Maybe SI needs to check how those slots are built for this workflow.
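
To see which concrete values the job is asking for, the attribute expressions in the condor_q output can be evaluated by hand. The sketch below is a plain-Python transcription of `RequestResizedCpus`, `RequestCpus` and `RequestMemory` as printed above, not the ClassAd evaluator itself. The function names and the `slot_cpus` parameter (standing in for the slot's `Cpus`) are assumptions made for illustration, and `JobCpus` is taken to equal `OriginalCpus` because `JobStatus = 1` makes its first condition false.

```python
# Plain-Python transcription of the ClassAd expressions shown above.
# Function names and slot_cpus are illustrative assumptions.

def request_resized_cpus(slot_cpus, min_cores, max_cores):
    # RequestResizedCpus: clamp the slot's Cpus into [MinCores, MaxCores].
    if slot_cpus > max_cores:
        return max_cores
    if slot_cpus < min_cores:
        return min_cores
    return slot_cpus

def request_cpus(resize_job, original_cpus, job_cpus, min_cores, max_cores,
                 slot_cpus=None):
    # RequestCpus: only resizable jobs look at the slot; others keep OriginalCpus.
    if not resize_job:
        return original_cpus
    if slot_cpus is not None:  # corresponds to !isUndefined(Cpus)
        return request_resized_cpus(slot_cpus, min_cores, max_cores)
    return job_cpus

def request_memory(resize_job, original_memory, extra_memory, req_cpus,
                   original_cpus):
    # RequestMemory: ExtraMemory is added per core beyond OriginalCpus,
    # and only when resizing is enabled.
    extra_cores = (req_cpus - original_cpus) if resize_job else 0
    return original_memory + extra_memory * extra_cores

# Values from job 366382.056 (WMCore_ResizeJob = false, JobStatus = 1).
cpus = request_cpus(resize_job=False, original_cpus=4, job_cpus=4,
                    min_cores=2.0, max_cores=4)
memory = request_memory(resize_job=False, original_memory=7900,
                        extra_memory=500, req_cpus=cpus, original_cpus=4)
print(cpus, memory)  # 4 7900
```

With resizing off, the job needs a slot with at least 4 Cpus, 7900 Memory and 5000000 Disk, which are the values that conditions [3], [5] and [7] compare against.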

saqibhaleem commented 1 year ago

hi @haozturk

MoniT is now showing successful completion of production jobs on this new site name. The workflows were queued on cmsgwms-submit8.fnal.gov. Can you please verify, and resume any further tests if needed? Thanks.

haozturk commented 1 year ago

Hi @saqibhaleem thanks for the update. Things look good from our side. I don't see a reason for further testing. Feel free to scale up.