StevenCTimm closed 3 months ago
I was able to submit jobs via jobsub on dunegpschedd01 and dunegpschedd02, and those do match glideinwms glideins running at FermiGrid, so there is nothing wrong with the glideins. Probably some condor setting changed somehow on Justin-prod-schedd. But then #177 happened, and the global pool can't see the Justin schedds at all right now, so that has to be fixed before this matching problem can be debugged further.
Justin-prod-sched01 is back, but at the moment there are no jobs requesting FNAL_GPGrid (US_FNAL_FermiGrid).
This is now fixed. It turned out that commands our glideins need were missing from the Fermi base docker image. They have now been added.
AWT shows no jobs running at FNAL_FermiGrid for 17 days.
Investigation shows that (1) the AWT jobs are being submitted correctly with the correct DESIRED_Sites, (2) glideins are being submitted by glideinwms to this resource and are running, and (3) the AWT jobs never match the glidein, for reasons not yet understood.
condor_q -better -name justin-prod-sched01.dune.hep.ac.uk 232047.0 -reverse -machine slot1@glidein_8_164690344@dunegli-3799322-0-fnpc19110.fnal.gov
Step  Matched  Condition
[15]  1        stringListsIntersect(MY.SLOT_BAD_JOBSUB_GROUPS,TARGET.Jobsub_Group,",") == false
[22]  never    GLIDEIN_ToDie - MyCurrentTime
[23]  1        JOB_EXPECTED_MAX_LIFETIME < (GLIDEIN_ToDie - MyCurrentTime)
[27]  1        isUndefined(TARGET.RequestGPUs)
[28]  1        isUndefined(TARGET.JobFactoryType)
[30]  0        isUndefined(DESIRED_SITES)
[31]  0        stringlistimember("T3_US_NERSC",TARGET.DESIRED_Sites)
[32]  1        [30] || [31]
[37]  now      CurrentTime < GLIDEIN_ToRetire
[40]  always   true
[42]  1        WithinResourceLimits
slot1@glidein_8_164690344@dunegli-3799322-0-fnpc19110.fnal.gov: Run analysis summary of 1 jobs.
0 (0.00 %) match both slot and job requirements.
1 match the requirements of this slot.
0 have job requirements that match this slot.
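When comparing several slots, it can help to pull out the steps of the analysis output above that report no matching jobs. The following is only a rough sketch of such a filter, not part of HTCondor: `failing_steps` is a hypothetical helper, and it assumes the step table has been saved as text with one `[step] matched condition` entry per line. The sample data is the table from this issue.

```python
import re

# Step table from this issue, one step per line (assumed layout).
SAMPLE = """\
[15] 1 stringListsIntersect(MY.SLOT_BAD_JOBSUB_GROUPS,TARGET.Jobsub_Group,",") == false
[22] never GLIDEIN_ToDie - MyCurrentTime
[23] 1 JOB_EXPECTED_MAX_LIFETIME < (GLIDEIN_ToDie - MyCurrentTime)
[27] 1 isUndefined(TARGET.RequestGPUs)
[28] 1 isUndefined(TARGET.JobFactoryType)
[30] 0 isUndefined(DESIRED_SITES)
[31] 0 stringlistimember("T3_US_NERSC",TARGET.DESIRED_Sites)
[32] 1 [30] || [31]
[37] now CurrentTime < GLIDEIN_ToRetire
[40] always true
[42] 1 WithinResourceLimits
"""

# "[step] matched condition": step number, matched column, rest of the line.
STEP_RE = re.compile(r"^\[(\d+)\]\s+(\S+)\s+(.*)$")


def failing_steps(text):
    """Return (step, matched, condition) for steps marked "0" or "never"."""
    result = []
    for line in text.splitlines():
        m = STEP_RE.match(line.strip())
        if m is None:
            continue  # skip headers, blank lines, summary text
        step, matched, condition = m.groups()
        if matched in ("0", "never"):
            result.append((int(step), matched, condition))
    return result


if __name__ == "__main__":
    for step, matched, condition in failing_steps(SAMPLE):
        print(f"[{step}] {matched}: {condition}")
```

A step flagged this way is not necessarily what blocks the match: here steps [30] and [31] are combined by the `||` in step [32], so the flagged steps are only a starting point for reading the full expression.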