DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

US_FNAL_Fermigrid no jobs matching #176

Closed StevenCTimm closed 3 months ago

StevenCTimm commented 4 months ago

AWT shows that no jobs have run at FNAL_FermiGrid for 17 days.

Investigation shows that (1) the AWT jobs are being submitted correctly, with the correct DESIRED_Sites; (2) glideinwms is submitting glideins to this resource and they are running; and (3) the AWT jobs never match the glideins, for reasons not yet understood.

condor_q -better -name justin-prod-sched01.dune.hep.ac.uk 232047.0 -reverse -machine slot1@glidein_8_164690344@dunegli-3799322-0-fnpc19110.fnal.gov

       Clusters
Step   Matched  Condition
[15]   1        stringListsIntersect(MY.SLOT_BAD_JOBSUB_GROUPS,TARGET.Jobsub_Group,",") == false
[22]   never    GLIDEIN_ToDie - MyCurrentTime
[23]   1        JOB_EXPECTED_MAX_LIFETIME < (GLIDEIN_ToDie - MyCurrentTime)
[27]   1        isUndefined(TARGET.RequestGPUs)
[28]   1        isUndefined(TARGET.JobFactoryType)
[30]   0        isUndefined(DESIRED_SITES)
[31]   0        stringlistimember("T3_US_NERSC",TARGET.DESIRED_Sites)
[32]   1        [30] || [31]
[37]   now      CurrentTime < GLIDEIN_ToRetire
[40]   always   true
[42]   1        WithinResourceLimits

slot1@glidein_8_164690344@dunegli-3799322-0-fnpc19110.fnal.gov: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    0 have job requirements that match this slot.
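The output above can be read mechanically. The following is a minimal Python sketch, not part of the original investigation, that pulls the step table and the summary counts out of pasted `condor_q` analysis text. The function names `parse_steps` and `failing_side` are hypothetical, and the regular expressions assume only the wording seen in this issue. Note that a `0` in the Matched column is not necessarily fatal on its own, since step [32] combines [30] and [31] with `||`.

```python
import re

# A step looks like "[31] 0 <condition>"; the Matched column is either a
# count or one of the words seen in this issue (never / now / always).
STEP_RE = re.compile(r"\[(\d+)\]\s+(\d+|never|now|always)\s+")


def parse_steps(text):
    """Return (step, matched, condition) tuples from the step table.

    Works whether the steps are one per line or run together on one line.
    """
    hits = list(STEP_RE.finditer(text))
    steps = []
    for i, hit in enumerate(hits):
        # The condition runs until the next step marker (or end of text).
        end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        steps.append((int(hit.group(1)), hit.group(2), text[hit.end():end].strip()))
    return steps


def failing_side(summary):
    """Say which side of the two-way match rejects: 'job', 'slot', 'both' or 'none'."""
    slot_ok = int(re.search(r"(\d+) match the requirements of this slot", summary).group(1))
    job_ok = int(re.search(r"(\d+) have job requirements that match this slot", summary).group(1))
    if slot_ok and job_ok:
        return "none"
    if slot_ok:
        return "job"   # slot accepts the job, job requirements reject the slot
    if job_ok:
        return "slot"  # job accepts the slot, slot requirements reject the job
    return "both"


# Excerpts copied from the output in this issue.
SAMPLE_STEPS = """[22] never GLIDEIN_ToDie - MyCurrentTime
[30] 0 isUndefined(DESIRED_SITES)
[31] 0 stringlistimember("T3_US_NERSC",TARGET.DESIRED_Sites)
[32] 1 [30] || [31]
[37] now CurrentTime < GLIDEIN_ToRetire"""

SAMPLE_SUMMARY = (
    "Run analysis summary of 1 jobs. "
    "0 (0.00 %) match both slot and job requirements. "
    "1 match the requirements of this slot. "
    "0 have job requirements that match this slot."
)

if __name__ == "__main__":
    for step, matched, condition in parse_steps(SAMPLE_STEPS):
        print(step, matched, condition)
    print("rejecting side:", failing_side(SAMPLE_SUMMARY))
```

For the summary in this issue the sketch reports the job side: the slot accepts the job, but the job's own requirements do not accept the slot.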

StevenCTimm commented 3 months ago

I was able to submit jobs via jobsub on dunegpschedd01 and dunegpschedd02, and those do match glideinwms glideins running at FermiGrid, so there is nothing wrong with the glideins. Probably some condor setting changed on the Justin-prod schedd. But then #177 happened, and the global pool can't see the Justin schedds at all right now, so that has to be fixed before this matching problem can be debugged further.

StevenCTimm commented 3 months ago

Justin-prod-sched01 is back, but at the moment there are no jobs requesting FNAL_GPGrid (US_FNAL_FermiGrid).

StevenCTimm commented 3 months ago

This is now fixed. It turned out that commands our glideins need were missing from the Fermi base docker image; they have now been added.
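A check along these lines could catch the same problem earlier. This is a minimal Python sketch, assuming it is run inside the container image under test. The issue does not say which commands were missing, so the names in `REQUIRED` are placeholders and `missing_commands` is a hypothetical helper.

```python
import shutil


def missing_commands(required):
    """Return the commands from `required` that are not found on PATH."""
    return [cmd for cmd in required if shutil.which(cmd) is None]


if __name__ == "__main__":
    # Placeholder names: the issue does not list the commands that were missing.
    REQUIRED = ["example-command-a", "example-command-b"]
    missing = missing_commands(REQUIRED)
    if missing:
        print("missing from image:", ", ".join(missing))
    else:
        print("all required commands present")
```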