DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Investigating compute sites where no JUSTIN jobs run #49

Closed StevenCTimm closed 1 year ago

StevenCTimm commented 1 year ago

We now have the capability to run condor_q against the OSG factory and get the HoldReason for those sites where all glideins are held.

Using this to investigate,

From gfactory-2.opensciencegrid.org

[root@gpfrontend01 group_dune]# condor_q -global -pool gfactory-2.opensciencegrid.org -constraint 'GlideinClient=="gpfrontend01-fnal-gov_gWMSFrontend.dune"' -nobatch -constraint JobStatus==5 -af GlideinEntryName HoldReason | sort | uniq -c
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported. For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
 49 CMSHTPC_T2_CH_CERN_ce503 Error connecting to schedd ce503.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 46 CMSHTPC_T2_CH_CERN_ce504 Error connecting to schedd ce504.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce505 Error connecting to schedd ce505.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 49 CMSHTPC_T2_CH_CERN_ce506 Error connecting to schedd ce506.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce507 Error connecting to schedd ce507.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 45 CMSHTPC_T2_CH_CERN_ce508 Error connecting to schedd ce508.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 47 CMSHTPC_T2_CH_CERN_ce509 Error connecting to schedd ce509.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce510 Error connecting to schedd ce510.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce511 Error connecting to schedd ce511.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce512 Error connecting to schedd ce512.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 49 CMSHTPC_T2_CH_CERN_ce513 Error connecting to schedd ce513.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 47 CMSHTPC_T2_CH_CERN_ce514 Error connecting to schedd ce514.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 45 CMSHTPC_T2_CH_CERN_ce515 Error connecting to schedd ce515.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
  1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
  6 CMSHTPC_T3_US_NotreDame_deepthought Job disappeared from remote schedd
 46 CMSHTPC_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 45 CMS_T2_US_Nebraska_Red_gw1_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 45 CMS_T2_US_Nebraska_Red_gw2_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 45 CMS_T2_US_Nebraska_Red_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 49 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  2 DUNE_UK_Liverpool_hepgrid6 Error connecting to schedd hepgrid6.ph.liv.ac.uk: AUTHENTICATE:1003:Failed to authenticate with any method
 48 DUNE_US_BNL_sp01 Error connecting to schedd spce01.sdcc.bnl.gov: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 97 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 44 HCC_US_BNL_gk02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  1 HCC_US_Michigan_gate02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 47 HCC_US_Omaha_swan ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  1 UBoone_T2_UK_Manchester_ce01 ARC job failed: LRMS error: (271) job killed: vmem


From vocms0207.cern.ch

[root@gpfrontend01 group_dune]# condor_q -global -pool vocms0207.cern.ch -constraint 'GlideinClient=="gpfrontend01-fnal-gov_gWMSFrontend.dune"' -nobatch -constraint JobStatus==5 -af GlideinEntryName HoldReason | sort | uniq -c
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported. For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
  1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
 42 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  1 DUNE_CA_Victoria_dune-condor_whole Error connecting to schedd dune-condor.heprc.uvic.ca: SECMAN:2007:Failed to received post-auth ClassAd
 24 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  3 HCC_US_BNL_gk02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
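Output in the `sort | uniq -c` shape produced by the two queries above is easier to work with once each line is split into a count, an entry name, and a hold reason. The following is a minimal sketch of such a parser, not part of the dist-comp scripts; the names `HeldEntry` and `parse_held_lines` are made up for illustration, and it assumes entry names never contain spaces.

```python
import re
from typing import NamedTuple


class HeldEntry(NamedTuple):
    count: int    # number of held glideins with this entry/reason pair
    entry: str    # GlideinEntryName
    reason: str   # HoldReason text


# One line of `uniq -c` output: a count, the entry name (assumed to have
# no spaces), then the hold reason as the rest of the line.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.*\S)\s*$")


def parse_held_lines(text: str) -> list[HeldEntry]:
    """Parse `condor_q ... | sort | uniq -c` output into HeldEntry rows."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank lines and the GSI warning do not start with a count
        count, entry, reason = match.groups()
        entries.append(HeldEntry(int(count), entry, reason))
    return entries


# Two lines copied from the vocms0207 output, plus the warning line.
sample = """\
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
  1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
 42 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
"""

entries = parse_held_lines(sample)
for e in entries:
    print(e.count, e.entry, e.reason)
```

The warning line is skipped only because it does not begin with a number; any other non-count line is dropped the same way.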


So what does that give us:

 46 CMSHTPC_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 45 CMS_T2_US_Nebraska_Red_gw1_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 45 CMS_T2_US_Nebraska_Red_gw2_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 45 CMS_T2_US_Nebraska_Red_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 49 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account

That means that whether or not Omaha and Nebraska are up, they don't take DUNE anymore. I don't think this is worth a ticket, but it might be.

 46 HCC_US_Omaha_swan ERROR: Failed to submit job.  SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account

Swan is supposedly the new one but we are not mapping there either.

  2 DUNE_UK_Liverpool_hepgrid6 Error connecting to schedd hepgrid6.ph.liv.ac.uk: AUTHENTICATE:1003:Failed to authenticate with any method

We've abandoned hepgrid6 for DUNE; it doesn't take tokens. We are using only hepgrid5, but need to be sure that's working.

 49 CMSHTPC_T2_CH_CERN_ce503 Error connecting to schedd ce503.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 46 CMSHTPC_T2_CH_CERN_ce504 Error connecting to schedd ce504.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce505 Error connecting to schedd ce505.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 49 CMSHTPC_T2_CH_CERN_ce506 Error connecting to schedd ce506.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce507 Error connecting to schedd ce507.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 45 CMSHTPC_T2_CH_CERN_ce508 Error connecting to schedd ce508.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 47 CMSHTPC_T2_CH_CERN_ce509 Error connecting to schedd ce509.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce510 Error connecting to schedd ce510.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce511 Error connecting to schedd ce511.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 48 CMSHTPC_T2_CH_CERN_ce512 Error connecting to schedd ce512.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 49 CMSHTPC_T2_CH_CERN_ce513 Error connecting to schedd ce513.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 47 CMSHTPC_T2_CH_CERN_ce514 Error connecting to schedd ce514.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 45 CMSHTPC_T2_CH_CERN_ce515 Error connecting to schedd ce515.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS

Evidently the CERN HTCondor-CEs don't take SciTokens yet. This is troublesome and needs more testing. Jobs are getting through to CERN via the CERN factory.

 48 DUNE_US_BNL_sp01 Error connecting to schedd spce01.sdcc.bnl.gov: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
 97 HCC_US_BNL_gk01 ERROR: Failed to submit job.  SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
 44 HCC_US_BNL_gk02 ERROR: Failed to submit job.  SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account

So there is one (spce01.sdcc.bnl.gov) that doesn't take scitokens, and 2 others that don't take DUNE even though the factory entry says they do.
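The hold reasons seen so far fall into a few recurring classes, so the triage can be sketched as a substring match on HoldReason followed by a per-class sum. This is only an illustration: the class labels are shorthand invented here, not HTCondor terminology, and the rows are the BNL and Liverpool lines from the gfactory-2 output.

```python
from collections import defaultdict

# (substring of HoldReason, label). Order matters: the SCITOKENS message also
# contains the generic "any method" text, so it has to be checked first.
# The labels are shorthand for this sketch only.
CATEGORIES = [
    ("not a valid user account", "no local account mapping"),
    ("Failed to authenticate using SCITOKENS", "SciToken rejected"),
    ("Failed to authenticate with any method", "authentication failed"),
    ("Job disappeared from remote schedd", "job lost at remote schedd"),
]


def classify(reason: str) -> str:
    """Return the first matching label for a hold reason, or 'other'."""
    for needle, label in CATEGORIES:
        if needle in reason:
            return label
    return "other"


def held_by_category(rows):
    """Sum held-glidein counts per class for (count, entry, reason) rows."""
    totals = defaultdict(int)
    for count, _entry, reason in rows:
        totals[classify(reason)] += count
    return dict(totals)


NO_ACCOUNT = ('ERROR: Failed to submit job. SCHEDD:2:Setting owner to '
              '"dunepilot", which is not a valid user account')

rows = [
    (48, "DUNE_US_BNL_sp01",
     "Error connecting to schedd spce01.sdcc.bnl.gov: "
     "AUTHENTICATE:1003:Failed to authenticate with any method|"
     "AUTHENTICATE:1004:Failed to authenticate using SCITOKENS"),
    (97, "HCC_US_BNL_gk01", NO_ACCOUNT),
    (44, "HCC_US_BNL_gk02", NO_ACCOUNT),
    (2, "DUNE_UK_Liverpool_hepgrid6",
     "Error connecting to schedd hepgrid6.ph.liv.ac.uk: "
     "AUTHENTICATE:1003:Failed to authenticate with any method"),
]

totals = held_by_category(rows)
for label, held in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{held:4d} {label}")
```

Anything that matches none of the substrings, such as the Manchester ARC vmem failure, lands in "other" and still needs a manual look.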

  1 HCC_US_Michigan_gate02 ERROR: Failed to submit job.  SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account

This one is intermittent; the next one appeared to get through. Need to keep an eye on this.

From vocms0207:

  1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
 42 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  1 DUNE_CA_Victoria_dune-condor_whole Error connecting to schedd dune-condor.heprc.uvic.ca: SECMAN:2007:Failed to received post-auth ClassAd
 24 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
  3 HCC_US_BNL_gk02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account

2007 is a strange SECMAN error that doesn't happen often; it could signify some kind of firewall issue.
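Several entries appear in both factories' output, so the per-entry picture only comes together once the two sets of counts are added. A minimal sketch using `collections.Counter`, with the counts copied from the gfactory-2 and vocms0207 listings for the entries that vocms0207 reports; the variable names are illustrative only.

```python
from collections import Counter

# Held-glidein counts per entry, copied from the two condor_q outputs.
gfactory2 = Counter({
    "CMSHTPC_T2_US_UCSD_gw6": 1,
    "CMS_T3_US_Omaha_crane": 49,
    "HCC_US_BNL_gk01": 97,
    "HCC_US_BNL_gk02": 44,
})
vocms0207 = Counter({
    "CMSHTPC_T2_US_UCSD_gw6": 1,
    "CMS_T3_US_Omaha_crane": 42,
    "DUNE_CA_Victoria_dune-condor_whole": 1,
    "HCC_US_BNL_gk01": 24,
    "HCC_US_BNL_gk02": 3,
})

# Counter addition sums the counts key by key across both factories.
combined = gfactory2 + vocms0207

for entry, held in combined.most_common():
    print(f"{held:4d} {entry}")
```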

StevenCTimm commented 1 year ago

Will discuss the above with distributed computing group and file tickets accordingly.

StevenCTimm commented 1 year ago

From Ken's comments in Slack: Lincoln is retired, Swan is the new one.

StevenCTimm commented 1 year ago

We have jobs pending on "Lincoln" (rhino-gw1.unl.edu); the entry is not disabled in the factory, but nothing is running.
All jobs on Nebraska (red-gw1, red-gw2) are getting held for having the wrong tokens.
All jobs on Omaha (crane-gw1) are also getting held for having the wrong tokens.
Some jobs on Swan (swan-gw1.) are also getting held for having the wrong tokens, but some are getting through and dying for lack of user namespaces.

StevenCTimm commented 1 year ago

Clemson glideins are getting submitted but never running.
Looks like MWT2/UChicago may actually be the same site.

StevenCTimm commented 1 year ago

and that MWT2/UChicago has dropped support for DUNE

StevenCTimm commented 1 year ago

3 of 4 BNL gatekeepers are failing.

StevenCTimm commented 1 year ago

UConn is working and we don't have that one in our testbed at all

StevenCTimm commented 1 year ago

SU-OG is gone

StevenCTimm commented 1 year ago

MIT is getting no glideins because it's excluded in the frontend; trying to figure out why and when.

StevenCTimm commented 1 year ago

The Florida entry is marked as in downtime and probably has been for quite some time.

StevenCTimm commented 1 year ago

This is a summary of all the US sites:

US_Caltech: Works, but we rarely if ever get any slots
US_Clemson: Glideins pending there almost indefinitely
US_Colorado: OK
US_Florida: Has bad CVMFS; we asked the factory ops to put it in downtime for us
US_Lincoln: (aka rhino-gw1) supposedly decommissioned; factory glideins pending indefinitely, but factory still shows it as up
US_Michigan: OK
US_MIT: We voluntarily exclude it due to high pre-emption
US_MWT2: There is no GLIDEIN_DUNESITE by this name anymore. Most of the resources are available at US_UChicago
US_Nebraska: (red-gw1, red-gw2) all glideins failing to map, have to file a ticket
US_NotreDame: OK
US_Omaha: (crane-gw1, crane-gw2) all glideins failing to map, have to file a ticket
US_PuertoRico: OK
US_SU-ITS: Failing Justin (but not regular jobs) due to lack of user namespaces
US_SU-OG: No longer on OSG
US_Swan: Failing Justin (but not regular jobs) due to lack of user namespaces. In addition, some tokens are getting mismapped and glideins are going held; need a ticket.
US_UChicago: No longer supports DUNE
US_WSU: Glideins are running but not calling back to the user pool; a ticket has been filed.

StevenCTimm commented 1 year ago

Also, UConn doesn't have a GLIDEIN_DUNESITE, but it does support DUNE; we should add it.

StevenCTimm commented 1 year ago

Closing this; we now have individual tasks for all of the above issues.