Closed StevenCTimm closed 1 year ago
Will discuss the above with distributed computing group and file tickets accordingly.
From Ken's comments in Slack: Lincoln is retired, Swan is the new one.
We have jobs pending on "Lincoln" (rhino-gw1.unl.edu); the entry is not disabled in the factory but nothing is running.
All jobs on Nebraska (red-gw1, red-gw2) are getting held for having the wrong tokens.
All jobs on Omaha (crane-gw1) are also getting held for having the wrong tokens.
Some jobs on Swan (swan-gw1.) are also getting held for having the wrong tokens, but some are getting through and dying for lack of user namespaces.
Clemson glideins are getting submitted but never running.
Looks like MWT2/UChicago may actually be the same site, and that MWT2/UChicago have dropped support for DUNE.
3 of 4 BNL gatekeepers are failing.
UConn is working and we don't have that one in our testbed at all
SU-OG is gone
MIT is getting no glideins because it's excluded in the frontend; trying to figure out why and when.
Florida entry is marked in downtime and probably has been for quite some time.
This is a summary of all the US sites:
US_Caltech: Works but we rarely if ever get any slots
US_Clemson: Glideins pending there almost indefinitely
US_Colorado: OK
US_Florida: Has bad CVMFS; we asked the factory ops to put it in downtime for us
US_Lincoln: (aka rhino-gw1) supposedly decommissioned, factory glideins pending indefinitely but factory still shows as up
US_Michigan: OK
US_MIT: We voluntarily exclude due to high pre-emption
US_MWT2: There is no GLIDEIN_DUNESITE by this name anymore. Most of the resources are available at US_UChicago
US_Nebraska: (red-gw1, red-gw2) all glideins failing to map, have to file a ticket
US_NotreDame: OK
US_Omaha: (crane-gw1, crane-gw2) all glideins failing to map, have to file a ticket
US_PuertoRico: OK
US_SU-ITS: Failing Justin (but not regular jobs) due to lack of user namespaces
US_SU-OG: no longer on OSG
US_Swan: Failing Justin (but not regular jobs) due to lack of user namespaces. In addition, some tokens are getting mismapped and glideins are going held; need a ticket.
US_UChicago: no longer supports DUNE
US_WSU: glideins are running but not calling back to user pool; ticket has been filed.
Also, UConn doesn't have a GLIDEIN_DUNESITE but it does support DUNE; we should add it.
Closing this; we now have individual tasks for all of the above issues.
We now have the capability to run condor_q against the OSG factory and get the HoldReason for those sites where all glideins are held.
Using this to investigate:
From gfactory-2.opensciencegrid.org
[root@gpfrontend01 group_dune]# condor_q -global -pool gfactory-2.opensciencegrid.org -constraint 'GlideinClient=="gpfrontend01-fnal-gov_gWMSFrontend.dune"' -nobatch -constraint JobStatus==5 -af GlideinEntryName HoldReason | sort | uniq -c
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    49 CMSHTPC_T2_CH_CERN_ce503 Error connecting to schedd ce503.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    46 CMSHTPC_T2_CH_CERN_ce504 Error connecting to schedd ce504.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    48 CMSHTPC_T2_CH_CERN_ce505 Error connecting to schedd ce505.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    49 CMSHTPC_T2_CH_CERN_ce506 Error connecting to schedd ce506.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    48 CMSHTPC_T2_CH_CERN_ce507 Error connecting to schedd ce507.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    45 CMSHTPC_T2_CH_CERN_ce508 Error connecting to schedd ce508.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    47 CMSHTPC_T2_CH_CERN_ce509 Error connecting to schedd ce509.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    48 CMSHTPC_T2_CH_CERN_ce510 Error connecting to schedd ce510.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    48 CMSHTPC_T2_CH_CERN_ce511 Error connecting to schedd ce511.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    48 CMSHTPC_T2_CH_CERN_ce512 Error connecting to schedd ce512.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    49 CMSHTPC_T2_CH_CERN_ce513 Error connecting to schedd ce513.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    47 CMSHTPC_T2_CH_CERN_ce514 Error connecting to schedd ce514.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    45 CMSHTPC_T2_CH_CERN_ce515 Error connecting to schedd ce515.cern.ch: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
     1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
     6 CMSHTPC_T3_US_NotreDame_deepthought Job disappeared from remote schedd
    46 CMSHTPC_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    45 CMS_T2_US_Nebraska_Red_gw1_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    45 CMS_T2_US_Nebraska_Red_gw2_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    45 CMS_T2_US_Nebraska_Red_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    49 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     2 DUNE_UK_Liverpool_hepgrid6 Error connecting to schedd hepgrid6.ph.liv.ac.uk: AUTHENTICATE:1003:Failed to authenticate with any method
    48 DUNE_US_BNL_sp01 Error connecting to schedd spce01.sdcc.bnl.gov: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
    97 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    44 HCC_US_BNL_gk02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     1 HCC_US_Michigan_gate02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    47 HCC_US_Omaha_swan ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     1 UBoone_T2_UK_Manchester_ce01 ARC job failed: LRMS error: (271) job killed: vmem
From vocms0207.cern.ch
[root@gpfrontend01 group_dune]# condor_q -global -pool vocms0207.cern.ch -constraint 'GlideinClient=="gpfrontend01-fnal-gov_gWMSFrontend.dune"' -nobatch -constraint JobStatus==5 -af GlideinEntryName HoldReason | sort | uniq -c
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
     1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
    42 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     1 DUNE_CA_Victoria_dune-condor_whole Error connecting to schedd dune-condor.heprc.uvic.ca: SECMAN:2007:Failed to received post-auth ClassAd
    24 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     3 HCC_US_BNL_gk02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
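To compare the two factories side by side, the counted `sort | uniq -c` lines can be parsed and summed per entry. This is only a rough sketch: the helper names (`parse_counts`, `held_per_entry`) and the sample strings are made up for illustration, and it assumes the output has one "count entry reason" record per line, as printed by the pipeline above.

```python
import re
from collections import Counter

# One record per line: "<count> <GlideinEntryName> <HoldReason ...>"
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.*)$")


def parse_counts(text):
    """Return a Counter mapping (entry, reason) to the number of held glideins."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # the GSI warning and blank lines do not start with a number
            entry, reason = match.group(2), match.group(3).strip()
            counts[(entry, reason)] += int(match.group(1))
    return counts


def held_per_entry(*outputs):
    """Sum held glideins per entry across any number of factory outputs."""
    totals = Counter()
    for text in outputs:
        for (entry, _reason), n in parse_counts(text).items():
            totals[entry] += n
    return totals


# Hypothetical excerpts copied from the two outputs above.
OSG_SAMPLE = """\
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
     49 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     97 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
"""
CERN_SAMPLE = """\
     42 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
      1 DUNE_CA_Victoria_dune-condor_whole Error connecting to schedd dune-condor.heprc.uvic.ca: SECMAN:2007:Failed to received post-auth ClassAd
     24 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
"""

if __name__ == "__main__":
    for entry, n in sorted(held_per_entry(OSG_SAMPLE, CERN_SAMPLE).items()):
        print(f"{n:5d} {entry}")
```

In practice the two strings would be the saved output of the two condor_q commands rather than pasted excerpts.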
So what does that give us:
    46 CMSHTPC_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    45 CMS_T2_US_Nebraska_Red_gw1_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    45 CMS_T2_US_Nebraska_Red_gw2_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    45 CMS_T2_US_Nebraska_Red_whole_op ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
    49 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
That means that whether or not Omaha and Nebraska are up, they don't take DUNE anymore. Don't think this is worth a ticket but it might be.
Swan is supposedly the new one but we are not mapping there either.
We've abandoned hepgrid6 for DUNE since it doesn't take tokens; we are just using hepgrid5 but need to be sure that's working.
It is evident that the CERN HTCondor-CEs don't take SciTokens yet. This is troublesome and needs more testing. Jobs are getting through to CERN via the CERN factory.
So at BNL there is one gatekeeper (spce01.sdcc.bnl.gov) that doesn't take SciTokens, and two others (gk01, gk02) that don't take DUNE even though the factory entry says they do.
This one is intermittent; the next one appeared to get through. Need to keep an eye on this.
From vocms0207:
     1 CMSHTPC_T2_US_UCSD_gw6 Job disappeared from remote schedd
    42 CMS_T3_US_Omaha_crane ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     1 DUNE_CA_Victoria_dune-condor_whole Error connecting to schedd dune-condor.heprc.uvic.ca: SECMAN:2007:Failed to received post-auth ClassAd
    24 HCC_US_BNL_gk01 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
     3 HCC_US_BNL_gk02 ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account
2007 is a strange SECMAN error that doesn't often happen; it could signify some kind of firewall issue.
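The hold reasons seen in this thread fall into a handful of buckets, so the triage could be sketched as a simple substring classifier. The category labels and the `classify_hold_reason` name are invented here for illustration; only the matched substrings come from the condor_q output in this thread.

```python
def classify_hold_reason(reason):
    """Map a HoldReason string to a coarse triage bucket (labels are hypothetical)."""
    # The CE accepted the connection but has no local account for the pilot user.
    if "not a valid user account" in reason:
        return "pilot-user-not-mapped"
    # Check the SCITOKENS message before the generic 1003 one, since both can appear together.
    if "Failed to authenticate using SCITOKENS" in reason:
        return "scitokens-rejected"
    if "AUTHENTICATE:1003" in reason:
        return "authentication-failed"
    if "SECMAN:2007" in reason:
        return "secman-post-auth"
    if "Job disappeared from remote schedd" in reason:
        return "job-disappeared"
    return "other"


if __name__ == "__main__":
    samples = [
        'ERROR: Failed to submit job. SCHEDD:2:Setting owner to "dunepilot", which is not a valid user account',
        "Error connecting to schedd hepgrid6.ph.liv.ac.uk: AUTHENTICATE:1003:Failed to authenticate with any method",
        "ARC job failed: LRMS error: (271) job killed: vmem",
    ]
    for s in samples:
        print(classify_hold_reason(s), "<-", s)
```

Entries that land in "pilot-user-not-mapped" are the ones where the site would need to add a mapping for DUNE, while "scitokens-rejected" points at the CE's token configuration.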