DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Andrew reports that jobs on Justin schedds aren't starting properly #185

Open StevenCTimm opened 4 hours ago

StevenCTimm commented 4 hours ago

Requested job ids and error messages if available.

This from dunegpfrontend01:

[2024-10-13 15:15:08,524] ERROR: glideinFrontendElement:1886: Failed to talk to factory_pool for global info:
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/condorMonitor.py", line 695, in fetch_using_bindings
    results = collector.query(adtype, constraint, attrs)
  File "/usr/lib64/python3.9/site-packages/htcondor/_lock.py", line 70, in wrapper
    rv = func(*args, **kwargs)
htcondor.HTCondorIOError: Failed communication with collector.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/glideinwms/frontend/glideinFrontendElement.py", line 1877, in query_globals
    factory_globals_dict = glideinFrontendInterface.findGlobals(
  File "/usr/lib/python3.9/site-packages/glideinwms/frontend/glideinFrontendInterface.py", line 169, in findGlobals
    status.load(status_constraint)
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/condorMonitor.py", line 576, in load
    self.stored_data = self.fetch(constraint, format_list)
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/condorMonitor.py", line 675, in fetch
    return CondorQuery.fetch(self, constraint=constraint, format_list=format_list)
  File "/usr/lib/python3.9/site-packages/glideinwms/lib/condorMonitor.py", line 506, in fetch
    raise QueryError(err_str) from ex

StevenCTimm commented 4 hours ago

So there's a problem communicating with the new OSG factory. Will see if anyone's watching OSG Slack over the weekend, and will open a ticket.

StevenCTimm commented 4 hours ago

The condor_status -any output from gfactory-1.osg-htc.org shows that the host is up but the factory running on it is not.
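For anyone repeating this check, here is a rough sketch of how the "up but no factory" condition could be detected from a list of ClassAds. It assumes the ads are already loaded as Python dicts, and it assumes factory ads carry a GlideinMyType attribute with values like "glidefactory" or "glidefactoryglobal"; the function names and the sample data are hypothetical, not taken from GlideinWMS.

```python
import json
from collections import Counter

# Assumption: GlideinWMS factory ads are identified by a GlideinMyType
# attribute with one of these values; adjust if your factory differs.
FACTORY_AD_TYPES = {"glidefactory", "glidefactoryglobal"}


def summarize_factory_ads(ads):
    """Count ads by GlideinMyType, keeping only factory-related types."""
    counts = Counter(ad.get("GlideinMyType") for ad in ads)
    return {t: counts[t] for t in FACTORY_AD_TYPES if counts[t] > 0}


def factory_is_advertising(ads):
    """Return True if at least one factory ad is present in the ad list."""
    return bool(summarize_factory_ads(ads))


# Hypothetical sample: the HTCondor daemons answer, but no factory ads exist.
sample_json = """[
  {"MyType": "Collector", "Name": "gfactory-1.osg-htc.org"},
  {"MyType": "Scheduler", "Name": "gfactory-1.osg-htc.org"}
]"""
sample_ads = json.loads(sample_json)
print(factory_is_advertising(sample_ads))
```

The ad list could come from something like condor_status -any -pool gfactory-1.osg-htc.org -json, saved to a file and loaded with json.load; that invocation is a suggestion and not the exact command used here.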

StevenCTimm commented 3 hours ago

https://support.opensciencegrid.org/support/tickets/public/762ff6980c053fcf974e5bc9eb3cf7807a6946123a334a49299a33d275cef755

StevenCTimm commented 3 hours ago

OSG ticket 77867 is filed; see the URL above.