Closed: marianne013 closed this issue 2 years ago
What are the lines before this traceback?
2022-10-12 12:53:06 UTC WorkloadManagement/PilotManager NOTICE: Executing action ([2a0c:5bc0:c8:2:b696:91ff:fea3:3e4c]:48502)[dirac_admin:daniela.bauer] RPC/killPilot(<masked>)
2022-10-12 12:53:07 UTC WorkloadManagement/PilotManager NOTICE: Returning response ([2a0c:5bc0:c8:2:b696:91ff:fea3:3e4c]:48502)[dirac_admin:daniela.bauer] (0.25 secs) ERROR: Failed to kill at least some pilots
Is it for each and every pilot? For HTCondor pilots (all)? Or for a specific HTCondor endpoint?
It looks like an HTCondor issue as far as I can tell: I tried a different HTCondor CE with the same result. I didn't get an error for ARC:
(base) lx04:2022_Sep_30_1447_gridpp_py3 > dirac-admin-kill-pilot gsiftp://arc-ce02.gridpp.rl.ac.uk:2811/jobs/aKsKDmTrq21nc1XDjqYugZkqABFKDmABFKDmTVbWDmABFKDm01f3Kn
(base) lx04:2022_Sep_30_1447_gridpp_py3 > dirac-admin-kill-pilot htcondorce://cccondorce03.in2p3.fr/1187300.0
Failed to kill pilot htcondorce://cccondorce03.in2p3.fr/1187300.0
Failed to kill at least some pilots
However, at least according to the web portal, the ARC job wasn't actually killed (it still shows up as submitted). Maybe I am just too impatient.
Just to say that the ARC CE job now shows up as "unknown". Hrmpf.
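To answer the "all pilots, all HTCondor pilots, or one endpoint?" question more systematically, a rough sketch like the one below could group failing pilot references by scheme and CE host. It uses only the Python standard library; `group_by_endpoint` is a hypothetical helper written for this thread, not part of DIRAC, and the references are the ones quoted in this issue.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_endpoint(pilot_refs):
    """Group pilot references by (scheme, CE host).

    Hypothetical helper for triage only; it is not a DIRAC function.
    """
    groups = defaultdict(list)
    for ref in pilot_refs:
        parts = urlsplit(ref)
        # scheme is e.g. "htcondorce" or "gsiftp"; hostname drops any port
        groups[(parts.scheme, parts.hostname)].append(ref)
    return dict(groups)


if __name__ == "__main__":
    # Pilot references copied from the commands and logs in this thread
    failed = [
        "htcondorce://cccondorce03.in2p3.fr/1187300.0",
        "htcondorce://ce503.cern.ch/80838331.0",
    ]
    for (scheme, host), refs in group_by_endpoint(failed).items():
        print(scheme, host, len(refs))
```

If every failing reference ends up under the `htcondorce` scheme regardless of host, that would point at the HTCondor code path rather than at a single CE.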
I just tried this on the certification machine. There I didn't see an error, but when accidentally trying to kill the same pilot twice I got:
2022-10-13 09:17:51 UTC WorkloadManagement/PilotManager NOTICE: Executing action ([2001:1458:d00:2d::100:1a4]:42470)[dirac_admin:dbauer] RPC/killPilot(<masked>)
2022-10-13 09:17:51 UTC WorkloadManagement/PilotManager/ce503.cern.ch WARN: Failed to kill pilot htcondorce://ce503.cern.ch/80838331.0: ,
Job 80838331.0 not found
2022-10-13 09:17:51 UTC WorkloadManagement/PilotManager NOTICE: Returning response ([2001:1458:d00:2d::100:1a4]:42470)[dirac_admin:dbauer] (0.18 secs) ERROR: Failed to kill at least some pilots
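The per-pilot reason ("Job 80838331.0 not found") only appears in the service log, while the client just sees the generic message. Purely as an illustration, here is a minimal sketch of how per-pilot failure reasons could be folded into the returned error. This is not the actual `PilotManagerHandler` code: `S_OK` and `S_ERROR` are local stand-ins that imitate DIRAC's return-dictionary convention, and `kill_pilots` and `fake_kill` are hypothetical names made up for this example.

```python
def S_OK(value=None):
    """Local stand-in imitating DIRAC's success dictionary."""
    return {"OK": True, "Value": value}


def S_ERROR(message):
    """Local stand-in imitating DIRAC's error dictionary."""
    return {"OK": False, "Message": message}


def kill_pilots(pilot_refs, kill_one):
    """Try to kill every pilot and report each failure with its reason.

    kill_one is any callable taking a pilot reference and returning
    an S_OK/S_ERROR-style dictionary (hypothetical interface).
    """
    failed = {}
    for ref in pilot_refs:
        result = kill_one(ref)
        if not result["OK"]:
            failed[ref] = result["Message"]
    if failed:
        details = "; ".join(f"{ref}: {msg}" for ref, msg in failed.items())
        return S_ERROR(
            f"Failed to kill {len(failed)} of {len(pilot_refs)} pilots: {details}"
        )
    return S_OK()


def fake_kill(ref):
    """Hypothetical backend that mimics the 'job not found' case above."""
    if ref.endswith("80838331.0"):
        return S_ERROR("Job 80838331.0 not found")
    return S_OK()


if __name__ == "__main__":
    print(kill_pilots(["htcondorce://ce503.cern.ch/80838331.0"], fake_kill))
```

With something along these lines the client would at least see which pilot failed and why, instead of only "Failed to kill at least some pilots".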
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/concurrent/futures/thread.py", line 83, in _worker
    work_item.run()
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 344, in _processInThread
    result = self._processProposal(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 531, in _processProposal
    result = self._executeAction(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/private/Service.py", line 551, in _executeAction
    response = handlerObj._rh_executeAction(proposalTuple)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 124, in _rh_executeAction
    retVal = self.__doRPC(actionTuple[1])
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 255, in __doRPC
    return self.__RPCCallFunction(method, args)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 296, in __RPCCallFunction
    uReturnValue = oMethod(*args)
  File "/opt/dirac/versions/v8.1.0a2-1665643118/Linux-x86_64/lib/python3.9/site-packages/DIRAC/WorkloadManagementSystem/Service/PilotManagerHandler.py", line 364, in export_killPilot
    return S_ERROR("Failed to kill at least some pilots")
We can't do much about that, but at least it pointed me to this fix: https://github.com/DIRACGrid/DIRAC/pull/6422
I think I can't do much better than what I coded above.
Yes, I realize the subject line is a bit vague, but then so is the error I see.
results in the error below in PilotManager. I don't even know where to start with that one. This is the production server running v7.3.26.
@martynia At least I can reproduce your issue :-D