belforte opened 2 years ago
we just had a small storm of CRABServer failures over ~1h, all due to errors in talking to CRIC. So I am increasing the priority.
The first thing should be to make this code capable of dealing with wildcards in the site list: https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/TaskWorker/Actions/DagmanCreator.py#L777-L781
Then we will worry about modifying the REST to pass the list with the *'s in it.
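As a rough idea of what wildcard handling could look like, here is a minimal sketch using shell-style matching via `fnmatch`. This is a hypothetical helper, not the actual `_expandSites` implementation; the site list below is a tiny mock of what CRIC returns:

```python
import fnmatch

def expand_sites(site_list, all_sites):
    """Expand shell-style wildcards (e.g. 'T2_US_*') against the full
    CMS site list; plain site names are passed through unchanged.
    Hypothetical sketch, not the real _expandSites code."""
    expanded = set()
    for site in site_list:
        if '*' in site or '?' in site:
            expanded.update(fnmatch.filter(all_sites, site))
        else:
            expanded.add(site)
    return sorted(expanded)

# tiny mock of the CRIC site list
all_sites = ['T2_US_UCSD', 'T2_US_MIT', 'T2_IT_Bari', 'T1_US_FNAL']
print(expand_sites(['T2_US_*', 'T2_IT_Bari'], all_sites))
# → ['T2_IT_Bari', 'T2_US_MIT', 'T2_US_UCSD']
```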
The current code in the REST relies on https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/CRABInterface/RESTUserWorkflow.py#L44 and https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/CRABInterface/Utilities.py#L172-L176 to get the list of sites from CRIC every 30 min and cache it in memory. (By the way, that's hard to understand, since the WMCore CRIC class already has a 1h default cache inside... oh well...)
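For reference, the decorator-based in-memory caching pattern used there can be sketched roughly like this; the names and shape here are hypothetical and simplified, not the actual code in `Utilities.py`:

```python
import time
from functools import wraps

def memory_cache(seconds):
    """Cache the decorated function's result in memory for `seconds`.
    Illustrative sketch of the pattern; names are hypothetical."""
    def decorator(func):
        state = {'value': None, 'expires': 0.0}
        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.time()
            if now >= state['expires']:
                # cache expired: call out again; if the remote service is
                # down at this moment, the whole call fails (no stale fallback)
                state['value'] = func(*args, **kwargs)
                state['expires'] = now + seconds
            return state['value']
        return wrapper
    return decorator

@memory_cache(seconds=1800)  # 30 min, as in the REST code
def get_site_list():
    # stand-in for the real CRIC query
    return ['T2_US_UCSD', 'T2_IT_Bari']
```

Note the weakness discussed below: once the cache expires there is no fallback to stale data, so an outage of the remote service at the wrong moment makes the call fail.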
TaskWorker actions are done in independent processes, so it would make sense to reuse `_expandSites`, but cache the site list in a disk file instead (like we used to do with SiteDB info a long time ago); the list of sites from CRIC does not need to be refreshed any faster than once a day!! Anyhow, since it is one call per task, we may even do it every time: the rate is low, it is only a matter of riding out outages. There should be two times set (a refresh interval and a stale-data tolerance): the current caching reduces the number of calls to the external service, but makes things fail miserably if the server is down when the cache expires.
We should definitely combine this with the call to CRIC in DataDiscovery and have a single cache file, ref. https://github.com/dmwm/CRABServer/issues/6946 Or at least a common access method with the refresh+use-stale policy.
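The refresh+use-stale policy with a disk cache could look something like the sketch below. All names, file paths, and time limits are hypothetical, chosen only to illustrate the two-timestamp idea (refresh after a day, tolerate stale data for longer while riding out an outage):

```python
import json
import os
import time

REFRESH_AGE = 24 * 3600        # try to refresh once a day
MAX_STALE_AGE = 7 * 24 * 3600  # tolerate week-old data if the service is down

def get_sites_with_stale_fallback(cache_file, fetch_from_cric):
    """Return the site list, refreshing the on-disk cache when it is older
    than REFRESH_AGE; on fetch failure, fall back to stale cached data up
    to MAX_STALE_AGE. Hypothetical sketch, not production code."""
    age = None
    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
    if age is not None and age < REFRESH_AGE:
        # cache is fresh enough: no call to the external service at all
        with open(cache_file) as fh:
            return json.load(fh)
    try:
        sites = fetch_from_cric()
    except Exception:
        # service unreachable: ride out the outage on stale data if allowed
        if age is not None and age < MAX_STALE_AGE:
            with open(cache_file) as fh:
                return json.load(fh)
        raise
    with open(cache_file, 'w') as fh:
        json.dump(sites, fh)
    return sites
```

The design choice here is exactly the "two times" above: a short age that only limits how often we bother the external service, and a much longer age that decides when stale data becomes unacceptable, so a transient outage does not fail tasks.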
Let's move the info from https://github.com/dmwm/CRABServer/issues/6946 in here, to simplify tracking. From the TW log on Jan 3, 2022:
2022-01-03 15:55:02,724:INFO:DBSDataDiscovery:Looking up data location with Rucio in cms scope.
2022-01-03 15:55:03,132:DEBUG:DataDiscovery: Formatting data discovery output
2022-01-03 15:56:46,259:ERROR:DataDiscovery:Impossible translating ['T2_US_UCSD', 'T2_PK_NCP', 'T2_RU_IHEP', 'T2_UA_KIPT', 'T1_FR_CCIN2P3_Disk', 'T1_ES_PIC_Disk', 'T3_KR_UOS', 'T2_AT_Vienna', 'T1_US_FNAL_Disk', 'T2_FR_IPHC', 'T3_US_Colorado', 'T2_IT_Bari', 'T3_TW_NTU_HEP', 'T2_UK_SGrid_RALPP', 'T3_IT_Trieste', 'T2_BR_SPRACE', 'T1_DE_KIT_Disk', 'T2_US_Caltech', 'T2_UK_London_Brunel', 'T2_IT_Legnaro', 'T2_IT_Rome', 'T2_CH_CSCS', 'T2_BE_UCL', 'T2_GR_Ioannina', 'T3_KR_KNU', 'T2_UK_London_IC', 'T3_US_UMiss', 'T2_UK_SGrid_Bristol', 'T1_IT_CNAF_Disk', 'T2_HU_Budapest', 'T0_CH_CERN_Disk', 'T2_US_MIT', 'T3_CH_PSI', 'T1_UK_RAL_Disk', 'T2_US_Caltech_Ceph', 'T3_BG_UNI_SOFIA', 'T2_RU_JINR', 'T2_BR_UERJ', 'T3_US_NotreDame', 'T2_FR_GRIF_LLR', 'T2_ES_IFCA', 'T2_US_Wisconsin', 'T3_FR_IPNL', 'T3_US_NERSC', 'T2_FR_GRIF_IRFU', 'T2_FI_HIP', 'T2_PL_Swierk', 'T3_US_Rutgers', 'T2_TR_METU', 'T3_US_MIT', 'T2_US_Nebraska', 'T2_KR_KISTI', 'T2_CN_Beijing', 'T2_EE_Estonia', 'T3_US_Baylor', 'T2_US_Florida', 'T1_RU_JINR_Disk', 'T2_US_Vanderbilt', 'T2_DE_DESY', 'T2_BE_IIHE', 'T2_RU_INR', 'T2_US_Purdue', 'T2_CH_CERN', 'T2_IT_Pisa', 'T3_US_FNALLPC', 'T2_DE_RWTH', 'T2_ES_CIEMAT', 'T3_US_CMU', 'T2_PT_NCG_Lisbon', 'T2_FR_CCIN2P3', 'T3_KR_KISTI'] to a CMS name through CMS Resource Catalog
2022-01-03 15:56:46,264:ERROR:DataDiscovery:got this exception:
(35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ')
2022-01-03 15:56:46,397:ERROR:Handler:Problem handling 220103_144838:cmsbot_crab_outputFiles because of (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ') failure, traceback follows
Traceback (most recent call last):
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/Handler.py", line 80, in executeAction
output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 243, in execute
result = self.executeInternal(*args, **kwargs)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 462, in executeInternal
result = self.formatOutput(task=kwargs['task'], requestname=self.taskName,
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DataDiscovery.py", line 62, in formatOutput
wmfile['locations'] = resourceCatalog.PNNstoPSNs(locations[wmfile['block']])
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 159, in PNNstoPSNs
mapping = self._CRICSiteQuery(callname='data-processing')
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 91, in _CRICSiteQuery
sitenames = self._getResult(uri, callname=callname, args=extraArgs)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 64, in _getResult
data = self.refreshCache(cachedApi, apiUrl)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Service.py", line 206, in refreshCache
self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Service.py", line 279, in getData
data, dummyStatus, dummyReason, from_cache = self["requests"].makeRequest(uri=url,
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 159, in makeRequest
result, response = self.makeRequest_pycurl(uri, data, verb, headers)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 176, in makeRequest_pycurl
response, result = self.reqmgr.request(uri, data, headers, verb=verb,
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/Utils/PortForward.py", line 69, in portMangle
return callFunc(callObj, url, *args, **kwargs)
File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 283, in request
curl.perform()
pycurl.error: (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ')
Maybe a good topic for @mapellidario next month? I am not comfortable with namedtuple and decorators (see https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/CRABInterface/Utilities.py#L164)
On hold, waiting for a decision on who will work on it.
I will need some guidance (as always), but I will happily take this!
Thanks Dario. Certainly! Let's assume that we can get to this sometime in March.
We forgot to remove the onhold label, doing it now.
Better to avoid talking to any external service from the REST: supporting proper authentication and debugging problems is too much of a pain. IIRC this access is only used to fill the site whitelist for MC with "all sites", which in principle can be done in the TW. I looked at this some time ago and the change seemed too big to be worth it. But now I think that it is worth the effort. @mapellidario FYI