dmwm / CRABServer


CRABServer REST should not talk to CRIC #6917

Open belforte opened 2 years ago

belforte commented 2 years ago

Better to avoid talking to any external service from the REST: supporting proper authentication and debugging problems is too much of a pain. IIRC this access is only used to fill the site whitelist for MC with "all sites", which in principle can be done in the TaskWorker. I looked at this some time ago and the change seemed too big to be worth it. But now I think that it is worth the effort. @mapellidario FYI

belforte commented 2 years ago

We just had a small storm of CRABServer failures over 1h, all due to errors in talking to CRIC, so I am increasing the priority.

belforte commented 2 years ago

The first thing should be to make this code capable of dealing with wildcards in the site list: https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/TaskWorker/Actions/DagmanCreator.py#L777-L781

Then we will worry about modifying the REST to pass the list with the *'s in it.
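As a minimal sketch of what wildcard handling could look like (this is an illustrative example, not the actual DagmanCreator code; the function name `expandSites` here mirrors the existing helper only in spirit), the stdlib `fnmatch` module already does shell-style `*` matching:

```python
# Hypothetical sketch: expand wildcard entries like 'T2_IT_*' against the
# full site list obtained from CRIC. Not the actual CRABServer implementation.
import fnmatch

def expandSites(patterns, allSites):
    """Return the sorted set of sites matched by a list of (possibly wildcarded) patterns."""
    expanded = set()
    for pattern in patterns:
        if '*' in pattern:
            # fnmatch.filter keeps only the names matching the shell-style pattern
            expanded.update(fnmatch.filter(allSites, pattern))
        else:
            expanded.add(pattern)
    return sorted(expanded)
```

With this in the TaskWorker, the REST could pass the user's list with the `*`'s untouched and the expansion would happen where the CRIC site list is available.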

belforte commented 2 years ago

The current code in the REST relies on https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/CRABInterface/RESTUserWorkflow.py#L44 and https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/CRABInterface/Utilities.py#L172-L176 to get the list of sites from CRIC every 30 min and cache it in memory. (By the way, that's hard to understand, since the WMCore CRIC class already has a 1h default cache inside... oh well...)

TaskWorker actions are done in independent processes, so it would make sense to reuse _expandSites but cache the site list in a file on disk instead (like we used to do with SiteDB info a long time ago); the list of sites from CRIC does not need to be refreshed any faster than once a day! Anyhow, since it is one call per task, we may even do it every time: the rate is low, it is only a matter of riding out outages. There should be two times set: one for how often to refresh the cache, and one for how long stale data can still be used.

The current caching reduces the number of calls to the external service, but makes things fail miserably if the server is down when the cache expires.

We should definitely combine this with the call to CRIC in DataDiscovery and have a single cache file, ref. https://github.com/dmwm/CRABServer/issues/6946, or at least a common access method with the refresh+use-stale policy.
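The refresh+use-stale policy described above could be sketched roughly as below. This is a hypothetical illustration, not existing CRABServer code: `fetchSitesFromCric` and `CACHE_FILE` are made-up names, and the real implementation would need locking and proper logging.

```python
# Hypothetical sketch of a disk-file cache with a refresh + use-stale policy:
# refresh from CRIC at most once per MAX_AGE, but if CRIC is down and a stale
# cache file exists, keep using it to ride out the outage.
import json
import os
import time

CACHE_FILE = '/tmp/cric_sites.json'   # assumed location, not the real one
MAX_AGE = 24 * 3600                   # site list needs refreshing at most daily

def getSiteList(fetchSitesFromCric):
    """Return the CRIC site list, preferring a fresh cache, then CRIC, then stale cache."""
    try:
        age = time.time() - os.path.getmtime(CACHE_FILE)
    except OSError:
        age = None  # no cache file yet
    if age is not None and age < MAX_AGE:
        with open(CACHE_FILE) as f:
            return json.load(f)
    try:
        sites = fetchSitesFromCric()
        # write atomically so a concurrent reader never sees a partial file
        with open(CACHE_FILE + '.tmp', 'w') as f:
            json.dump(sites, f)
        os.replace(CACHE_FILE + '.tmp', CACHE_FILE)
        return sites
    except Exception:
        if age is not None:  # CRIC is down: fall back to the stale cache
            with open(CACHE_FILE) as f:
                return json.load(f)
        raise
```

A single helper like this, shared between _expandSites and DataDiscovery, would give both code paths the same single cache file and the same outage behavior.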

belforte commented 2 years ago

Let's move the info from https://github.com/dmwm/CRABServer/issues/6946 in here, to simplify tracking. From the TW log on Jan 3, 2022:

2022-01-03 15:55:02,724:INFO:DBSDataDiscovery:Looking up data location with Rucio in cms scope.
2022-01-03 15:55:03,132:DEBUG:DataDiscovery: Formatting data discovery output 
2022-01-03 15:56:46,259:ERROR:DataDiscovery:Impossible translating ['T2_US_UCSD', 'T2_PK_NCP', 'T2_RU_IHEP', 'T2_UA_KIPT', 'T1_FR_CCIN2P3_Disk', 'T1_ES_PIC_Disk', 'T3_KR_UOS', 'T2_AT_Vienna', 'T1_US_FNAL_Disk', 'T2_FR_IPHC', 'T3_US_Colorado', 'T2_IT_Bari', 'T3_TW_NTU_HEP', 'T2_UK_SGrid_RALPP', 'T3_IT_Trieste', 'T2_BR_SPRACE', 'T1_DE_KIT_Disk', 'T2_US_Caltech', 'T2_UK_London_Brunel', 'T2_IT_Legnaro', 'T2_IT_Rome', 'T2_CH_CSCS', 'T2_BE_UCL', 'T2_GR_Ioannina', 'T3_KR_KNU', 'T2_UK_London_IC', 'T3_US_UMiss', 'T2_UK_SGrid_Bristol', 'T1_IT_CNAF_Disk', 'T2_HU_Budapest', 'T0_CH_CERN_Disk', 'T2_US_MIT', 'T3_CH_PSI', 'T1_UK_RAL_Disk', 'T2_US_Caltech_Ceph', 'T3_BG_UNI_SOFIA', 'T2_RU_JINR', 'T2_BR_UERJ', 'T3_US_NotreDame', 'T2_FR_GRIF_LLR', 'T2_ES_IFCA', 'T2_US_Wisconsin', 'T3_FR_IPNL', 'T3_US_NERSC', 'T2_FR_GRIF_IRFU', 'T2_FI_HIP', 'T2_PL_Swierk', 'T3_US_Rutgers', 'T2_TR_METU', 'T3_US_MIT', 'T2_US_Nebraska', 'T2_KR_KISTI', 'T2_CN_Beijing', 'T2_EE_Estonia', 'T3_US_Baylor', 'T2_US_Florida', 'T1_RU_JINR_Disk', 'T2_US_Vanderbilt', 'T2_DE_DESY', 'T2_BE_IIHE', 'T2_RU_INR', 'T2_US_Purdue', 'T2_CH_CERN', 'T2_IT_Pisa', 'T3_US_FNALLPC', 'T2_DE_RWTH', 'T2_ES_CIEMAT', 'T3_US_CMU', 'T2_PT_NCG_Lisbon', 'T2_FR_CCIN2P3', 'T3_KR_KISTI'] to a CMS name through CMS Resource Catalog
2022-01-03 15:56:46,264:ERROR:DataDiscovery:got this exception:
 (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ')
2022-01-03 15:56:46,397:ERROR:Handler:Problem handling 220103_144838:cmsbot_crab_outputFiles because of (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ') failure, traceback follows
Traceback (most recent call last):
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/Handler.py", line 80, in executeAction
    output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 243, in execute
    result = self.executeInternal(*args, **kwargs)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 462, in executeInternal
    result = self.formatOutput(task=kwargs['task'], requestname=self.taskName,
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DataDiscovery.py", line 62, in formatOutput
    wmfile['locations'] = resourceCatalog.PNNstoPSNs(locations[wmfile['block']])
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 159, in PNNstoPSNs
    mapping = self._CRICSiteQuery(callname='data-processing')
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 91, in _CRICSiteQuery
    sitenames = self._getResult(uri, callname=callname, args=extraArgs)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 64, in _getResult
    data = self.refreshCache(cachedApi, apiUrl)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Service.py", line 206, in refreshCache
    self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Service.py", line 279, in getData
    data, dummyStatus, dummyReason, from_cache = self["requests"].makeRequest(uri=url,
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 159, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 176, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/Utils/PortForward.py", line 69, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 283, in request
    curl.perform()
pycurl.error: (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ')
belforte commented 2 years ago

Maybe a good topic for @mapellidario next month? I am not comfortable with namedtuples and decorators (see https://github.com/dmwm/CRABServer/blob/2ed6f598b6ff5ca0812d9430aaadff40d5732101/src/python/CRABInterface/Utilities.py#L164)
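For reference, the decorator pattern in question is essentially time-based memoization. Here is a minimal, self-contained sketch of the idea (illustrative only; the names and structure are not those of the actual Utilities.py code):

```python
# Minimal sketch of a time-based memoization decorator, similar in spirit
# to the in-memory caching in CRABInterface/Utilities.py. Illustrative names.
import functools
import time

def ttl_cache(seconds):
    """Cache a function's return value per-arguments for `seconds` seconds."""
    def decorator(func):
        cache = {}  # maps args tuple -> (timestamp, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = time.time()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh: return the cached value
            value = func(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator
```

Decorating a fetch function with `@ttl_cache(1800)` would reproduce the "query CRIC at most every 30 min" behavior the REST currently has.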

belforte commented 2 years ago

On hold, waiting for a decision on who will work on it.

mapellidario commented 2 years ago

I will need some guidance (as always), but I will happily take this!

belforte commented 2 years ago

Thanks Dario, certainly! Let's assume that we can get to this sometime in March.

belforte commented 2 years ago

We forgot to remove the on-hold label; doing it now.