dmwm / CRABServer

make CRIC access more robust #5800

Open belforte opened 5 years ago

belforte commented 5 years ago

We get a lot of errors from CRIC; in a few cases our retries are not enough and submissions fail. Here is a recent example from https://cmsweb.cern.ch/crabserver/prod/task?subresource=lastfailures (shown below).

We should keep a tally of such errors and follow up with the CRIC developers. In any case, until that is resolved, we had better make sure that we cache long enough, share the cache across workers, and retry more.

[
    "mratti",
    "181201_110330:mratti_crab_2016_V02_V11_WJetsToLNu_HT-200To400_TuneCUETP8M1_13TeV-madgraphMLM-pythia8",
    "Problem handling 181201_110330:mratti_crab_2016_V02_V11_WJetsToLNu_HT-200To400_TuneCUETP8M1_13TeV-madgraphMLM-pythia8 because of [Errno 111] Connection refused failure, traceback follows\nTraceback (most recent call last):\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/TaskWorker/Actions/Handler.py\", line 77, in executeAction\n    output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/TaskWorker/Actions/DBSDataDiscovery.py\", line 92, in execute\n    result = self.executeInternal(*args, **kwargs)\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/TaskWorker/Actions/DBSDataDiscovery.py\", line 250, in executeInternal\n    tempDir = kwargs['tempDir'])\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/TaskWorker/Actions/DataDiscovery.py\", line 69, in formatOutput\n    wmfile['locations'] = resourceCatalog.PNNstoPSNs(locations[wmfile['block']])\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/WMCore/Services/CRIC/CRIC.py\", line 141, in PNNstoPSNs\n    mapping = self._getResult(uri, callname='data-processing', args=extraArgs)\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/WMCore/Services/CRIC/CRIC.py\", line 55, in _getResult\n    data = self.refreshCache(cachedApi, apiUrl)\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/WMCore/Services/Service.py\", line 205, in refreshCache\n    self.getData(cachefile, url, inputdata, incoming_headers, encoder, 
decoder, verb, contentType)\n  File \"/data/srv/TaskManager/3.3.1810.patch7/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1810.patch7/lib/python2.7/site-packages/WMCore/Services/Service.py\", line 313, in getData\n    raise he\nerror: [Errno 111] Connection refused\n"

],
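
For illustration only, here is a minimal sketch of the three mitigations proposed above (retry more, cache long enough, share the cache across workers). It is not the WMCore `Service` implementation: `fetch_with_retry`, `cached_fetch`, the cache file path and the default values are all hypothetical names and numbers chosen for the example, and `fetch` stands for whatever callable actually queries CRIC.

```python
import json
import os
import tempfile
import time


def fetch_with_retry(fetch, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except OSError:  # includes "[Errno 111] Connection refused"
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def cached_fetch(cache_path, fetch, max_age=3600, **retry_kwargs):
    """Return JSON data from a cache file shared by all workers.

    cache_path is a hypothetical location on a filesystem every worker can see.
    """
    try:
        fresh = time.time() - os.path.getmtime(cache_path) < max_age
    except OSError:
        fresh = False  # no cache file yet
    if fresh:
        with open(cache_path) as handle:
            return json.load(handle)
    try:
        data = fetch_with_retry(fetch, **retry_kwargs)
    except OSError:
        # Still failing after all retries: prefer stale data over failing the task.
        if os.path.exists(cache_path):
            with open(cache_path) as handle:
                return json.load(handle)
        raise
    # Write to a temporary file and rename it, so that other workers
    # never read a partially written cache.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle)
    os.replace(tmp_path, cache_path)
    return data
```

The design choice here is that a stale cache is only used when the refresh fails after all retries, so a short CRIC outage would not turn into a failed submission as long as any worker has fetched the mapping before.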
belforte commented 5 years ago

We should also look at falling back on their EOS cache: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1644/1/1.html
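
A rough sketch of what such a fallback could look like, assuming the EOS cache is readable as a JSON file at some path. The path, the file format and the helper names `first_available` and `load_json_file` are assumptions made for this example and are not taken from the HyperNews post.

```python
import json


def load_json_file(path):
    """Read a JSON document from a local or mounted path."""
    with open(path) as handle:
        return json.load(handle)


def first_available(sources):
    """Try each (name, loader) pair in order.

    Returns (name, data) from the first loader that succeeds, so the caller
    can log which source was actually used.
    """
    errors = []
    for name, loader in sources:
        try:
            return name, loader()
        except (OSError, ValueError) as exc:  # connection, file or JSON decoding errors
            errors.append("%s: %s" % (name, exc))
    raise RuntimeError("all sources failed: " + "; ".join(errors))
```

Usage would be something like `first_available([("cric", query_cric), ("eos", lambda: load_json_file(eos_path))])`, where `query_cric` and `eos_path` are placeholders for the real CRIC call and the real cache location.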