dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

WMAgent don't find data #3198

Closed samircury closed 10 years ago

samircury commented 12 years ago

For some reason all recent requests I try to inject into the vocms13 WMAgent today are failing with the message :

{{{ Messages: Invalid WMSpec: 'cmsdataops_f-no-jemalloc_120202_131357_6520': Input data has no locations "/L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace" }}}

I don't see any relevant message in the WorkQueue Manager logs. Raised the logging level to DEBUG to see if we spot something.

I didn't see anythign about the request I injected, only :

{{{ install/couchdb/logs/stdout.log:[info] [<0.16501.0>] 128.142.229.57 - - 'GET' /globalwq_inbox/cmsdataops_anytest_120202_132334_3237 404 install/couchdb/logs/stdout.log:[info] [<0.16503.0>] 128.142.229.57 - - 'HEAD' /globalwq/cmsdataops_anytest_120202_132334_3237 404 install/couchdb/logs/stdout.log:[info] [<0.16503.0>] 128.142.229.57 - - 'PUT' /globalwq/cmsdataops_anytest_120202_132334_3237 201 install/couchdb/logs/stdout.log:[info] [<0.16503.0>] 128.142.229.57 - - 'PUT' /globalwq/cmsdataops_anytest_120202_132334_3237/spec?rev=1-b251cdda3e7db2f6b3adf68ed53e0c02 201 install/couchdb/logs/stdout.log:[info] [<0.16501.0>] 128.142.229.57 - - 'HEAD' /globalwq_inbox/cmsdataops_anytest_120202_132334_3237 404 install/couchdb/logs/stdout.log:[info] [<0.16504.0>] 128.142.229.57 - - 'GET' /globalwq/_design/WorkQueue/_view/elementsByParent?key=%22cmsdataops_anytest_120202_132334_3237%22&include_docs=true 200 install/couchdb/logs/stdout.log:[info] [<0.16501.0>] 128.142.229.57 - - 'PUT' /globalwq_inbox/_design/WorkQueue/_update/in-place/cmsdataops_anytest_120202_132334_3237?updates=%7B%22Status%22%3A+%22Failed%22%7D 201 install/couchdb/logs/stdout.log:[info] [<0.16501.0>] 128.142.229.57 - - 'GET' /globalwq_inbox/cmsdataops_anytest_120202_132334_3237 200 install/couchdb/logs/stdout.log:[info] [<0.16504.0>] 128.142.229.57 - - 'GET' /globalwq/_design/WorkQueue/_view/elementsByWorkflow?limit=0&reduce=false&key=%22cmsdataops_anytest_120202_132334_3237%22 200 install/couchdb/logs/stdout.log:[info] [<0.17154.0>] 128.142.229.57 - - 'GET' /reqmgrdb/cmsdataops_anytest_120202_132334_3237/spec 200

}}}

Maybe the 404's tell something

Same requests worked yesterday. Only thing that may have changed is something Oracle related as we had the upgrade yesterday. Maybe DBS? But I guess it should have been transparent.

sfoulkes commented 12 years ago

sfoulkes: That data is definitely at several sites. No other messages in the WorkQueueManager? Did it have any problems talking to PhEDEx?

sfoulkes commented 12 years ago

sfoulkes: Maybe it's possible it did this while the cmsweb upgrade was happening? Can you try another request?

samircury commented 12 years ago

samir: Only concerning messages I got were :

{{{

2134-2012-02-02 13:27:42,517:INFO:WorkQueueManagerLocationPoller:Updating data locations 522218-2012-02-02 13:30:25,525:INFO:BaseWorkerThread:Worker thread <WMComponent.WorkQueueManager.WorkQueueManagerCleaner.WorkQueueManagerCleaner instance at 0x12d49fc8> started 522388-2012-02-02 13:30:25,526:INFO:WorkQueueManagerCleaner:Start updating & cleaning... 522470-2012-02-02 13:30:26,028:INFO:WorkQueueManagerCleaner:Finished updating & cleaning. 522553-2012-02-02 13:33:03,352:INFO:WorkQueueReqMgrInterface:Contacting Request manager for more work 522648:2012-02-02 13:33:03,379:INFO:WorkQueueReqMgrInterface:Processing request cmsdataops_anytest_120202_132334_3237 at http://vocms13.cern.ch:5984/reqmgrdb/cmsdataops_anytest_120202_132334_3237/spec 522842:2012-02-02 13:33:03,380:INFO:WorkQueue:queueWork() begin queueing "http://vocms13.cern.ch:5984/reqmgrdb/cmsdataops_anytest_120202_132334_3237/spec" 522990:2012-02-02 13:33:03,678:INFO:WorkQueue:Splitting cmsdataops_anytest_120202_132334_3237 with policy Block params = {'ResubmitBlock': {'args': {}, 'name': 'ResubmitBlock'}, 'MonteCarlo': {'args': {}, 'name': 'MonteCarlo'}, 'Dataset': {'args': {}, 'name': 'Dataset'}, 'Block': {'args': {'SliceSize': 1500, 'policyName': 'Block', 'SliceType': 'NumberOfEvents'}, 'name': 'Block'}, 'DatasetBlock': {'args': {}, 'name': 'Dataset'}} 523415-2012-02-02 13:33:03,876:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/6093079634102884686_GET_seToCMSName_srm-cms.gridpp.rl.ac.uk.json is stale and the service at SEtoCMSName raised a BadStatusLine('',) when accessed 523737-2012-02-02 13:33:03,934:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/2814624582737769406_GET_seToCMSName_cmssrm.hep.wisc.edu.json is stale and the service at SEtoCMSName raised a BadStatusLine('',) when accessed 524055-2012-02-02 13:33:03,985:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/2258780201460884133_GET_seToCMSName_srm-cms.cern.ch.json is stale and the service at SEtoCMSName raised a BadStatusLine('',) when accessed 524369-2012-02-02 13:33:04,040:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/1182497830268614191_GET_seToCMSName_srm-eoscms.cern.ch.json is stale and the service at SEtoCMSName raised a BadStatusLine('',) when accessed 524686-2012-02-02 13:33:04,095:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/6332884222569658975_GET_seToCMSName_dcache-se-cms.desy.de.json is stale and the service at SEtoCMSName raised a BadStatusLine('',) when accessed 525006-2012-02-02 13:33:04,155:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/3638935438293509632_GET_seToCMSName_cmssrm.fnal.gov.json is stale and the service at SEtoCMSName raised a BadStatusLine('',) when accessed 525320:2012-02-02 13:33:04,346:INFO:WorkQueue:Failing workflow "cmsdataops_anytest_120202_132334_3237": Invalid WMSpec: 'cmsdataops_anytest_120202_132334_3237': Input data has no locations "/L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace" 525567:2012-02-02 13:33:04,371:INFO:WorkQueueReqMgrInterface:Permanent failure processing request "cmsdataops_anytest_120202_132334_3237": Invalid WMSpec: 'cmsdataops_anytest_120202_132334_3237': Input data has no locations "/L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace" 525849:2012-02-02 13:33:04,372:INFO:WorkQueueReqMgrInterface:Marking request cmsdataops_anytest_120202_132334_3237 as failed in ReqMgr 525977-2012-02-02 13:33:04,612:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/128.142.229.57:8687/2266295795689648715_PUT_request does not exist and the service at http://128.142.229.57:8687/reqmgr/reqMgr/request is unavailable - it returned 403 because Forbidden 526323- 526324-2012-02-02 13:33:04,613:ERROR:WorkQueueReqMgrInterface:Error updating ReqMgr about request "cmsdataops_StressTestT0_120131_122115_4326": 526462-2012-02-02 13:33:04,637:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/128.142.229.57:8687/-2234668731621091375_PUT_request does not exist and the service at http://128.142.229.57:8687/reqmgr/reqMgr/request is unavailable - it returned 403 because Forbidden

526809-

528269- 528270-2012-02-02 13:33:04,745:ERROR:WorkQueueReqMgrInterface:Error updating ReqMgr about request "cmsdataops_StressTestT0_5_120131_114038_3494": 528410-2012-02-02 13:33:04,768:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/128.142.229.57:8687/-4490398838844699079_PUT_request does not exist and the service at http://128.142.229.57:8687/reqmgr/reqMgr/request is unavailable - it returned 403 because Forbidden 528757- 528758-2012-02-02 13:33:04,769:ERROR:WorkQueueReqMgrInterface:Error updating ReqMgr about request "cmsdataops_StressTestT0SingleBLockJEMA_120201_172244_5463": 528911:2012-02-02 13:33:04,773:INFO:WorkQueue:Deleting request "cmsdataops_anytest_120202_132334_3237" as it is Failed 529023-2012-02-02 13:37:42,732:INFO:WorkQueueManagerLocationPoller:Updating data locations 529107-2012-02-02 13:40:26,031:INFO:WorkQueueManagerCleaner:Start updating & cleaning... 529189-2012-02-02 13:40:26,539:INFO:WorkQueueManagerCleaner:Finished updating & cleaning. 529272-2012-02-02 13:43:04,949:INFO:WorkQueueReqMgrInterface:Contacting Request manager for more work 529367-2012-02-02 13:43:05,039:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/128.142.229.57:8687/2266295795689648715_PUT_request does not exist and the service at http://128.142.229.57:8687/reqmgr/reqMgr/request is unavailable - it returned 403 because Forbidden }}}

for completeness :

{{{ config.section_('DBSInterface') config.DBSInterface.DBSBlockMaxSize = 5000000000000 config.DBSInterface.doGlobalMigration = False config.DBSInterface.globalDBSUrl = 'https://cmsdbsprod.cern.ch:8443/cms_dbs_prod_global_writer/servlet/DBSServlet' config.DBSInterface.DBSUrl = 'https://cmsdbsprod.cern.ch:8443/cms_dbs_prod_global_writer/servlet/DBSServlet' config.DBSInterface.DBSBlockMaxTime = 86400 config.DBSInterface.DBSBlockMaxFiles = 500 config.DBSInterface.DBSVersion = 'DBS_2_0_8' config.DBSInterface.globalDBSVersion = 'DBS_2_0_8' config.DBSInterface.MaxFilesToCommit = 200

}}}

This smells like when cert/key were not correct, will test that too.

Request timestamp is 13:30, Diego sent the mail saying all it was fine at 12:30

stuartw commented 12 years ago

swakef: Replying to [comment:3 samir]:

Only concerning messages I got were :

{{{ est does not exist and the service at http://128.142.229.57:8687/reqmgr/reqMgr/request is unavailable - it returned 403 because Forbidden

This smells like when cert/key were not correct, will test that too.

Something security related definitely isn't happy.

samircury commented 12 years ago

samir: I grepped the reqmgr logs, and I see mostly 200's:

{{{ [vocms13] /data/cmsprod/wmagent/current > grep cmsdataops_YATest1_120202_153027_1576 install/reqmgr/reqmgr/reqmgr.log 02Feb/2012:15:30:27] vocms13.cern.ch 137.138.53.201 "GET /reqmgr/view/details/cmsdataops_YATest1_120202_153027_1576 HTTP/1.1" 200 [data: - in 1061 out 43709 us ] [auth: - "" "" ] [ref: "http://vocms13.cern.ch:8687/reqmgr/create/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 Red Hat/3.6.24-3.el6_1 Firefox/3.6.24" ] 02Feb/2012:15:30:28] vocms13.cern.ch 137.138.53.201 "POST /reqmgr/view/splitting/cmsdataops_YATest1_120202_153027_1576 HTTP/1.1" 200 [data: - in 2203 out 29330 us ] [auth: - "" "" ] [ref: "http://vocms13.cern.ch:8687/reqmgr/view/details/cmsdataops_YATest1_120202_153027_1576" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 Red Hat/3.6.24-3.el6_1 Firefox/3.6.24" ] 02Feb/2012:15:30:34] vocms13.cern.ch 137.138.53.201 "POST /reqmgr/view/handleSplittingPage HTTP/1.1" 200 [data: - in 154 out 160181 us ] [auth: - "" "" ] [ref: "http://vocms13.cern.ch:8687/reqmgr/view/splitting/cmsdataops_YATest1_120202_153027_1576" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 Red Hat/3.6.24-3.el6_1 Firefox/3.6.24" ] 02Feb/2012:15:30:39] vocms13.cern.ch 137.138.53.201 "POST /reqmgr/view/doAdmin HTTP/1.1" 200 [data: - in 143 out 15715 us ] [auth: - "" "" ] [ref: "http://vocms13.cern.ch:8687/reqmgr/view/details/cmsdataops_YATest1_120202_153027_1576" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 Red Hat/3.6.24-3.el6_1 Firefox/3.6.24" ] request sanitised input_data: {'status': 'failed', 'requestName': 'cmsdataops_YATest1_120202_153027_1576'} message sanitised input_data: {'request': 'cmsdataops_YATest1_120202_153027_1576'} 02Feb/2012:15:33:09] vocms13.cern.ch 128.142.229.57 "PUT /reqmgr/reqMgr/message/cmsdataops_YATest1_120202_153027_1576 HTTP/1.1" 200 [data: - in 4 out 4034 us ] [auth: - "" "" ] [ref: "" "WMCore.Services.Requests/v001" ] request sanitised input_data: {'status': 'failed', 'requestName': 'cmsdataops_YATest1_120202_153027_1576'} 02Feb/2012:16:05:35] vocms13.cern.ch 137.138.53.201 "GET /reqmgr/view/details/cmsdataops_YATest1_120202_153027_1576 HTTP/1.1" 200 [data: - in 1156 out 82534 us ] [auth: - "" "" ] [ref: "http://vocms13.cern.ch:8687/reqmgr/GlobalMonitor/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 Red Hat/3.6.24-3.el6_1 Firefox/3.6.24" ]

}}}

samircury commented 12 years ago

samir: {{{

2012-02-03 17:42:43,324:ERROR:WorkQueueReqMgrInterface:Error updating ReqMgr about request "cmsdataops_StressTestT0_5_120131_114038_3494": 2012-02-03 17:42:43,349:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/128.142.229.57:8687/-4490398838844699079_PUT_request does not exist and the service at http://128.142.229.57:8687/reqmgr/reqMgr/request is unavailable - it returned 403 because Forbidden }}}

The files are there, but empty, if I :

{{{

vocms13] /data/cmsprod/wmagent/current > curl -X GET http://128.142.229.57:8687/reqmgr/reqMgr/request[{'percent_success': 0, 'Group': 'cmsdataops', 'Requestor': 'cmsdataops', 'RequestType': 'ReReco', 'InputDatasetTypes': {'/L1JetHPF/Run2011B-v1/RAW': 'source'}, 'ReqMgrRequestID': 1L, 'OutputDatasets': [], 'SoftwareVersions': ['CMSSW_4_4_2_p10JEmalloc'], 'RequestSizeEvents': 0L, 'AcquisitionEra': None, 'RequestSizeFiles': 0L, 'RequestPriority': 2L, 'RequestName': 'cmsdataops_StressTestT0_120130_112138_8153', 'InputDatasets': ['/L1JetHPF/Run2011B-v1/RAW'], 'percent_complete': 0, 'RequestWorkflow'

MORE... }}}

samircury commented 12 years ago

samir: It seems the error is referring to this piece of code

{{{ def report(self, queue): """Report queue status to ReqMgr.""" new_state = {}

    elements = queue.statusInbox(dictKey = "RequestName")
    if not elements:
        return new_state

    for ele in elements:
        ele = elements[ele]0 # 1 element tuple
        # check if element has progressed since last report
        if getattr(ele, 'updatetime', getattr(ele, 'timestamp', 0)) <= self.previous_state.get(ele['RequestName'], 0):
            # no updates for this element
            new_state[ele['RequestName']] = self.previous_state.get(ele['RequestName'], 0)
            continue

        try:
            self.reportElement(ele)
        except Exception, ex:
            msg = 'Error updating ReqMgr about request "%s": %s'
            self.logger.error(msg % (ele['RequestName'], str(ex)))
        else:
            new_state[ele['RequestName']] = ele.updatetime
    self.previous_state = new_state
    return elements

}}}

Which fails like the previous post, so I won't repeat.

What is new is that I went to ReqMgr logs to search about those requests and why we're getting 403, which looks so misleading, because it seems it's not security issue, but a logic issue, that the agent is trying to change state from failed to running. After itself failed the request for not finding data. Actually I don't know if it blindly changes the request status without checking the previous one.

{{{ 48299056-DEBUG:cherrypy.error:request raw data: {'args': (), 'kwargs': {}} 48299122-DEBUG:cherrypy.error:request sanitised input_data: {'status': 'running', 'requestName': 'cmsdataops_StressTestT0SingleBLockJEMA_120201_172244_5463'} 48299271-DEBUG:cherrypy.error:call to PUT with args: ['request'] kwargs: {'status': 'running', 'requestName': 'cmsdataops_StressTestT0SingleBLockJEMA_120201_172244_5463'} resulted in Failed to change status: Cannot change status from failed to running. Allowed values are ['failed', 'testing-approved', 'assigned'] 48299578:INFO:cherrypy.access:08Feb/2012:10:38:02] vocms13.cern.ch 128.142.229.57 "PUT /reqmgr/reqMgr/request HTTP/1.1" 403 [data: - in 186 out 13466 us ] [auth: - "" "" ] [ref: "" "WMCore.Services.Requests/v001" ] 48299786-WARNING:cherrypy.access:NullAuth called for: 48299831-WARNING:cherrypy.access: role(s): [] 48299869- group(s): [] 48299884- site(s): [] 48299897-DEBUG:cherrypy.error:agentmonitor raw data: {'args': (), 'kwargs': {}}

}}}

stuartw commented 12 years ago

swakef: If the requestis failed why is reqmgr presenting it to workqueue when it asks for new work? Also, why do you think its trying to mark it as running, from the log its trying to update the state to failed.

samircury commented 12 years ago

samir: Ok, what I just posted is "unrelated", but also a problem, it seems, not deadly urgent, because if requests follow the natural path it wouldn't happen.

So, I found out that besides the misleading error, the problem was that the sitedb API needs :

{{{ export X509_USER_CERT=/data/cmsprod/X509/tier0certv13.pem export X509_USER_KEY=/data/cmsprod/X509/tier0keyv13.pem }}}

Before the start-agent in order to authenticate. I propose we put this as a mandatory condition so the manage script would refuse to start if this is not set. Another option is to crash the component on purpose or put a warning in the logs so we know what's the potential pitfall. Preslav told me that the Condor(?) and Glite plugins need those guys set, so I guess it could be a general prerequisite.

Or to make it simple, as I see this way is done, (no couch certs, we use X509 certs (envvars)), we could set those vars as the couch cert and key, that would also work.

What is essential is to make sure no one will be able to start everything, without the most basic requisites, that would make the system to not work properly and still not tell you exactly why.