dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

WMAgent - Not finding block location in vocms13 #3180

Closed samircury closed 12 years ago

samircury commented 12 years ago

Anything I try to inject after applying the patch that fixes the splitting now fails with:

{{{ Invalid WMSpec: 'cmsdataops_TestNewRecoSplitting2_120129_115318_9955': Input data has no locations "/L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace" }}}

DAS says that the data is on T0_CH_CERN_MSS:

https://cmsweb.cern.ch/das/request?view=list&limit=10&instance=cms_dbs_prod_global&input=block%3D%2FL1JetHPF%2FRun2011B-v1%2FRAW%239e3193ba-ff28-11e0-8e87-003048caaace

I made two requests (grep for "TestNewRecoSplitting" to find them): the first used event-based splitting (3000) and the second used the default lumi-based splitting (2).

Same error on both. Posting here so devs can comment.

samircury commented 12 years ago

samir: Ah, I should have posted this too. The install is in:

[vocms13] /data/cmsprod/wmagent/current >

(this will change in the oracle WMA)

and ReqMgr is :

http://vocms13.cern.ch:8687/reqmgr

stuartw commented 12 years ago

swakef: What patch?

I can't log in to that machine (is it in the production e-group?) and the reqmgr web pages appear to be broken, so I can't find the summary page for that request. Can you point me to the reqmgr info for that request so I can have a look at the spec?

samircury commented 12 years ago

samir: I'm sorry, we thought that only Steve would look into this in the afternoon, and we had to go on with some stress tests, which wouldn't be possible with this bug, so I ran a "clean-all" to remove the patch so we could go on with the tests.

Maybe it is better if we follow up with Steve so he can redeploy the patch and we can reproduce the problem.

Thanks anyway.

The patch is the one that fixes the ReReco splitting which was not working properly. I never saw the patch or revision number but I know it was applied.

samircury commented 12 years ago

samir: By the way, it seems that the clean-all didn't remove the bug. I'm still getting the same problem with a request I just injected. You probably can't log in to the machine, but if the ReqMgr helps, it's running. I will ask the CRC who can give you login privileges.

samircury commented 12 years ago

samir: Found this in the WorkQueueManager logs :

{{{
2012-01-30 11:24:17,588:INFO:WorkQueueReqMgrInterface:Contacting Request manager for more work
2012-01-30 11:24:17,629:INFO:WorkQueueReqMgrInterface:Processing request cmsdataops_StressTestT0_120130_112138_8153 at http://vocms13.cern.ch:5984/reqmgrdb/cmsdataops_StressTestT0_120130_112138_8153/spec
2012-01-30 11:24:17,629:INFO:WorkQueue:queueWork() begin queueing "http://vocms13.cern.ch:5984/reqmgrdb/cmsdataops_StressTestT0_120130_112138_8153/spec"
2012-01-30 11:24:17,919:INFO:WorkQueue:Splitting cmsdataops_StressTestT0_120130_112138_8153 with policy Block params = {'ResubmitBlock': {'args': {}, 'name': 'ResubmitBlock'}, 'MonteCarlo': {'args': {}, 'name': 'MonteCarlo'}, 'Dataset': {'args': {}, 'name': 'Dataset'}, 'Block': {'args': {'SliceSize': 3000, 'policyName': 'Block', 'SliceType': 'NumberOfEvents'}, 'name': 'Block'}, 'DatasetBlock': {'args': {}, 'name': 'Dataset'}}
2012-01-30 11:24:18,946:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/6093079634102884686_GET_seToCMSName_srm-cms.gridpp.rl.ac.uk.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed
2012-01-30 11:24:18,996:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/2814624582737769406_GET_seToCMSName_cmssrm.hep.wisc.edu.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed
2012-01-30 11:24:19,046:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/2258780201460884133_GET_seToCMSName_srm-cms.cern.ch.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed
2012-01-30 11:24:19,096:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/1182497830268614191_GET_seToCMSName_srm-eoscms.cern.ch.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed
2012-01-30 11:24:19,146:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/6332884222569658975_GET_seToCMSName_dcache-se-cms.desy.de.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed
2012-01-30 11:24:19,196:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/3638935438293509632_GET_seToCMSName_cmssrm.fnal.gov.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed
2012-01-30 11:24:19,839:INFO:WorkQueue:Failing workflow "cmsdataops_StressTestT0_120130_112138_8153": Invalid WMSpec: 'cmsdataops_StressTestT0_120130_112138_8153': Input data has no locations "/L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace"
2012-01-30 11:24:19,861:INFO:WorkQueueReqMgrInterface:Permanent failure processing request "cmsdataops_StressTestT0_120130_112138_8153": Invalid WMSpec: 'cmsdataops_StressTestT0_120130_112138_8153': Input data has no locations "/L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace"
2012-01-30 11:24:19,861:INFO:WorkQueueReqMgrInterface:Marking request cmsdataops_StressTestT0_120130_112138_8153 as failed in ReqMgr
2012-01-30 11:24:20,089:INFO:WorkQueue:Deleting request "cmsdataops_StressTestT0_120130_112138_8153" as it is Failed
}}}

This could be a wrong CMS_NAME. I will try to change them in the DB.

stuartw commented 12 years ago

swakef: Replying to [comment:5 samir]:

2012-01-30 11:24:18,946:WARNING:Service:The cachefile /data/cmsprod/wmagent/0.8.23/install/workqueue/WorkQueueManager/.wmcore_cache/.wmcore_cache_5410/requests/cmsweb.cern.ch/6093079634102884686_GET_seToCMSName_srm-cms.gridpp.rl.ac.uk.json does not exist and the service at https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName raised a BadStatusLine('',) when accessed

OK, so that's the problem. The call to resolve SEs to sites is failing, and thus the WorkQueue doesn't see any sites hosting the data.
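To illustrate the failure mode described here, the following is a minimal sketch, not WMCore's actual code. The names `resolve_sites` and `failing_lookup` are hypothetical, and the assumption is simply that each failed SE lookup is logged and skipped, so a block whose lookups all fail ends up with an empty set of sites.

```python
def resolve_sites(storage_elements, se_to_site):
    """Map storage element hostnames to site names, skipping failed lookups.

    `se_to_site` is any callable taking a hostname and returning a site name;
    it stands in for the SiteDB SEtoCMSName call (hypothetical interface).
    """
    sites = set()
    for se in storage_elements:
        try:
            sites.add(se_to_site(se))
        except Exception as exc:  # e.g. a BadStatusLine raised by the HTTP layer
            print(f"WARNING: could not resolve {se}: {exc!r}")
    return sites


def failing_lookup(se):
    # Stand-in for the lookup when the service rejects every request.
    raise ConnectionError("BadStatusLine")


# Hostnames taken from the warnings in the log above.
block_ses = ["srm-cms.cern.ch", "cmssrm.fnal.gov"]
sites = resolve_sites(block_ses, failing_lookup)
if not sites:
    # With every lookup failing, the block appears to be hosted nowhere.
    print("Input data has no locations")
```

Under this assumption the data location reported by DAS is irrelevant: the block is rejected because no SE could be translated into a site name, not because the block is missing.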

Changing the name shouldn't make a difference; that would result in a different error. Can you check that the WorkQueueManager has a valid proxy/cert, then try a manual curl and see what you get:

{{{
curl -v -k --cert $X509_USER_PROXY --key $X509_USER_PROXY 'https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName?name=cmssrm.hep.wisc.edu'
}}}
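For anyone who prefers to run the same check from Python, here is a rough equivalent of the curl command using only the standard library. It is a sketch under assumptions: the helper names `build_url`, `make_context` and `lookup` are made up for this example, the endpoint is the one quoted in this ticket and may no longer be served, and disabling certificate verification mirrors curl's `-k` and is only appropriate for a quick manual test.

```python
import ssl
import urllib.parse
import urllib.request

# Endpoint as quoted in this ticket.
SITEDB_URL = "https://cmsweb.cern.ch/sitedb/json/index/SEtoCMSName"


def build_url(se_name):
    """Return the lookup URL for one storage element hostname."""
    return SITEDB_URL + "?" + urllib.parse.urlencode({"name": se_name})


def make_context(proxy_path):
    """Build an SSL context that presents the proxy as client cert and key."""
    ctx = ssl.create_default_context()
    # Equivalent of curl -k: skip server verification (manual debugging only).
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # Equivalent of --cert/--key; raises ssl.SSLError if the file has no usable key.
    ctx.load_cert_chain(certfile=proxy_path, keyfile=proxy_path)
    return ctx


def lookup(se_name, proxy_path):
    """Perform the request and return (HTTP status, response body)."""
    ctx = make_context(proxy_path)
    with urllib.request.urlopen(build_url(se_name), context=ctx, timeout=30) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

A call would look like `lookup("cmssrm.hep.wisc.edu", os.environ["X509_USER_PROXY"])` after `import os`. If `make_context` already fails, the problem is in the credential files rather than in the service.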

samircury commented 12 years ago

samir: Talking to Stuart in the chat, we found the reason. My bad: in the secrets file a cert was set as the key. I updated the X509* variables and restarted, and that did it.

{{{
2012-01-30 14:38:58,942:INFO:WorkQueueReqMgrInterface:Contacting Request manager for more work
2012-01-30 14:38:58,971:INFO:WorkQueueReqMgrInterface:Processing request cmsdataops_StressTestT0_4_120130_142844_1970 at http://vocms13.cern.ch:5984/reqmgrdb/cmsdataops_StressTestT0_4_120130_142844_1970/spec
2012-01-30 14:38:58,971:INFO:WorkQueue:queueWork() begin queueing "http://vocms13.cern.ch:5984/reqmgrdb/cmsdataops_StressTestT0_4_120130_142844_1970/spec"
2012-01-30 14:38:59,263:INFO:WorkQueue:Splitting cmsdataops_StressTestT0_4_120130_142844_1970 with policy Block params = {'ResubmitBlock': {'args': {}, 'name': 'ResubmitBlock'}, 'MonteCarlo': {'args': {}, 'name': 'MonteCarlo'}, 'Dataset': {'args': {}, 'name': 'Dataset'}, 'Block': {'args': {'SliceSize': 3000, 'policyName': 'Block', 'SliceType': 'NumberOfEvents'}, 'name': 'Block'}, 'DatasetBlock': {'args': {}, 'name': 'Dataset'}}
2012-01-30 14:39:00,831:INFO:WorkQueue:Queuing element for cmsdataops_StressTestT0_4_120130_142844_1970:DataProcessing with 277 job(s) split with Block on /L1JetHPF/Run2011B-v1/RAW#9e3193ba-ff28-11e0-8e87-003048caaace
2012-01-30 14:39:01,044:INFO:WorkQueue:Split work for request(s): "cmsdataops_StressTestT0_4_120130_142844_1970"
2012-01-30 14:39:01,075:INFO:WorkQueueReqMgrInterface:1 units(s) queued for "cmsdataops_StressTestT0_4_120130_142844_1970"
2012-01-30 14:39:01,076:INFO:WorkQueueReqMgrInterface:1 element(s) obtained from RequestManager
}}}
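A quick way to catch the cert-as-key mix-up before restarting the agent is to look at the PEM block labels in the file that is configured as the key. The sketch below is an illustration only: `pem_block_types` and `looks_like_private_key` are hypothetical helpers, not part of WMCore, and the check inspects labels only without validating the key material.

```python
import re

# Matches PEM headers such as "-----BEGIN CERTIFICATE-----".
PEM_HEADER = re.compile(r"-----BEGIN ([A-Z0-9 ]+)-----")


def pem_block_types(pem_text):
    """Return the labels of all PEM blocks found in the text, in order."""
    return PEM_HEADER.findall(pem_text)


def looks_like_private_key(pem_text):
    """True if at least one PEM block is labelled as a private key."""
    return any(label.endswith("PRIVATE KEY") for label in pem_block_types(pem_text))


def check_key_file(path):
    """Read a file configured as the key and report whether it holds a key."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return looks_like_private_key(handle.read())
```

A file that contains only `CERTIFICATE` blocks makes `check_key_file` return `False`, which is the situation described above. A proxy file normally contains both certificate and key blocks, so it passes.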

You can close the ticket.