Closed by ticoann 10 years ago
@ticoann does the following look reasonable?
https://github.com/alexanderrichards/WMCore/compare/timestamp_sorting
Cheers Alex
@alexanderrichards, I think that is the right place, but I am not sure whether it works. Are you able to set up a test? I think this is one of those problems where the test might be much harder than the fix. The test would be submitting the same workflow (meaning with the same spec) and seeing whether it gets picked up in turn. I am not sure which workflow settings will give a quick result. I can ask around.
I am again having difficulty following the instructions on:
I keep getting into problems like the following again:
Warning: Using storage engine MyISAM for table '...'
I had this issue before and couldn't fix it but I have followed your instructions on the above url.
Cheers Alex
@alexanderrichards, as we discussed before, I wasn't able to reproduce the problem. Could you give me access to the VM you are installing on so I can try to deploy? In any case, I can try to test this patch, but you should still solve this problem.
yes sure, do I need your ssh key to give you access?
Annoyingly things are getting worse! Now when I wipe everything clean and start again I get problems at this stage.
./Deploy -r comp=comp -s sw -A slc5_amd64_gcc461 -t v0.9.82 $DEPLOY_DIR wmagent@0.9.82
I see the following error:
INFO: 20140131121719: starting deployment of: wmagent@0.9.82
INFO: deploying wmagent - variant: default, version: 0.9.82
INFO: bootstrapping comp software area in /data/srv/wmagent/v0.9.82/sw
ERROR: bootstrap failed
INFO: installation log can be found in /data/srv/wmagent/.deploy/20140131-121719-15044-sw.log
ERROR: installation failed with exit code 1
Looking at the log sheds no light on the situation:
++ mkdir -p /data/srv/wmagent/v0.9.82/sw
++ cd /data/srv/wmagent/v0.9.82/sw
++ curl -sO http://cmsrep.cern.ch/cmssw/comp/bootstrap.sh
++ sh -x ./bootstrap.sh -architecture slc5_amd64_gcc461 -path /data/srv/wmagent/v0.9.82/sw -repository comp setup
++ '[' 1 = 0 ']'
++ note 'ERROR: bootstrap failed'
++ echo 'ERROR: bootstrap failed'
ERROR: bootstrap failed
except to point out that the bootstrap failed
Running the bootstrap command again by hand:
sh -x ./bootstrap.sh -architecture slc5_amd64_gcc461 -path /data/srv/wmagent/v0.9.82/sw -repository comp setup
gives firstly the warning:
Warning, /data/srv/wmagent/v0.9.82/sw already set up. Do you want to reconfigure it? [ y / N ]
Which seems reasonable, since I'm now running it manually after having already tried before, so I say yes here and then I get:
- sed -e 's|[$]RPM_INSTALL_PREFIX|/data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/inst|g'
- sh -ex /tmp/arichard/tmp16704/scriptlets/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm.pre.sh
++ id -u
- '[' X22855 = X0 ']'
- perl ./myrpm2cpio /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm
- cpio -id
can't pipe to gzip
cpio: premature end of archive
- cleanup_and_exit 1 'Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm'
- exitcode=1
- exitmessage='Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm'
- '[' 'XUnable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm' = X ']'
- echo
- echo Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm
- Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm
I have no idea what the problem is with the rpm, and I can't test it either, as the last thing the bootstrap does before exit 1 is
- rm -rf /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP
hence rpm is gone now anyway!
hmmm seems like it might have been a server issue as when I retry the same bootstrap it now completes ok and I didn't change anything.
Cheers Alex
ok so not quite there. I tried again and again got the unpacking error. I ran the bootstrap three times in succession and on the third time it just worked. This, I guess, is some server error. However, things still aren't quite right, as:
ls $manage
gives:
ls: /data/srv/wmagent/current/config/wmagent/manage: No such file or directory
Hi Alex, I tried the script you sent me in a separate email on the new VM, and I didn't have any problem installing it. I am not sure what went wrong when you ran it. Instead of trying to debug the problem, could you try again in a fresh VM?
Yes indeed this works if I start with a clean VM. When I try to inject the request though I see the following:
INFO:root:Injecting a request for arguments (REST API): {u'TotalTime': 14400, u'PrepID': u'MCTEST-GEN-0001', u'GlobalTag': u'START311_V2::All', u'Campaign': u'Test_alex', u'RequestPriority': 1000, u'Group': u'DATAOPS', u'Memory': 2000, u'TimePerEvent': 40, u'FilterEfficiency': 0.0361, u'SplittingAlgo': u'EventBased', u'RequestType': u'MonteCarlo', u'ScramArch': u'slc5_amd64_gcc434', u'SizePerEvent': 512, u'ConfigCacheID': u'4029c9cd130f25d65bdced2311536c52', u'ConfigCacheUrl': u'http://137.138.229.40:5984/couchdb', u'RequestString': u'test_MC_Files', u'PrimaryDataset': u'BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen', u'EventsPerJob': 1400, u'RequestNumEvents': 2000, u'CMSSWVersion': u'CMSSW_4_1_8'}
...
INFO:root:Request: PUT /reqmgr/reqMgr/request
...
ERROR:root:Error occurred, exit.
{"exception": 400, "message": "Create request failed, ConfigCacheException", "type": "HTTPError"}
@alexanderrichards, Hi Alex, I think this error occurs because the config cache doesn't exist at the URL specified in the spec. Could you use the testbed URLs? You can send me your JSON file for the request and I can take a look as well.
I have attached the request json. In it I changed the config cache to point at the local couchdb instance. Is this the wrong thing to do?
{
  "createRequest": {
    "CMSSWVersion": "CMSSW_4_1_8",
    "GlobalTag": "START311_V2::All",
    "Campaign": "Test_alex",
    "RequestString": "alex_test",
    "RequestPriority": 1000,
    "FilterEfficiency": 0.0361,
    "ScramArch": "slc5_amd64_gcc434",
    "RequestType": "MonteCarlo",
    "RequestNumEvents": 2000,
    "ConfigCacheID": "4029c9cd130f25d65bdced2311536c52",
    "ConfigCacheUrl": "http://137.138.229.40:5984/couchdb",
    "PrimaryDataset": "BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen",
    "PrepID": "MCTEST-GEN-0001",
    "Group": "DATAOPS",
    "TotalTime": 14400,
    "TimePerEvent": 40,
    "Memory": 2000,
    "SizePerEvent": 512,
    "SplittingAlgo": "EventBased",
    "EventsPerJob": 1400
  },
  "changeSplitting": {
    "Production": {
      "SplittingAlgo": "EventBased",
      "events_per_job": 1400,
      "include_parents": "False"
    }
  },
  "assignRequest": {
    "SiteWhitelist": ["T1_US_FNAL"],
    "SiteBlacklist": [],
    "MergedLFNBase": "/store/backfill/1",
    "UnmergedLFNBase": "/store/unmerged",
    "MinMergeSize": 2147483648,
    "MaxMergeSize": 4294967296,
    "MaxMergeEvents": 50000,
    "AcquisitionEra": "AcquisitionEra-test",
    "ProcessingVersion": 1,
    "ProcessingString": "TestAlex",
    "maxRSS": 4294967296,
    "maxVSize": 4294967296,
    "SoftTimeout": 129600,
    "GracePeriod": 300,
    "dashboard": "mc",
    "Team": "cmsdataops",
    "CustodialSites": [],
    "NonCustodialSites": [],
    "AutoApproveSubscriptionSites": [],
    "SubscriptionPriority": "Low",
    "CustodialSubType": "Move",
    "BlockCloseMaxWaitTime": 66400,
    "BlockCloseMaxFiles": 500,
    "BlockCloseMaxEvents": 25000000,
    "BlockCloseMaxSize": 5000000000000
  }
}
@alexanderrichards, it is not the wrong thing to do. You just need another step to generate the config cache, which is rather complicated. You can use the original config cache URL from the example, "https://cmsweb-testbed.cern.ch/couchdb", which gets the config cache from the testbed. The original id should then match.
I changed the config cache url to cmsweb-testbed as you suggest but I still get the error:
{"exception": 400, "message": "Create request failed, ConfigCacheException", "type": "HTTPError"}
What do I need to change the id to?
@alexanderrichards, it seems that the id exists in cmsweb-testbed: https://cmsweb-testbed.cern.ch/couchdb/_utils/document.html?reqmgr_config_cache/4029c9cd130f25d65bdced2311536c52 I'm not sure why you still get the error. Could you check the reqmgr log? It is under /data/srv/wmagent/current/install/reqmgr/reqmgr/reqmgr.log
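A rough way to check from the agent host whether the config cache document is reachable is sketched below. It assumes the document is served at the usual CouchDB path `<couch url>/<database>/<document id>`, with the database name `reqmgr_config_cache` taken from the link above. The helper names `config_doc_url` and `config_doc_exists` are made up for this illustration and are not part of WMCore, and whether the cmsweb frontend accepts a plain client-certificate request like this is an assumption.

```python
import ssl
import urllib.error
import urllib.request
from urllib.parse import quote


def config_doc_url(couch_url, db_name, doc_id):
    """Build the standard CouchDB document URL: <couch>/<db>/<doc id>."""
    return "%s/%s/%s" % (
        couch_url.rstrip("/"),
        quote(db_name, safe=""),
        quote(doc_id, safe=""),
    )


def config_doc_exists(couch_url, db_name, doc_id, cert_file, key_file):
    """Return True if the document can be fetched with the given cert/key.

    Hypothetical helper: a 404 is reported as False, any other HTTP
    error (for example 401 Unauthorized) is re-raised for inspection.
    """
    context = ssl.create_default_context()
    # The key file must be unencrypted, otherwise this prompts for a password.
    context.load_cert_chain(cert_file, key_file)
    url = config_doc_url(couch_url, db_name, doc_id)
    try:
        with urllib.request.urlopen(url, context=context) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


if __name__ == "__main__":
    print(config_doc_url(
        "https://cmsweb-testbed.cern.ch/couchdb",
        "reqmgr_config_cache",
        "4029c9cd130f25d65bdced2311536c52",
    ))
```

If the document URL is right but the fetch raises an HTTP 401, that points at the certificate or key rather than at the id.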
seems to be an authentication issue maybe? what do you make of the following log entry?
Creating a request for: '{'PrepID': 'MCTEST-GEN-0001', 'Requestor': 'fbloggs', 'ScramArch': 'slc5_amd64_gcc434', 'SizePerEvent': 512, 'ConfigCacheID': '4029c9cd130f25d65bdced2311536c52', 'Memory': 2000, 'Group': 'DATAOPS', 'RequestType': 'MonteCarlo', 'TimePerEvent': 40, 'PrimaryDataset': 'BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen', 'CouchDBName': 'wmagent_configcache', 'CMSSWVersion': 'CMSSW_4_1_8', 'RequestPriority': 1000, 'SplittingAlgo': 'EventBased', 'RequestString': 'test_MC_Files', 'CouchURL': 'http://alexwmagent3:5984', 'TotalTime': 14400, 'CouchWorkloadDBName': 'reqmgrdb', 'Campaign': 'Test_alex', 'GlobalTag': 'START311_V2::All', 'FilterEfficiency': 0.0361, 'ConfigCacheUrl': 'https://cmsweb-testbed.cern.ch/couchdb', 'EventsPerJob': 1400, 'RequestNumEvents': 2000}'
workloadDB: 'reqmgrdb' wmstatUrl: 'http://alexwmagent3:5984/wmstats'
...
makeRequest(): reqInputArgs: '{'PrepID': 'MCTEST-GEN-0001', 'Requestor': 'fbloggs', 'ScramArch': 'slc5_amd64_gcc434', 'SizePerEvent': 512, 'ConfigCacheID': '4029c9cd130f25d65bdced2311536c52', 'Memory': 2000, 'Group': 'DATAOPS', 'RequestType': 'MonteCarlo', 'TimePerEvent': 40, 'PrimaryDataset': 'BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen', 'CouchDBName': 'wmagent_configcache', 'CMSSWVersion': 'CMSSW_4_1_8', 'RequestPriority': 1000, 'SplittingAlgo': 'EventBased', 'RequestString': 'test_MC_Files', 'CouchURL': 'http://alexwmagent3:5984', 'TotalTime': 14400, 'CouchWorkloadDBName': 'reqmgrdb', 'Campaign': 'Test_alex', 'GlobalTag': 'START311_V2::All', 'FilterEfficiency': 0.0361, 'ConfigCacheUrl': 'https://cmsweb-testbed.cern.ch/couchdb', 'EventsPerJob': 1400, 'RequestNumEvents': 2000}'
Error connecting to couch: CouchUnauthorisedError - reason: Unauthorized, data: {} result: None
Traceback (most recent call last):
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Cache/WMConfigCache.py", line 56, in __init__
    self.createDatabase()
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Cache/WMConfigCache.py", line 89, in createDatabase
    database = self.couchdb.createDatabase(self.dbname)
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Database/CMSCouch.py", line 817, in createDatabase
    self.put("/%s" % urllib.quote_plus(dbname))
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Services/Requests.py", line 125, in put
    encode, decode, contentType)
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Database/CMSCouch.py", line 114, in makeRequest
    getattr(e, "reason", None), data)
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Database/CMSCouch.py", line 127, in checkForCouchError
    raise CouchUnauthorisedError(reason, data, result)
CouchUnauthorisedError: CouchUnauthorisedError - reason: Unauthorized, data: {} result: None
Yes, it seems to be an authentication problem. Could you check whether you can access https://cmsweb-testbed.cern.ch/couchdb/ with the same cert you are using for the agent? One thing to note is that the key you use for access shouldn't be encrypted.
no, it seems I cannot access that web page using the cert. What is the command to create a key that is unencrypted?
openssl pkcs12 -in pkcs12.pfx -nocerts -nodes -out my.key
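To confirm that a key file really is unencrypted, one quick check is to look at its PEM header. The sketch below relies on the two markers OpenSSL writes for encrypted PEM keys (`BEGIN ENCRYPTED PRIVATE KEY` for PKCS#8 and `Proc-Type: 4,ENCRYPTED` for the traditional format). The function name `pem_key_is_encrypted` is made up for this illustration.

```python
import sys


def pem_key_is_encrypted(pem_text):
    """Return True if the PEM text carries one of OpenSSL's encryption markers."""
    return (
        "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text  # PKCS#8 form
        or "Proc-Type: 4,ENCRYPTED" in pem_text              # traditional form
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_key.py my.key
    with open(sys.argv[1]) as handle:
        state = "encrypted" if pem_key_is_encrypted(handle.read()) else "not encrypted"
    print("%s is %s" % (sys.argv[1], state))
```

This only inspects the header text, so it says nothing about whether the key matches the certificate.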
I still have no luck running the tests
It seems that the timestamp is not part of the workqueue element:
https://github.com/dmwm/WMCore/pull/4977/files#diff-1ab10a7628256d3b161f4e66218ce294R379
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataStructs/WorkQueueElement.py#L13
We need to get the timestamp values from somewhere else or put the timestamp in the workqueue element.
ah ok, timestamp is a part of CouchWorkQueueElement though: https://github.com/alexanderrichards/WMCore/blob/01d1b4fe3ac92cb0bf403e03ee0b9ab9f206ce98/src/python/WMCore/WorkQueue/DataStructs/CouchWorkQueueElement.py
Also looking at it again it looks like it should read more like
# sort elements to get them in timestamp order
elements = sorted(elements, key=lambda element: element.timestamp)
Since CouchWorkQueueElement has the timestamp but WorkQueueElement does not, is it OK to assume that we will have only CouchWorkQueueElement objects, or should I check explicitly? Also, in the case of non-CouchWorkQueueElement objects, where do we get the timestamp from?
Cheers Alex
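One way to make the sort tolerate elements that carry no timestamp is sketched below. It is only an illustration: the two stand-in classes are hypothetical and do not reproduce the real WorkQueueElement or CouchWorkQueueElement, and sorting timestamp-less elements last is an assumed policy, not what WMCore does.

```python
class ElementWithTimestamp:
    """Hypothetical stand-in for an element that carries a timestamp."""

    def __init__(self, name, timestamp):
        self.name = name
        self.timestamp = timestamp


class ElementWithoutTimestamp:
    """Hypothetical stand-in for an element that has no timestamp attribute."""

    def __init__(self, name):
        self.name = name


def timestamp_or_default(element, default=float("inf")):
    """Return the element's timestamp, or `default` if it is missing or None."""
    timestamp = getattr(element, "timestamp", None)
    return default if timestamp is None else timestamp


def sort_by_timestamp(elements):
    """Oldest first; elements without a timestamp keep their order at the end."""
    return sorted(elements, key=timestamp_or_default)


if __name__ == "__main__":
    mixed = [
        ElementWithTimestamp("b", 200),
        ElementWithoutTimestamp("x"),
        ElementWithTimestamp("a", 100),
    ]
    print([element.name for element in sort_by_timestamp(mixed)])
```

Because `sorted` is stable, elements that share a timestamp (or lack one) stay in their incoming order.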
@alexanderrichards, actually, I was able to fix that in #5015. I had to merge it since we need a new tag today. Thanks for looking into it.
@ticoann ok thanks
@alexanderrichards, actually, could you review the code? We still have some hours before making the tag. I did minimal testing in a VM, but it would be nice if you could look at it as well.
yup I'll have a look now.
I've commented on the merge diff, as I have a question. See: https://github.com/dmwm/WMCore/pull/5015/files
Thanks for the clarification of the problem. I think it looks good then.
Cheers Alex
First, modify how the work queue disperses chunks of work so that it works towards completing workflows.
When the local workqueue pulls work, take the time factor into account as well as priority. If two workflows have the same timestamp, we need a way to get the work from one workflow first rather than randomly getting work from different workflows.
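The ordering described above could be expressed as a composite sort key, roughly as follows. This is a sketch under stated assumptions: the `Element` tuple and its field names (`request_name`, `priority`, `timestamp`) are invented for the example, higher numbers are assumed to mean higher priority, and using the request name as the tie-breaker is just one possible way to keep a workflow's elements together.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a workqueue element.
Element = namedtuple("Element", ["request_name", "priority", "timestamp"])


def pull_order(elements):
    """Order elements for pulling work.

    1. Higher priority first (assumed: larger number = more urgent).
    2. Then older timestamp first.
    3. Then by request name, so elements of workflows with identical
       priority and timestamp are grouped instead of interleaved.
    """
    return sorted(
        elements,
        key=lambda e: (-e.priority, e.timestamp, e.request_name),
    )


if __name__ == "__main__":
    queue = [
        Element("wf_B", 1000, 100),
        Element("wf_A", 1000, 100),
        Element("wf_B", 1000, 100),
        Element("wf_C", 2000, 300),
        Element("wf_A", 1000, 100),
    ]
    for element in pull_order(queue):
        print(element.request_name, element.priority, element.timestamp)
```

With this key, `wf_C` is pulled first because of its higher priority, and the `wf_A` elements are all pulled before any `wf_B` element even though both workflows share the same priority and timestamp.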