dmwm / WMCore

Core workflow management components for CMS.

Retrieve work by workflow when the workflows have the same priority #4911

Closed ticoann closed 10 years ago

ticoann commented 10 years ago

First, modify how the work queue disperses chunks of work so that it works towards completing workflows.

When the local workqueue pulls work, take the time factor into account as well as the priority. If two workflows have the same timestamp, we need to find a way to get the work from one workflow first rather than randomly getting work from different workflows.
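The ordering described above can be sketched as a composite sort key. This is only an illustration under stated assumptions: `Element`, its field names, and `order_for_pull` are hypothetical and are not taken from the WMCore code. Priority is compared first (highest first), then the timestamp (oldest first), and the workflow name is used as a final tiebreaker so that elements of the same workflow stay together instead of being interleaved.

```python
from collections import namedtuple

# Hypothetical stand-in for a work queue element; the field names are assumptions.
Element = namedtuple("Element", ["workflow", "priority", "timestamp"])


def order_for_pull(elements):
    """Highest priority first, then oldest timestamp, then workflow name.

    The workflow name is only a tiebreaker: when priority and timestamp are
    equal, it groups the elements of one workflow together.
    """
    return sorted(elements, key=lambda e: (-e.priority, e.timestamp, e.workflow))


elements = [
    Element("wf_b", 1000, 100),
    Element("wf_a", 1000, 100),
    Element("wf_b", 1000, 100),
    Element("wf_c", 2000, 300),
    Element("wf_a", 1000, 100),
]

ordered = order_for_pull(elements)
workflow_order = [e.workflow for e in ordered]
print(workflow_order)
```

With equal priority and timestamp, all of `wf_a` is handed out before any of `wf_b`; the higher-priority `wf_c` still comes first.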

alexanderrichards commented 10 years ago

@ticoann does the following look reasonable?

https://github.com/alexanderrichards/WMCore/compare/timestamp_sorting

Cheers Alex

ticoann commented 10 years ago

@alexanderrichards, I think that is the right place, but I am not sure whether it works. Are you able to set up a test? I think this is one of those problems where the test might be much harder than fixing the code. The test would be submitting the same workflow (meaning with the same spec) and seeing whether the work gets picked up in turn. I am not sure which workflow settings will give a quick result. I can ask around.

alexanderrichards commented 10 years ago

I am again having difficulty following the instructions on:

https://github.com/dmwm/WMCore/wiki/All-in-one-test

I keep getting into problems like the following again:

Warning: Using storage engine MyISAM for table '...'

I had this issue before and couldn't fix it but I have followed your instructions on the above url.

Cheers Alex

ticoann commented 10 years ago

@alexanderrichards, as we discussed before, I wasn't able to reproduce the problem. Could you give me access to the VM you are installing on so I can try to deploy? Anyway, I can try to test this patch, but you should solve this problem.

alexanderrichards commented 10 years ago

yes sure, do I need your ssh key to give you access?

alexanderrichards commented 10 years ago

Annoyingly things are getting worse! Now when I wipe everything clean and start again I get problems at this stage.

./Deploy -r comp=comp -s sw -A slc5_amd64_gcc461 -t v0.9.82 $DEPLOY_DIR wmagent@0.9.82

I see the following error:

INFO: 20140131121719: starting deployment of: wmagent@0.9.82
INFO: deploying wmagent - variant: default, version: 0.9.82
INFO: bootstrapping comp software area in /data/srv/wmagent/v0.9.82/sw
ERROR: bootstrap failed
INFO: installation log can be found in /data/srv/wmagent/.deploy/20140131-121719-15044-sw.log
ERROR: installation failed with exit code 1

alexanderrichards commented 10 years ago

Looking at the log sheds no light on the situation:

++ mkdir -p /data/srv/wmagent/v0.9.82/sw
++ cd /data/srv/wmagent/v0.9.82/sw
++ curl -sO http://cmsrep.cern.ch/cmssw/comp/bootstrap.sh
++ sh -x ./bootstrap.sh -architecture slc5_amd64_gcc461 -path /data/srv/wmagent/v0.9.82/sw -repository comp setup
++ '[' 1 = 0 ']'
++ note 'ERROR: bootstrap failed'
++ echo 'ERROR: bootstrap failed'
ERROR: bootstrap failed

except to point out that the bootstrap failed

alexanderrichards commented 10 years ago

Running the bootstrap command again by hand:

sh -x ./bootstrap.sh -architecture slc5_amd64_gcc461 -path /data/srv/wmagent/v0.9.82/sw -repository comp setup

gives firstly the warning:

Warning, /data/srv/wmagent/v0.9.82/sw already set up. Do you want to reconfigure it? [ y / N ]

Which seems reasonable since I'm running it manually now, after I already tried before, so I say yes here and then I get:

+ sed -e 's|[$]RPM_INSTALL_PREFIX|/data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/inst|g'
+ sh -ex /tmp/arichard/tmp16704/scriptlets/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm.pre.sh
++ id -u
+ '[' X22855 = X0 ']'
+ perl ./myrpm2cpio /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm
+ cpio -id
can't pipe to gzip
cpio: premature end of archive
+ cleanup_and_exit 1 'Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm'
+ exitcode=1
+ exitmessage='Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm'
+ '[' 'XUnable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm' = X ']'
+ echo
+ echo Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm
Unable to unpack /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP/external+gcc+4.6.1-comp3-1-1.slc5_amd64_gcc461.rpm

I have no idea what the problem is here with the rpm, and I can't test it either, as the last thing the bootstrap does before exit 1 is

+ rm -rf /data/srv/wmagent/v0.9.82/sw/bootstraptmp/BOOTSTRAP

hence the rpm is gone now anyway!

alexanderrichards commented 10 years ago

hmmm seems like it might have been a server issue as when I retry the same bootstrap it now completes ok and I didn't change anything.

Cheers Alex

alexanderrichards commented 10 years ago

ok so not quite there. I tried again and again I got the unpacking error. I ran the bootstrap three times in succession and on the third time it just worked. This I guess is some server error. However, things still aren't quite right, as:

ls $manage

gives:

ls: /data/srv/wmagent/current/config/wmagent/manage: No such file or directory

ticoann commented 10 years ago

Hi Alex, I tried the script you sent me in a separate email on a new VM, and I didn't have any problem installing it. I am not sure what went wrong when you ran it. Instead of trying to debug the problem, could you try again on a fresh VM?

alexanderrichards commented 10 years ago

Yes indeed this works if I start with a clean VM. When I try to inject the request though I see the following:

INFO:root:Injecting a request for arguments (REST API): {u'TotalTime': 14400, u'PrepID': u'MCTEST-GEN-0001', u'GlobalTag': u'START311_V2::All', u'Campaign': u'Test_alex', u'RequestPriority': 1000, u'Group': u'DATAOPS', u'Memory': 2000, u'TimePerEvent': 40, u'FilterEfficiency': 0.0361, u'SplittingAlgo': u'EventBased', u'RequestType': u'MonteCarlo', u'ScramArch': u'slc5_amd64_gcc434', u'SizePerEvent': 512, u'ConfigCacheID': u'4029c9cd130f25d65bdced2311536c52', u'ConfigCacheUrl': u'http://137.138.229.40:5984/couchdb', u'RequestString': u'test_MC_Files', u'PrimaryDataset': u'BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen', u'EventsPerJob': 1400, u'RequestNumEvents': 2000, u'CMSSWVersion': u'CMSSW_4_1_8'}
...
INFO:root:Request: PUT /reqmgr/reqMgr/request
...
ERROR:root:Error occurred, exit.
{"exception": 400, "message": "Create request failed, ConfigCacheException", "type": "HTTPError"}

ticoann commented 10 years ago

@alexanderrichards, Hi Alex, I think this error occurs because the config cache doesn't exist at the URL specified in the spec. Could you use the testbed URLs? You can send me your JSON file for the request and I can take a look as well.

alexanderrichards commented 10 years ago

I have attached the request json. In it I changed the config cache to point at the local couchdb instance. Is this the wrong thing to do?

{
  "createRequest": {
    "CMSSWVersion": "CMSSW_4_1_8",
    "GlobalTag": "START311_V2::All",
    "Campaign": "Test_alex",
    "RequestString": "alex_test",
    "RequestPriority": 1000,
    "FilterEfficiency": 0.0361,
    "ScramArch": "slc5_amd64_gcc434",
    "RequestType": "MonteCarlo",
    "RequestNumEvents": 2000,
    "ConfigCacheID": "4029c9cd130f25d65bdced2311536c52",
    "ConfigCacheUrl": "http://137.138.229.40:5984/couchdb",
    "PrimaryDataset": "BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen",
    "PrepID": "MCTEST-GEN-0001",
    "Group": "DATAOPS",
    "TotalTime": 14400,
    "TimePerEvent": 40,
    "Memory": 2000,
    "SizePerEvent": 512,
    "SplittingAlgo": "EventBased",
    "EventsPerJob": 1400
  },
  "changeSplitting": {
    "Production": {
      "SplittingAlgo": "EventBased",
      "events_per_job": 1400,
      "include_parents": "False"
    }
  },
  "assignRequest": {
    "SiteWhitelist": ["T1_US_FNAL"],
    "SiteBlacklist": [],
    "MergedLFNBase": "/store/backfill/1",
    "UnmergedLFNBase": "/store/unmerged",
    "MinMergeSize": 2147483648,
    "MaxMergeSize": 4294967296,
    "MaxMergeEvents": 50000,
    "AcquisitionEra": "AcquisitionEra-test",
    "ProcessingVersion": 1,
    "ProcessingString": "TestAlex",
    "maxRSS": 4294967296,
    "maxVSize": 4294967296,
    "SoftTimeout": 129600,
    "GracePeriod": 300,
    "dashboard": "mc",
    "Team": "cmsdataops",
    "CustodialSites": [],
    "NonCustodialSites": [],
    "AutoApproveSubscriptionSites": [],
    "SubscriptionPriority": "Low",
    "CustodialSubType": "Move",
    "BlockCloseMaxWaitTime": 66400,
    "BlockCloseMaxFiles": 500,
    "BlockCloseMaxEvents": 25000000,
    "BlockCloseMaxSize": 5000000000000
  }
}

ticoann commented 10 years ago

@alexanderrichards, it is not the wrong thing to do. You just need another step to generate the config cache, which is a rather complicated thing to do. You can use the original config cache URL from the example, "https://cmsweb-testbed.cern.ch/couchdb", which gets the config cache from the testbed. The original ID should match.
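The suggested change to the request file can be sketched as a small edit of the `createRequest` section. This is a minimal illustration, not part of the WMCore tooling: the helper name `use_testbed_config_cache` is hypothetical, and the `request` dict below holds only the two relevant keys from the JSON posted above.

```python
import json

# Config cache URL quoted in the comment above.
TESTBED_URL = "https://cmsweb-testbed.cern.ch/couchdb"


def use_testbed_config_cache(request):
    """Return a copy of the request with ConfigCacheUrl pointing at the testbed.

    Hypothetical helper: the ConfigCacheID is left untouched, since the
    original ID is expected to exist in the testbed config cache.
    """
    updated = json.loads(json.dumps(request))  # cheap deep copy of JSON data
    updated["createRequest"]["ConfigCacheUrl"] = TESTBED_URL
    return updated


# Only the two relevant keys of the request are reproduced here.
request = {
    "createRequest": {
        "ConfigCacheID": "4029c9cd130f25d65bdced2311536c52",
        "ConfigCacheUrl": "http://137.138.229.40:5984/couchdb",
    }
}

patched = use_testbed_config_cache(request)
print(json.dumps(patched, indent=2, sort_keys=True))
```

The original dict is not modified, so the local-CouchDB variant stays available for comparison.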

alexanderrichards commented 10 years ago

I changed the config cache url to cmsweb-testbed as you suggest but I still get the error:

{"exception": 400, "message": "Create request failed, ConfigCacheException", "type": "HTTPError"}

What do I need to change the id to?

ticoann commented 10 years ago

@alexanderrichards, it seems that the ID exists in cmsweb-testbed: https://cmsweb-testbed.cern.ch/couchdb/_utils/document.html?reqmgr_config_cache/4029c9cd130f25d65bdced2311536c52 Not sure why you still get the error. Could you check the reqmgr log under /data/srv/wmagent/current/install/reqmgr/reqmgr/reqmgr.log?

alexanderrichards commented 10 years ago

seems to be an authentication issue maybe? what do you make of the following log entry?

Creating a request for: '{'PrepID': 'MCTEST-GEN-0001', 'Requestor': 'fbloggs', 'ScramArch': 'slc5_amd64_gcc434', 'SizePerEvent': 512, 'ConfigCacheID': '4029c9cd130f25d65bdced2311536c52', 'Memory': 2000, 'Group': 'DATAOPS', 'RequestType': 'MonteCarlo', 'TimePerEvent': 40, 'PrimaryDataset': 'BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen', 'CouchDBName': 'wmagent_configcache', 'CMSSWVersion': 'CMSSW_4_1_8', 'RequestPriority': 1000, 'SplittingAlgo': 'EventBased', 'RequestString': 'test_MC_Files', 'CouchURL': 'http://alexwmagent3:5984', 'TotalTime': 14400, 'CouchWorkloadDBName': 'reqmgrdb', 'Campaign': 'Test_alex', 'GlobalTag': 'START311_V2::All', 'FilterEfficiency': 0.0361, 'ConfigCacheUrl': 'https://cmsweb-testbed.cern.ch/couchdb', 'EventsPerJob': 1400, 'RequestNumEvents': 2000}'
workloadDB: 'reqmgrdb'
wmstatUrl: 'http://alexwmagent3:5984/wmstats'
...
makeRequest(): reqInputArgs: '{'PrepID': 'MCTEST-GEN-0001', 'Requestor': 'fbloggs', 'ScramArch': 'slc5_amd64_gcc434', 'SizePerEvent': 512, 'ConfigCacheID': '4029c9cd130f25d65bdced2311536c52', 'Memory': 2000, 'Group': 'DATAOPS', 'RequestType': 'MonteCarlo', 'TimePerEvent': 40, 'PrimaryDataset': 'BdToMuMu_2MuPtFilter_7TeV-pythia6-evtgen', 'CouchDBName': 'wmagent_configcache', 'CMSSWVersion': 'CMSSW_4_1_8', 'RequestPriority': 1000, 'SplittingAlgo': 'EventBased', 'RequestString': 'test_MC_Files', 'CouchURL': 'http://alexwmagent3:5984', 'TotalTime': 14400, 'CouchWorkloadDBName': 'reqmgrdb', 'Campaign': 'Test_alex', 'GlobalTag': 'START311_V2::All', 'FilterEfficiency': 0.0361, 'ConfigCacheUrl': 'https://cmsweb-testbed.cern.ch/couchdb', 'EventsPerJob': 1400, 'RequestNumEvents': 2000}'
Error connecting to couch: CouchUnauthorisedError - reason: Unauthorized, data: {} result: None
Traceback (most recent call last):
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Cache/WMConfigCache.py", line 56, in __init__
    self.createDatabase()
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Cache/WMConfigCache.py", line 89, in createDatabase
    database = self.couchdb.createDatabase(self.dbname)
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Database/CMSCouch.py", line 817, in createDatabase
    self.put("/%s" % urllib.quote_plus(dbname))
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Services/Requests.py", line 125, in put
    encode, decode, contentType)
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Database/CMSCouch.py", line 114, in makeRequest
    getattr(e, "reason", None), data)
  File "/data/srv/wmagent/v0.9.82/sw/slc5_amd64_gcc461/cms/wmagent/0.9.82/lib/python2.6/site-packages/WMCore/Database/CMSCouch.py", line 127, in checkForCouchError
    raise CouchUnauthorisedError(reason, data, result)
CouchUnauthorisedError: CouchUnauthorisedError - reason: Unauthorized, data: {} result: None

ticoann commented 10 years ago

Yes, it seems to be an authentication problem. Could you check whether you can access https://cmsweb-testbed.cern.ch/couchdb/ with the same cert you are using for the agent? One thing: the key you are using for access shouldn't be encrypted.

alexanderrichards commented 10 years ago

no seems I cannot access that web page using the cert. What is the command to create a key that is unencrypted?

ticoann commented 10 years ago

openssl pkcs12 -in pkcs12.pfx -nocerts -nodes -out my.key
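Whether a PEM key file is still passphrase-protected can usually be seen from its header lines. The following is a heuristic sketch only: the function name `pem_key_is_encrypted` is hypothetical, the sample texts are placeholders rather than real keys, and it checks just the two common markers (the PKCS#8 `ENCRYPTED PRIVATE KEY` label and the traditional OpenSSL `Proc-Type: 4,ENCRYPTED` header).

```python
def pem_key_is_encrypted(pem_text):
    """Return True if the PEM text looks like a passphrase-protected key.

    Heuristic only: it inspects header markers and does not parse the key.
    """
    return (
        "BEGIN ENCRYPTED PRIVATE KEY" in pem_text  # PKCS#8 encrypted form
        or "Proc-Type: 4,ENCRYPTED" in pem_text    # traditional OpenSSL form
    )


# Placeholder PEM texts; the bodies are dummies, not real key material.
plain_key = "-----BEGIN PRIVATE KEY-----\n(dummy)\n-----END PRIVATE KEY-----\n"
pkcs8_encrypted = (
    "-----BEGIN ENCRYPTED PRIVATE KEY-----\n(dummy)\n"
    "-----END ENCRYPTED PRIVATE KEY-----\n"
)
legacy_encrypted = (
    "-----BEGIN RSA PRIVATE KEY-----\n"
    "Proc-Type: 4,ENCRYPTED\n"
    "DEK-Info: DES-EDE3-CBC,0000000000000000\n\n(dummy)\n"
    "-----END RSA PRIVATE KEY-----\n"
)

for name, text in [("plain", plain_key),
                   ("pkcs8", pkcs8_encrypted),
                   ("legacy", legacy_encrypted)]:
    print(name, pem_key_is_encrypted(text))
```

To check a real file, read it as text and pass the contents to the function; a key written with `-nodes` as in the command above should come out as not encrypted.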

alexanderrichards commented 10 years ago

I still have no luck running tests

ticoann commented 10 years ago

It seems that timestamp is not part of the workqueue element. https://github.com/dmwm/WMCore/pull/4977/files#diff-1ab10a7628256d3b161f4e66218ce294R379 https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataStructs/WorkQueueElement.py#L13 We need to get the timestamp values from somewhere else or put the timestamp in the workqueue element.

alexanderrichards commented 10 years ago

ah ok, timestamp is a part of CouchWorkQueueElement though: https://github.com/alexanderrichards/WMCore/blob/01d1b4fe3ac92cb0bf403e03ee0b9ab9f206ce98/src/python/WMCore/WorkQueue/DataStructs/CouchWorkQueueElement.py

alexanderrichards commented 10 years ago

Also looking at it again it looks like it should read more like

# sort elements to get them in timestamp order
elements = sorted(elements, key=lambda element: element.timestamp)

Since CouchWorkQueue has the timestamp but WorkQueue does not, is it OK to assume that we will have just CouchWorkQueue objects, or should I explicitly check? Also, in the case of non-CouchWorkQueue objects, where do we get this timestamp from?

Cheers Alex
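One way to make the sort tolerate elements without a timestamp, instead of assuming every element carries one, is a `getattr` fallback in the sort key. This is only a sketch: `PlainElement`, `TimedElement`, and `sort_by_timestamp` are hypothetical stand-ins, not the WMCore classes, and sorting timestamp-less elements last is an assumption rather than the behaviour chosen in the eventual fix.

```python
class PlainElement(object):
    """Hypothetical element type that carries no timestamp."""

    def __init__(self, name):
        self.name = name


class TimedElement(PlainElement):
    """Hypothetical element type that carries a timestamp."""

    def __init__(self, name, timestamp):
        super(TimedElement, self).__init__(name)
        self.timestamp = timestamp


def sort_by_timestamp(elements, default=float("inf")):
    """Sort by timestamp; elements without one get `default` and sort last.

    sorted() is stable, so elements sharing a key keep their input order.
    """
    return sorted(elements,
                  key=lambda element: getattr(element, "timestamp", default))


elements = [
    PlainElement("no_ts_1"),
    TimedElement("late", 200),
    TimedElement("early", 100),
    PlainElement("no_ts_2"),
]

sorted_names = [e.name for e in sort_by_timestamp(elements)]
print(sorted_names)
```

Passing `default=0` instead would put the timestamp-less elements first; which is appropriate depends on where their timestamp should come from.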

ticoann commented 10 years ago

@alexanderrichards, actually, I was able to fix that in #5015. I had to merge it since we need a new tag today. Thanks for looking into it.

alexanderrichards commented 10 years ago

@ticoann ok thanks

ticoann commented 10 years ago

@alexanderrichards, actually, could you review the code? We still have some hours before making the tag. I did minimal testing in a VM, but it would be nice if you could look at it as well.

alexanderrichards commented 10 years ago

yup I'll have a look now.

alexanderrichards commented 10 years ago

I've commented on the merge diff, as I have a question; see: https://github.com/dmwm/WMCore/pull/5015/files

alexanderrichards commented 10 years ago

Thanks for the clarification of the problem. I think it looks good then.

Cheers Alex