Closed (amaltaro closed this issue 10 years ago)
Not only for LheInputFiles=true, but anytime things are run "from scratch"
@ticoann also done along with issue #4871 please see: https://github.com/alexanderrichards/WMCore/compare/LheInputFiles
Hi @ticoann @alexanderrichards , I finished my test and it seems this feature is still not ready for TaskChain.
Some details: I patched the vocms142 agent (connected to testbed) and ran this workflow: amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140228_165943_4574
If one looks at Task1, the lumi size should be 300 events, as in [1]. I also changed EventsPerJob to 30k, so we ended up with 34 jobs and should get ~3333 lumis. Looking at DAS, however, there are only 34 lumi sections in the output dataset, as one can see here [2].
Once we get EventsPerLumi working properly, we could ask @vlimant /PPD/PdmV folks to validate the output datasets (especially the LHEInputFiles and its tricky skip events).
Seangchan, one more question, do you think this problem is in WMAgent only or it may be affecting ReqMgr as well?
Thanks, Alan.
[1] amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140228_165943_4574.request.schema.Task1 = {'KeepOutput': True, 'GlobalTag': 'START70_V4::All', 'SplittingAlgo': 'EventBased', 'ProcessingString': 'START70_V4', 'Seeding': 'AutomaticSeeding', 'ConfigCacheID': 'e5c59b6e699fa20fb0d4acdb9692198a', 'EventsPerLumi': 300, 'LheInputFiles': 'True', 'TaskName': 'QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'AcquisitionEra': 'CMSSW_7_0_0_pre11', 'PrimaryDataset': 'RelValQCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'EventsPerJob': 20000, 'RequestNumEvents': 1000000}
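As a rough cross-check of the numbers quoted in this comment, the expected job and lumi counts can be computed from the request parameters. This is only a sketch: it assumes plain event-based splitting in which every job except the last gets exactly EventsPerJob events, and the helper name `expected_counts` is made up for illustration (it is not a WMCore function).

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating point rounding."""
    return (numerator + denominator - 1) // denominator


def expected_counts(request_num_events, events_per_job, events_per_lumi):
    """Return (jobs, lumis) expected for an event-based production request.

    Hypothetical helper: assumes each lumi holds at most events_per_lumi
    events and each job holds at most events_per_job events.
    """
    jobs = ceil_div(request_num_events, events_per_job)
    lumis = ceil_div(request_num_events, events_per_lumi)
    return jobs, lumis


# Values from the test above: RequestNumEvents=1000000, EventsPerJob
# changed to 30k, EventsPerLumi=300.
jobs, lumis = expected_counts(1000000, 30000, 300)
print(jobs, lumis)  # 34 3334
```

With one lumi per job, as observed in DAS, the output has 34 lumis instead of the roughly 3333 expected.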
@ticoann I'm not sure what to do with this now. I'll let you comment
I usually get these specs with the resubmit script; here is the one for the workflow above:
{'Group': 'DATAOPS', 'Requestor': 'amaltaro', 'ScramArch': 'slc5_amd64_gcc481', 'SizePerEvent': 1234, 'Memory': 2400, 'Task1': {'KeepOutput': True, 'GlobalTag': 'START70_V4::All', 'SplittingAlgo': 'EventBased', 'ProcessingString': 'START70_V4', 'Seeding': 'AutomaticSeeding', 'ConfigCacheID': 'e5c59b6e699fa20fb0d4acdb9692198a', 'EventsPerLumi': 300, 'LheInputFiles': 'True', 'TaskName': 'QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'AcquisitionEra': 'CMSSW_7_0_0_pre11', 'PrimaryDataset': 'RelValQCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'EventsPerJob': 20000, 'RequestNumEvents': 1000000}, 'Task2': {'KeepOutput': True, 'GlobalTag': 'START70_V4::All', 'InputFromOutputModule': 'RAWSIMoutput', 'ProcessingString': 'START70_V4', 'SplittingAlgo': 'LumiBased', 'InputTask': 'QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'ConfigCacheID': 'e5c59b6e699fa20fb0d4acdb96922b7e', 'LumisPerJob': 10, 'TaskName': 'HARVGEN', 'AcquisitionEra': 'CMSSW_7_0_0_pre11'}, 'RequestType': 'TaskChain', 'timeStamp': 1393603185, 'TimePerEvent': 20, 'dashboardActivity': 'integration', 'ConfigCacheURL': 'https://cmsweb-testbed.cern.ch/couchdb', 'CouchDBName': 'reqmgr_config_cache', 'CMSSWVersion': 'CMSSW_7_0_0_pre11', 'unmergedLFNBase': '/store/unmerged', 'CouchWorkloadDBName': 'reqmgr_workload_cache', 'RequestPriority': 2000, 'mergedLFNBase': '/store/relval', 'ProcessingVersion': 4, 'RequestName': 'amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140228_165943_4574', 'RequestString': 'RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'CouchURL': 'https://cmsweb-testbed.cern.ch/couchdb', 'RequestorDN': '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues', 'GlobalTag': 'START70_V4::All', 'Campaign': 'HG1403_Validation', 'DbsUrl': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'RequestDate': [2014, 2, 28, 15, 59, 43], 'TaskChain': 2}
Actually, the parameter is set at the workflow level, not at the task level (which is probably not desired). Could you run the workflow with 'EventsPerLumi': 300 and 'LheInputFiles': 'True' set at the workflow level in the JSON? """ {'Group': 'DATAOPS', 'Requestor': 'amaltaro', 'ScramArch': 'slc5_amd64_gcc481' , 'EventsPerLumi': 300, 'LheInputFiles': 'True', 'Task1': { ....}} """
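To illustrate the suggestion, here is a minimal sketch of copying the two parameters from Task1 to the top level of a request dictionary before submission. The function name `promote_to_workflow_level` is hypothetical, the request is assumed to be a plain Python dict like the one dumped above, and whether the task-level copies should also be removed is not stated in the thread, so they are left in place here.

```python
import copy


def promote_to_workflow_level(request, task_key="Task1",
                              keys=("EventsPerLumi", "LheInputFiles")):
    """Return a copy of the request with the given task keys also set
    at the workflow (top) level.

    Hypothetical helper for illustration; not part of WMCore or ReqMgr.
    """
    promoted = copy.deepcopy(request)
    task = promoted.get(task_key, {})
    for key in keys:
        if key in task:
            promoted[key] = task[key]
    return promoted


# Trimmed-down request, using values from the schema above.
request = {
    "RequestType": "TaskChain",
    "TaskChain": 2,
    "Task1": {
        "SplittingAlgo": "EventBased",
        "EventsPerLumi": 300,
        "LheInputFiles": "True",
        "EventsPerJob": 20000,
    },
}

patched = promote_to_workflow_level(request)
print(patched["EventsPerLumi"], patched["LheInputFiles"])
```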
Seangchan, one more question, do you think this problem is in WMAgent only or it may be affecting ReqMgr as well?
ReqMgr needs to be updated. A patch has been created: #5009. I will create the new tag with other fixes. Thank you very much for testing.
Just in case: while we had HG1403b in testbed (which means configurations at the workflow level and not at the task level), I ran another workflow and again we didn't get the lumis as set by EventsPerLumi. Anyway, I'll run more tests now with this fixed in HG1403c. Thanks
Thank you, Alan
@ticoann did the ReqMgr patch fix this? I'm not sure how to proceed otherwise
@alexanderrichards, it seems that didn't fix it. Alan will run another test with the new agent. I will monitor the problem, then. Thanks
It's still not working; I just tested it in cmssrv94 using WMA 0.9.94c. The agent/job is creating one lumi per production job. See this workflow if needed: amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola_140320_112950_7560
which produced this dataset (it's in int/global dbs3 namespace): /RelValQCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola/CMSSW_7_0_0_pre11-START70_V4_TEST_Agent0994c_Validation_LHE_RelVal-v1/GEN
Please investigate. AFAIK it's getting more important every day. Thanks, Alan.
Could this https://github.com/vlimant/WMCore/commit/5f34a292fb3c5cc4d391a2d567f3102211026e91 be a lead to a solution ?
Hi Jean-Roch, I have a workflow running and I just connected to one job, unfortunately I do not see that job creating a new Lumi every 300 events (as set in the request).
Hi Alan,
if your test workflow is
https://cmsweb-testbed.cern.ch/reqmgr/view/showWorkload?requestName=amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468
that:
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.section('start')
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.start.SliceSize = 30000
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.start.policyName = 'MonteCarlo'
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.start.SubSliceSize = 30000
tells me that things are bad already here:
as it does not seem to be finding events_per_lumi inside taskConf['SplittingArguments'], which tells me that modifyTaskConfiguration has not put it there.
Are you 100% sure you have applied my changes to the patched version you are using ?
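The check described in this comment can be sketched as follows. It assumes `taskConf` is a plain dict whose 'SplittingArguments' entry is itself a dict, as the comment suggests; the helper name `has_events_per_lumi` is hypothetical and nothing here calls the real modifyTaskConfiguration.

```python
def has_events_per_lumi(task_conf):
    """Return True if events_per_lumi is present in the task's
    splitting arguments.

    Hypothetical helper; assumes task_conf is a plain dict.
    """
    splitting_args = task_conf.get("SplittingArguments") or {}
    return "events_per_lumi" in splitting_args


# Two made-up task configurations: one where the value was propagated
# and one where it was not.
propagated = {"SplittingArguments": {"events_per_job": 30000,
                                     "events_per_lumi": 300}}
missing = {"SplittingArguments": {"events_per_job": 30000}}

print(has_events_per_lumi(propagated), has_events_per_lumi(missing))
```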
Replying here as well. The patch is applied, double checked this morning. However we/Seangchan still think these changes are also required in the ReqMgr code
Hi @amaltaro, do you mean that you have made the change only on the agent side and not on the testbed request manager side? If so, I do share that opinion. Can this be done so that we can test it?
Yes, it can be done but not right away. We can push a new testbed deployment in the next week in case we have ReqMgr material.
Hi @amaltaro , I also have a suggestion for the lheinputfile
https://github.com/vlimant/WMCore/commit/2f976480c6ce495690f0b34f17713adf078b95b7
which you should include and test at the earliest convenience. I wish we could push this in asap.
It looks like it's working now with ReqMgr 0.9.95pre3 in testbed and WMA 0.9.95pre1 + patches.
I ran this wf: amaltaro_RVCMSSW_7_0_0QCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola_140409_150453_540
which had lots of failures (probably due to a crappy CMSSW version; it's working fine for another wf in 700pre11...). But in the end we got 5 successful production jobs, each of which produced 100 lumis, so the output has 500 lumis, as can be seen here:
curl -ks -X GET --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/filesummaries?dataset=/RelValQCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola/CMSSW_7_0_0-START70_V6_TEST_HG1404_Validation_LHE_RelVal-v1/GEN"; echo
[{u'num_file': 1, u'num_lumi': 500, u'num_block': 1, u'num_event': 23090, u'file_size': 2619069214}]
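For repeated tests, the same check could be scripted. The sketch below parses the filesummaries output pasted above (a Python 2 style repr, which `ast.literal_eval` still accepts) and compares num_lumi with the expected total; the job count and lumis per job are the values reported in this comment, and the variable names are only illustrative.

```python
import ast

# Output of the DBS filesummaries call, copied from the comment above.
raw = ("[{u'num_file': 1, u'num_lumi': 500, u'num_block': 1, "
       "u'num_event': 23090, u'file_size': 2619069214}]")

# The u'' prefixes are valid Python 3 literals, so literal_eval can
# parse this without executing arbitrary code.
summary = ast.literal_eval(raw)[0]

successful_jobs = 5    # reported in the comment
lumis_per_job = 100    # reported in the comment
expected_lumis = successful_jobs * lumis_per_job

lumis_match = summary["num_lumi"] == expected_lumis
print(summary["num_lumi"], expected_lumis, lumis_match)  # 500 500 True
```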
Hi @ticoann, @lucacopa, WMA devs,
in addition to issue #4871, could you please also extend the EventsPerLumi capabilities to TaskChain requests, so that we can have longer-running jobs without badly affecting lumi and file sizes?
I believe we would only use this option along with LheInputFiles=true, so when the TaskChain workflow does not need to read LHE files, we don't need this EventsPerLumi capability (offline guys may correct me here).
Thanks, Alan.