dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

EventsPerLumi option for TaskChain requests #4872

Closed amaltaro closed 10 years ago

amaltaro commented 10 years ago

Hi @ticoann, @lucacopa, WMA devs,

in addition to issue #4871, could you please also extend the EventsPerLumi capability to TaskChain requests? That way we can have longer jobs running without badly affecting lumi and file sizes.

I believe we would only use this option together with LheInputFiles=true, so when the TaskChain workflow does not need to read LHE files, we don't need this EventsPerLumi capability (offline guys may correct me here).

Thanks, Alan.

vlimant commented 10 years ago

Not only for LheInputFiles=true, but any time things are run "from scratch".

alexanderrichards commented 10 years ago

@ticoann this is also done, along with issue #4871. Please see: https://github.com/alexanderrichards/WMCore/compare/LheInputFiles

amaltaro commented 10 years ago

Hi @ticoann @alexanderrichards , I finished my test and it seems this feature is still not ready for TaskChain.

Some details: I patched the vocms142 agent (connected to testbed) and ran this workflow: amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140228_165943_4574

If one looks at Task1, the lumi size should be 300 events, as in [1]. I also changed EventsPerJob to 30k, so we ended up with 34 jobs here and should get ~3333 lumis, but looking at DAS, there are only 34 lumi sections in the output dataset, as one can see in [2].

Once we get this EventsPerLumi working properly, we could ask @vlimant /PPD/PdmV folks to validate the output datasets (especially LheInputFiles and its skip-events trick).

Seangchan, one more question: do you think this problem is in WMAgent only, or might it be affecting ReqMgr as well?

Thanks, Alan.

[1] amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140228_165943_4574.request.schema.Task1 = {'KeepOutput': True, 'GlobalTag': 'START70_V4::All', 'SplittingAlgo': 'EventBased', 'ProcessingString': 'START70_V4', 'Seeding': 'AutomaticSeeding', 'ConfigCacheID': 'e5c59b6e699fa20fb0d4acdb9692198a', 'EventsPerLumi': 300, 'LheInputFiles': 'True', 'TaskName': 'QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'AcquisitionEra': 'CMSSW_7_0_0_pre11', 'PrimaryDataset': 'RelValQCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'EventsPerJob': 20000, 'RequestNumEvents': 1000000}

[2] https://cmsweb.cern.ch/das/request?view=list&limit=10&instance=global&input=summary+dataset%3D+%2FRelValQCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola%2FCMSSW_7_0_0_pre11-START70_V4_TEST_HG1403_Validation_LHE_RelVal-v1%2FGEN
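
The expected counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers from the request (RequestNumEvents of 1000000, EventsPerLumi of 300) and the 30k EventsPerJob mentioned in the text; the helper name `expected_counts` is made up for illustration and is not part of WMCore.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def expected_counts(total_events, events_per_job, events_per_lumi):
    """Return (jobs, lumis_per_full_job, total_lumis) for event-based splitting.

    Hypothetical helper for illustration only; not a WMCore function.
    Assumes every job starts a new lumi every `events_per_lumi` events.
    """
    jobs = ceil_div(total_events, events_per_job)
    lumis_per_full_job = ceil_div(events_per_job, events_per_lumi)
    total_lumis = ceil_div(total_events, events_per_lumi)
    return jobs, lumis_per_full_job, total_lumis


# Numbers taken from the thread: 1M events, 30k events per job, 300 events per lumi.
jobs, lumis_per_full_job, total_lumis = expected_counts(1_000_000, 30_000, 300)
print(jobs, lumis_per_full_job, total_lumis)
```

With these inputs the sketch gives 34 jobs and a few thousand lumis in total, whereas the 34 lumi sections seen in DAS correspond to one lumi per job.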

alexanderrichards commented 10 years ago

@ticoann I'm not sure what to do with this now. I'll let you comment

amaltaro commented 10 years ago

I usually get these specs with the resubmit script; here is the one for the workflow above:

{'Group': 'DATAOPS', 'Requestor': 'amaltaro', 'ScramArch': 'slc5_amd64_gcc481', 'SizePerEvent': 1234, 'Memory': 2400, 'Task1': {'KeepOutput': True, 'GlobalTag': 'START70_V4::All', 'SplittingAlgo': 'EventBased', 'ProcessingString': 'START70_V4', 'Seeding': 'AutomaticSeeding', 'ConfigCacheID': 'e5c59b6e699fa20fb0d4acdb9692198a', 'EventsPerLumi': 300, 'LheInputFiles': 'True', 'TaskName': 'QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'AcquisitionEra': 'CMSSW_7_0_0_pre11', 'PrimaryDataset': 'RelValQCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'EventsPerJob': 20000, 'RequestNumEvents': 1000000}, 'Task2': {'KeepOutput': True, 'GlobalTag': 'START70_V4::All', 'InputFromOutputModule': 'RAWSIMoutput', 'ProcessingString': 'START70_V4', 'SplittingAlgo': 'LumiBased', 'InputTask': 'QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'ConfigCacheID': 'e5c59b6e699fa20fb0d4acdb96922b7e', 'LumisPerJob': 10, 'TaskName': 'HARVGEN', 'AcquisitionEra': 'CMSSW_7_0_0_pre11'}, 'RequestType': 'TaskChain', 'timeStamp': 1393603185, 'TimePerEvent': 20, 'dashboardActivity': 'integration', 'ConfigCacheURL': 'https://cmsweb-testbed.cern.ch/couchdb', 'CouchDBName': 'reqmgr_config_cache', 'CMSSWVersion': 'CMSSW_7_0_0_pre11', 'unmergedLFNBase': '/store/unmerged', 'CouchWorkloadDBName': 'reqmgr_workload_cache', 'RequestPriority': 2000, 'mergedLFNBase': '/store/relval', 'ProcessingVersion': 4, 'RequestName': 'amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140228_165943_4574', 'RequestString': 'RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola', 'CouchURL': 'https://cmsweb-testbed.cern.ch/couchdb', 'RequestorDN': '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues', 'GlobalTag': 'START70_V4::All', 'Campaign': 'HG1403_Validation', 'DbsUrl': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'RequestDate': [2014, 2, 28, 15, 59, 43], 'TaskChain': 2}
ticoann commented 10 years ago

Actually, the parameter is set at the workflow level, not at the task level (which is probably not what is desired). Could you run the workflow with 'EventsPerLumi': 300, 'LheInputFiles': 'True' set at the workflow level in the JSON, like this:

{'Group': 'DATAOPS', 'Requestor': 'amaltaro', 'ScramArch': 'slc5_amd64_gcc481', 'EventsPerLumi': 300, 'LheInputFiles': 'True', 'Task1': { ....}}

> Seangchan, one more question, do you think this problem is in WMAgent only or it may be affecting ReqMgr as well?

ReqMgr needs to be updated; a patch has been created in #5009. I will create a new tag with other fixes. Thank you very much for testing.
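
As a rough illustration of the workaround described in this comment, the sketch below copies the two parameters from Task1 up to the top level of the request dictionary. The helper `hoist_to_workflow_level` is hypothetical (it is not part of WMCore or the resubmit script), and the request shown is an abbreviated version of the one posted earlier in the thread.

```python
import copy

# Keys that, per the comment above, are currently read at the workflow level.
WORKFLOW_LEVEL_KEYS = ("EventsPerLumi", "LheInputFiles")


def hoist_to_workflow_level(request, task_name="Task1", keys=WORKFLOW_LEVEL_KEYS):
    """Return a copy of `request` with `keys` copied from one task to the top level.

    Hypothetical helper for illustration only. The original request is not
    modified, and keys already present at the top level are left untouched.
    """
    updated = copy.deepcopy(request)
    task = updated.get(task_name, {})
    for key in keys:
        if key in task and key not in updated:
            updated[key] = task[key]
    return updated


# Abbreviated request; values come from the spec posted in the thread.
request = {
    "RequestType": "TaskChain",
    "TaskChain": 2,
    "Task1": {
        "SplittingAlgo": "EventBased",
        "EventsPerJob": 20000,
        "EventsPerLumi": 300,
        "LheInputFiles": "True",
    },
}

patched = hoist_to_workflow_level(request)
print(patched["EventsPerLumi"], patched["LheInputFiles"])
```

The task-level entries are kept in place, so the same spec would still work if the parameters are later read per task.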

amaltaro commented 10 years ago

Just in case: while we had HG1403b in testbed (which means the configuration was at the workflow level and not at the task level), I ran another workflow and again we didn't get the lumis as set by EventsPerLumi. Anyway, I'll run more tests now with this fixed in HG1403c. Thanks

ticoann commented 10 years ago

Thank you, Alan

alexanderrichards commented 10 years ago

@ticoann did the ReqMgr patch fix this? I'm not sure how to proceed otherwise

ticoann commented 10 years ago

@alexanderrichards, it seems that didn't fix it. Alan will run another test with the new agent, and I will monitor the problem in the meantime. Thanks

amaltaro commented 10 years ago

It's still not working; I just tested it on cmssrv94 using WMA 0.9.94c. The agent/job is creating one lumi per production job. See this workflow if needed: amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola_140320_112950_7560

which produced this dataset (it's in int/global dbs3 namespace): /RelValQCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola/CMSSW_7_0_0_pre11-START70_V4_TEST_Agent0994c_Validation_LHE_RelVal-v1/GEN

Please, investigate. AFAIK it's getting more important every day. Thanks, Alan.

vlimant commented 10 years ago

Could this https://github.com/vlimant/WMCore/commit/5f34a292fb3c5cc4d391a2d567f3102211026e91 be a lead to a solution ?

amaltaro commented 10 years ago

Hi Jean-Roch, I have a workflow running and I just connected to one job. Unfortunately, I do not see that job creating a new lumi every 300 events (as set in the request).

vlimant commented 10 years ago

Hi Alan, if your test workflow is
https://cmsweb-testbed.cern.ch/reqmgr/view/showWorkload?requestName=amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468

then this:

amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.section_('start')
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.start.SliceSize = 30000
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.start.policyName = 'MonteCarlo'
amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140325_194339_1468.policies.start.SubSliceSize = 30000

tells me that things are bad already here:

https://github.com/dmwm/WMCore/blob/db32e6962acdfa2cc04e7b3a1d92660bbc402447/src/python/WMCore/WMSpec/StdSpecs/TaskChain.py#L249

as it does not seem to be finding events_per_lumi inside taskConf['SplittingArguments'], which tells me that modifyTaskConfiguration has not put it there.

Are you 100% sure you have applied my changes to the patched version you are using ?
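
To make the check described above concrete, here is a minimal sketch of testing whether a task configuration carries events_per_lumi in its 'SplittingArguments'. It works on plain dictionaries only. The function `add_lumi_splitting_argument` is a hypothetical stand-in and is not the actual modifyTaskConfiguration code in TaskChain.py; the key names 'SplittingArguments' and events_per_lumi come from the comment, and everything else is an assumption.

```python
def has_events_per_lumi(task_conf):
    """True if the task's splitting arguments already contain events_per_lumi."""
    return "events_per_lumi" in task_conf.get("SplittingArguments", {})


def add_lumi_splitting_argument(task_conf):
    """Return a copy of `task_conf` with EventsPerLumi propagated to the splitting arguments.

    Hypothetical stand-in for illustration; NOT WMCore's modifyTaskConfiguration.
    """
    updated = dict(task_conf)
    splitting_args = dict(updated.get("SplittingArguments", {}))
    if "EventsPerLumi" in updated:
        splitting_args["events_per_lumi"] = updated["EventsPerLumi"]
    updated["SplittingArguments"] = splitting_args
    return updated


# Task configuration as requested: the request key is set, but nothing has
# been copied into the splitting arguments yet.
task_conf = {"SplittingAlgo": "EventBased", "EventsPerJob": 30000, "EventsPerLumi": 300}
print(has_events_per_lumi(task_conf))

fixed_conf = add_lumi_splitting_argument(task_conf)
print(has_events_per_lumi(fixed_conf), fixed_conf["SplittingArguments"])
```

If the first check is false for a real task configuration, the value set in the request never reached the splitting step, which matches the one-lumi-per-job symptom reported earlier.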

amaltaro commented 10 years ago

Replying here as well. The patch is applied; I double-checked this morning. However, we/Seangchan still think these changes are also required in the ReqMgr code.

vlimant commented 10 years ago

Hi @amaltaro, do you mean that you have changed only the agent side and not the testbed request manager side? If so, I indeed share that opinion. Can this be done so that we can test it?

amaltaro commented 10 years ago

Yes, it can be done but not right away. We can push a new testbed deployment in the next week in case we have ReqMgr material.

vlimant commented 10 years ago

Hi @amaltaro , I also have a suggestion for the lheinputfile

https://github.com/vlimant/WMCore/commit/2f976480c6ce495690f0b34f17713adf078b95b7

which you should include and test at the earliest convenience. I wish we could push this in asap.

amaltaro commented 10 years ago

It looks like it's working now with ReqMgr 0.9.95pre3 in testbed and WMA 0.9.95pre1 + patches.

I ran this wf: amaltaro_RVCMSSW_7_0_0QCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola_140409_150453_540

which had lots of failures (probably due to a crappy CMSSW version; it's working fine for another wf in 700pre11...). But in the end we got 5 successful production jobs, each of which produced 100 lumis, so the output has 500 lumis, as can be seen here:

curl -ks -X GET --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader/filesummaries?dataset=/RelValQCD_Ht-250To500_TuneZ2star_8TeV_madgraph-tauola/CMSSW_7_0_0-START70_V6_TEST_HG1404_Validation_LHE_RelVal-v1/GEN"; echo

[{u'num_file': 1, u'num_lumi': 500, u'num_block': 1, u'num_event': 23090, u'file_size': 2619069214}]
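
The same consistency check can be scripted. The sketch below parses the filesummaries output quoted above (pasted in as a string, since it is a Python-style repr rather than strict JSON) and compares num_lumi against the 5 successful jobs times 100 lumis each reported in this comment. It does not contact DBS, and the variable names are made up for illustration.

```python
import ast

# Output of the filesummaries query, copied verbatim from the comment above.
raw_summary = (
    "[{u'num_file': 1, u'num_lumi': 500, u'num_block': 1, "
    "u'num_event': 23090, u'file_size': 2619069214}]"
)

# The u'' prefixes are still valid string literals, so literal_eval can
# parse this safely without executing arbitrary code.
summary = ast.literal_eval(raw_summary)[0]

successful_jobs = 5   # from the comment above
lumis_per_job = 100   # from the comment above

expected_lumis = successful_jobs * lumis_per_job
lumis_match = summary["num_lumi"] == expected_lumis
events_per_lumi_avg = summary["num_event"] / summary["num_lumi"]

print("lumis match expectation:", lumis_match)
print("average events per lumi in the output:", events_per_lumi_avg)
```

The average number of events per lumi is printed only as an extra sanity figure; the thread does not state what value to expect for this dataset.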