dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Request assignment to > 1 team should fail #6917

Closed amaltaro closed 7 years ago

amaltaro commented 8 years ago

According to this constraint, requests assigned to more than one team fail to get acquired by GQ: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueReqMgrInterface.py#L238

Instead, we should reject such requests at assignment time. Here is one example that was causing trouble in GQ: https://cmsweb.cern.ch/reqmgr2/data/request?name=mewu_RVCMSSW_8_0_10PhotonJets_Pt_10_13_160602_114021_8987
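A minimal sketch of what failing at assignment time could look like. The helper name `validate_team_assignment` is hypothetical and is not taken from WMCore. The `Team` key comes from the request dictionaries in this thread, but accepting either a string or a list for it is an assumption.

```python
def validate_team_assignment(assign_args):
    """Return the single team name, or raise if the request names 0 or >1 teams.

    Hypothetical helper, not WMCore code. Assumes 'Team' may arrive either as
    a plain string or as a list of strings.
    """
    team = assign_args.get("Team")
    # Normalise to a list so both input shapes go through the same check.
    teams = [team] if isinstance(team, str) else list(team or [])
    # Ignore empty entries such as the default 'Team': ''.
    teams = [t for t in teams if t]
    # Duplicates are deliberately not collapsed: the same team listed twice
    # still counts as two team names.
    if len(teams) != 1:
        raise ValueError(
            "Request must be assigned to exactly one team, got: %r" % (teams,)
        )
    return teams[0]
```

With this check, `{'Team': 'relval'}` would pass, while an empty team or a list with two entries would be rejected before the request ever reaches GQ.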

amaltaro commented 8 years ago

BTW, it seems there is no state transition verification in reqmgr2. I moved a reqmgr request to closed-out using the reqmgr2 API (/reqmgr2/data/request/). Then I changed its status to closed-out again, this time using the reqmgr API (/reqmgr/reqMgr/closeout).
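For illustration, a state transition check could be as small as the sketch below. The `ALLOWED_TRANSITIONS` table and the `check_transition` function are hypothetical. The table lists only the three statuses mentioned in this thread and is not the real reqmgr2 state machine.

```python
# Illustrative only: a tiny transition table built from statuses seen in this
# issue. The real set of request states and allowed moves is an assumption
# left out here on purpose.
ALLOWED_TRANSITIONS = {
    "new": {"assigned"},
    "assigned": {"closed-out"},
    "closed-out": set(),
}


def check_transition(current_status, new_status):
    """Return new_status if the move is allowed, otherwise raise ValueError."""
    allowed = ALLOWED_TRANSITIONS.get(current_status, set())
    if new_status not in allowed:
        raise ValueError(
            "Invalid status transition: %s -> %s" % (current_status, new_status)
        )
    return new_status
```

Under this table, repeating the closed-out to closed-out change described above would raise instead of being silently accepted.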

amaltaro commented 8 years ago

I found in the workqueue logs another request assigned to two team names (relval twice). The dictionary used to create the workflow was as follows (taken from the reqmgr2 logs, so data sanitization and defaults have already been applied on top of the original request):

{'AcquisitionEra': 'FAKE',
 'AllowOpportunistic': False,
 'AutoApproveSubscriptionSites': [],
 'BlockBlacklist': [],
 'BlockCloseMaxEvents': 25000000,
 'BlockCloseMaxFiles': 500,
 'BlockCloseMaxSize': 5000000000000,
 'BlockCloseMaxWaitTime': 66400,
 'BlockWhitelist': [],
 'CMSSWVersion': 'CMSSW_8_1_0_pre7',
 'Campaign': 'CMSSW_8_1_0_pre7',
 'ConfigCacheID': None,
 'ConfigCacheURL': 'https://cmsweb.cern.ch/couchdb',
 'ConfigCacheUrl': None,
 'CouchDBName': 'reqmgr_config_cache',
 'CouchURL': 'https://cmsweb.cern.ch/couchdb',
 'CouchWorkloadDBName': 'reqmgr_workload_cache',
 'CustodialGroup': 'DataOps',
 'CustodialSites': [],
 'CustodialSubType': 'Replica',
 'DQMConfigCacheID': '85ff0a90d773227202e94ffef666c055',
 'DQMHarvestUnit': 'byRun',
 'DQMSequences': [],
 'DQMUploadProxy': None,
 'DQMUploadUrl': 'https://cmsweb.cern.ch/dqm/relval',
 'Dashboard': '',
 'DashboardHost': 'cms-wmagent-job.cern.ch',
 'DashboardPort': 8884,
 'DbsUrl': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',
 'DeleteFromSource': False,
 'EnableHarvesting': 'True',
 'EnableNewStageout': False,
 'FirstEvent': 1,
 'FirstLumi': 1,
 'GlobalTag': '81X_dataRun2_relval_v0',
 'GlobalTagConnect': None,
 'GracePeriod': 300,
 'Group': 'ppd',
 'IgnoredOutputModules': [],
 'IncludeParents': False,
 'InitialPriority': 500000,
 'LumiList': {},
 'MaxMergeEvents': 100000,
 'MaxMergeSize': 4294967296,
 'MaxRSS': 2411724,
 'MaxVSize': 20411724,
 'MaxWaitTime': 86400,
 'Memory': 3000,
 'MergedLFNBase': '/store/data',
 'MinMergeSize': 2147483648,
 'Multicore': 1,
 'NonCustodialGroup': 'DataOps',
 'NonCustodialSites': [],
 'NonCustodialSubType': 'Replica',
 'OutputDatasets': ['/ZeroBias/CMSSW_8_1_0_pre7-80X_dataRun2_HLT_relval_v11_RelVal_zb2015D-v1/FEVTDEBUGHLT',
                    '/ZeroBias/CMSSW_8_1_0_pre7-TkAlMinBias-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/ALCARECO',
                    '/ZeroBias/CMSSW_8_1_0_pre7-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/MINIAOD',
                    '/ZeroBias/CMSSW_8_1_0_pre7-EcalESAlign-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/ALCARECO',
                    '/ZeroBias/CMSSW_8_1_0_pre7-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/RECO',
                    '/ZeroBias/CMSSW_8_1_0_pre7-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/DQMIO',
                    '/ZeroBias/CMSSW_8_1_0_pre7-SiStripCalMinBias-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/ALCARECO',
                    '/ZeroBias/CMSSW_8_1_0_pre7-SiStripCalZeroBias-81X_dataRun2_relval_v0_RelVal_zb2015D-v1/ALCARECO'],
 'OverrideCatalog': None,
 'PeriodicHarvestInterval': 0,
 'PrepID': None,
 'ProcessingString': '',
 'ProcessingVersion': 1,
 'ReqMgr2Only': True,
 'RequestDate': [2016, 6, 15, 0, 52, 1],
 'RequestName': 'prebello_RVCMSSW_8_1_0_pre7RunZeroBias2015D__RelVal_zb2015D_160615_025201_1801',
 'RequestPriority': 500000,
 'RequestStatus': 'new',
 'RequestString': 'RVCMSSW_8_1_0_pre7RunZeroBias2015D__RelVal_zb2015D',
 'RequestTransition': [{'DN': u'/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=prebello/CN=672856/CN=Patricia Rebello Teles',
                        'Status': 'new',
                        'UpdateTime': 1465951921}],
 'RequestType': 'TaskChain',
 'RequestWorkflow': 'https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/prebello_RVCMSSW_8_1_0_pre7RunZeroBias2015D__RelVal_zb2015D_160615_025201_1801/spec',
 'Requestor': 'prebello',
 'RequestorDN': u'/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=prebello/CN=672856/CN=Patricia Rebello Teles',
 'RunNumber': 0,
 'ScramArch': 'slc6_amd64_gcc530',
 'SiteBlacklist': [],
 'SiteWhitelist': [],
 'SizePerEvent': 1234,
 'SoftTimeout': 129600,
 'SoftwareVersions': ['CMSSW_8_1_0_pre7'],
 'SubRequestType': 'RelVal',
 'SubscriptionPriority': 'Low',
 'Task1': {'AcquisitionEra': 'CMSSW_8_1_0_pre7',
           'ConfigCacheID': '85ff0a90d773227202e94ffef66561bb',
           'GlobalTag': '80X_dataRun2_HLT_relval_v11',
           'InputDataset': '/ZeroBias/Run2015D-v1/RAW',
           'KeepOutput': True,
           'LumiList': {'256677': [[1, 291],
                                   [293, 390],
                                   [392, 397],
                                   [400, 455],
                                   [457, 482]]},
           'LumisPerJob': 1,
           'Memory': 7500,
           'Multicore': 4,
           'ProcessingString': '80X_dataRun2_HLT_relval_v11_RelVal_zb2015D',
           'SplittingAlgo': 'LumiBased',
           'TaskName': 'HLTDR2_25ns'},
 'Task2': {'AcquisitionEra': 'CMSSW_8_1_0_pre7',
           'ConfigCacheID': '85ff0a90d773227202e94ffef666b5e9',
           'GlobalTag': '81X_dataRun2_relval_v0',
           'InputFromOutputModule': 'FEVTDEBUGHLToutput',
           'InputTask': 'HLTDR2_25ns',
           'KeepOutput': True,
           'LumisPerJob': 5,
           'Memory': 7500,
           'Multicore': 4,
           'ProcessingString': '81X_dataRun2_relval_v0_RelVal_zb2015D',
           'SplittingAlgo': 'LumiBased',
           'TaskName': 'RECODR2_25nsreHLT'},
 'TaskChain': 2,
 'Team': '',
 'TimePerEvent': 0.1,
 'TrustPUSitelists': False,
 'TrustSitelists': False,
 'UnmergedLFNBase': '/store/unmerged',
 'ValidStatus': 'PRODUCTION',
 'VoGroup': 'unknown',
 'VoRole': 'unknown',
 'dashboardActivity': 'relval',
 'mergedLFNBase': '/store/relval',
 'unmergedLFNBase': '/store/unmerged'}

and from Andrew's logging, these are the parameters PUT during assignment:

{'AcquisitionEra': u'CMSSW_8_1_0_pre7',
 'AutoApproveSubscriptionSites': [],
 'BlockCloseMaxEvents': 2000000,
 'BlockCloseMaxWaitTime': 28800,
 'CustodialSites': [],
 'CustodialSubType': 'Replica',
 'Dashboard': 'relval',
 'GracePeriod': 300,
 'MaxMergeEvents': 50000,
 'MaxMergeSize': 4294967296,
 'MaxRSS': {u'HLTDR2_25ns': 7680000, u'RECODR2_25nsreHLT': 7680000},
 'MaxVSize': 4394967000,
 'MergedLFNBase': '/store/relval',
 'MinMergeSize': 2147483648,
 'NonCustodialSites': [],
 'NonCustodialSubType': 'Replica',
 'ProcessingString': {u'HLTDR2_25ns': u'80X_dataRun2_HLT_relval_v11_RelVal_zb2015D',
                      u'RECODR2_25nsreHLT': u'81X_dataRun2_relval_v0_RelVal_zb2015D'},
 'ProcessingVersion': 1,
 'RequestName': 'prebello_RVCMSSW_8_1_0_pre7RunZeroBias2015D__RelVal_zb2015D_160615_025201_1801',
 'RequestStatus': 'assigned',
 'SiteBlacklist': [],
 'SiteWhitelist': 'T1_US_FNAL',
 'SoftTimeout': 129600,
 'Team': 'relval',
 'Teamrelval': 'checked',
 'TrustSitelists': True,
 'UnmergedLFNBase': '/store/unmerged',
 'action': 'Assign',
 'checkboxprebello_RVCMSSW_8_1_0_pre7RunZeroBias2015D__RelVal_zb2015D_160615_025201_1801': 'checked',
 'maxVSize': 4394967000}

These 3 parameters are used for reqmgr web assignment:

 'Teamrelval': 'checked',
 'action': 'Assign',
 'checkboxprebello_RVCMSSW_8_1_0_pre7RunZeroBias2015D__RelVal_zb2015D_160615_025201_1801': 'checked',

and this one was deprecated about a year ago (the capital-M spelling, MaxVSize, is the correct one):

 'maxVSize': 4394967000
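If one wanted to drop these keys before validating the assignment, a sketch could look like this. `strip_web_form_keys` and `DEPRECATED_KEYS` are hypothetical names. The key patterns ('action', 'Team' + team name, 'checkbox' + request name) are inferred from the single example above and are an assumption about the web form in general.

```python
# Assumption: the only deprecated key of interest is the one named above.
DEPRECATED_KEYS = {"maxVSize"}


def strip_web_form_keys(assign_args):
    """Return a copy of assign_args without web-form and deprecated keys.

    Hypothetical helper, not WMCore code. The input dict is left unmodified.
    """
    drop = {"action"} | DEPRECATED_KEYS
    request_name = assign_args.get("RequestName")
    if request_name:
        drop.add("checkbox" + request_name)
    team = assign_args.get("Team")
    teams = [team] if isinstance(team, str) else list(team or [])
    # Skip empty team names so the real 'Team' key is never dropped by accident.
    drop.update("Team" + t for t in teams if t)
    return {k: v for k, v in assign_args.items() if k not in drop}
```

Applied to the PUT parameters shown earlier, this would remove 'Teamrelval', 'action', the 'checkbox...' entry and 'maxVSize', and leave 'Team' and 'MaxVSize' untouched.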

amaltaro commented 8 years ago

@AndrewLevin FYI

amaltaro commented 7 years ago

Fixed by https://github.com/dmwm/WMCore/pull/7353