dmwm / WMCore

Core workflow management components for CMS.

MonteCarlo request with empty WQE SiteWhitelist #7707

Closed amaltaro closed 7 years ago

amaltaro commented 7 years ago

As reported here: https://its.cern.ch/jira/browse/CMSCOMPPR-699

The workflow has a proper SiteWhitelist (and it was properly assigned), yet the GQEs have an empty SiteWhitelist.

For the record, this is the dict posted during assignment:

{u'AcquisitionEra': 'pPb502Winter16GS',
 u'AutoApproveSubscriptionSites': [u'T1_DE_KIT_Disk'],
 u'BlockCloseMaxEvents': 2000000,
 u'BlockCloseMaxWaitTime': 172800,
 u'CustodialSites': [],
 u'CustodialSubType': 'Replica',
 u'Dashboard': 'production',
 u'GracePeriod': 300,
 'HardTimeout': 159900,
 u'MaxMergeEvents': 200000,
 u'MaxMergeSize': 4294967296,
 u'MaxRSS': 2355200,
 u'MaxVSize': 4394967000,
 u'MergedLFNBase': '/store/himc',
 u'MinMergeSize': 2147483648,
 u'NonCustodialGroup': 'DataOps',
 u'NonCustodialSites': [u'T1_DE_KIT_Disk'],
 u'NonCustodialSubType': 'Replica',
 u'ProcessingString': 'MB_80X_mcRun2_pA_5TeV_v4',
 u'ProcessingVersion': 3,
 u'RequestStatus': u'assigned',
 u'SiteBlacklist': [],
 u'SiteWhitelist': [u'T1_IT_CNAF',
                    u'T2_DE_DESY',
                    u'T2_AT_Vienna',
                    u'T2_CH_CSCS',
                    u'T2_FI_HIP',
                    u'T2_TW_NCHC',
                    u'T2_UK_SGrid_RALPP',
                    u'T2_FR_GRIF_LLR',
                    u'T2_UK_SGrid_Bristol',
                    u'T2_PT_NCG_Lisbon',
                    u'T2_KR_KNU',
                    u'T1_ES_PIC',
                    u'T1_UK_RAL',
                    u'T2_IT_Legnaro',
                    u'T2_IT_Rome',
                    u'T2_UK_London_Brunel',
                    u'T2_RU_JINR',
                    u'T2_IT_Pisa',
                    u'T2_US_Vanderbilt',
                    u'T2_IN_TIFR',
                    u'T2_FR_CCIN2P3',
                    u'T2_CH_CERN_HLT',
                    u'T1_FR_CCIN2P3',
                    u'T2_FR_GRIF_IRFU',
                    u'T0_CH_CERN',
                    u'T2_UK_London_IC',
                    u'T2_IT_Bari',
                    u'T2_ES_CIEMAT',
                    u'T1_DE_KIT',
                    u'T2_FR_IPHC',
                    u'T2_RU_IHEP',
                    u'T2_HU_Budapest',
                    u'T2_CN_Beijing',
                    u'T2_US_MIT',
                    u'T2_BE_IIHE',
                    u'T2_CH_CERN',
                    u'T2_PL_Swierk',
                    u'T1_RU_JINR'],
 u'SoftTimeout': 159600,
 u'Team': 'production',
 u'TrustPUSitelists': False,
 u'TrustSitelists': False,
 u'UnmergedLFNBase': '/store/unmerged'}

global_workqueue logs show the following:

2017-03-09 01:43:27,115:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524/spec"
2017-03-09 01:43:29,436:INFO:WorkQueue:Splitting /pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524/Production with policy MonteCarlo params = {'ResubmitBlock': {'args': {}, 'name': 'ResubmitBlock'}, 'MonteCarlo': {'args': {}, 'name': 'MonteCarlo'}, 'Dataset': {'args': {}, 'name': 'Dataset'}, 'Block': {'args': {}, 'name': 'Block'}, 'DatasetBlock': {'args': {}, 'name': 'Block'}}
2017-03-09 01:43:29,444:INFO:WorkQueue:Queuing element 2c0da21b2d78e9d93c600de309c834b7 for /pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524/Production with 1000 job(s) split with MonteCarlo on events 1-478000
...
2017-03-09 01:43:29,452:INFO:WorkQueue:Queuing element 8815cb3f41831b7ab1ef7e79a3999c5e for /pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524/Production with 762 job(s) split with MonteCarlo on events 29636001-30000000
2017-03-09 01:43:58,541:INFO:WorkQueue:Split work for request(s): "pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524"
...
...
2017-03-09 01:49:27,157:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524/spec"
2017-03-09 01:49:27,799:INFO:WorkQueue:Resume splitting of "pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524"
2017-03-09 01:49:28,620:INFO:WorkQueue:Request "pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524" already split - Resuming
2017-03-09 01:49:28,620:INFO:WorkQueue:Split work for request(s): "pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524"

As can be seen, queueWork() runs twice for the same request, so it tries to split the work a second time. Checking...
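
To spot this pattern in a longer log, something like the following could help. It is only a sketch: it assumes the log keeps the exact `queueWork() begin queueing "<url>"` wording shown above, and the helper names (`count_queue_attempts`, `requeued_specs`) are made up for illustration, not part of WMCore.

```python
import re
from collections import Counter

# Assumption: lines look like the global_workqueue excerpt quoted above.
QUEUE_RE = re.compile(r'queueWork\(\) begin queueing "([^"]+)"')

# Shortened copy of the relevant log lines from this issue.
SPEC_URL = ("https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/"
            "pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524/spec")
SAMPLE_LOG = [
    '2017-03-09 01:43:27,115:INFO:WorkQueue:queueWork() begin queueing "%s"' % SPEC_URL,
    '2017-03-09 01:43:58,541:INFO:WorkQueue:Split work for request(s): '
    '"pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524"',
    '2017-03-09 01:49:27,157:INFO:WorkQueue:queueWork() begin queueing "%s"' % SPEC_URL,
]


def count_queue_attempts(log_lines):
    """Count how many times queueWork() started for each spec URL."""
    attempts = Counter()
    for line in log_lines:
        match = QUEUE_RE.search(line)
        if match:
            attempts[match.group(1)] += 1
    return attempts


def requeued_specs(log_lines):
    """Return the spec URLs that queueWork() started on more than once."""
    attempts = count_queue_attempts(log_lines)
    return sorted(url for url, count in attempts.items() if count > 1)


if __name__ == "__main__":
    for url in requeued_specs(SAMPLE_LOG):
        print("queued more than once:", url)
```

A repeated queueWork() on its own does not prove anything went wrong (the second pass logs "already split - Resuming"), so this only narrows down which requests to look at.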

amaltaro commented 7 years ago

I didn't manage to reproduce this issue in my VM. Just in case, this is the content of one of the GQE:

{"_id":"029bec0a615bdfa28e2a252e34e8cc71","_rev":"1-de6deb4cc51f0dab404f9d253e5760da","updatetime":1489020225.3563029766,"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"ParentQueueId":"pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524","NumberOfEvents":478000,"StartPolicy":"MonteCarlo","CreationTime":0,"NumOfFilesAdded":0,"Priority":85000,"ParentData":{},"PileupData":{},"RejectedInputs":[],"ParentFlag":false,"TaskName":"Production","Status":"Available","Inputs":{},"NumberOfLumis":5000.0,"Jobs":1000.0,"ParentQueueUrl":null,"ChildQueueUrl":null,"PercentSuccess":0,"PercentComplete":0,"WMBSUrl":null,"OpenForNewData":false,"EndPolicy":{"policyName":"SingleShot"},"FilesProcessed":0,"SiteWhitelist":[],"ACDC":{},"NumberOfFiles":0,"TeamName":"production","NoPileupUpdate":false,"RequestName":"pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524","SubscriptionId":null,"NoInputUpdate":false,"blowupFactor":1.0,"TimestampFoundNewData":0,"ProcessedInputs":[],"SiteBlacklist":[],"Mask":{"LastRun":1,"FirstRun":1,"inclusivemask":true,"runAndLumis":{},"LastEvent":17208000,"FirstEvent":16730001,"LastLumi":180000,"FirstLumi":175001},"Dbs":null,"EventsWritten":0},"timestamp":1489020225.3563029766,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement"}
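
A quick way to check a dumped element against the assignment dict is sketched below. The document layout (the `WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement` key holding `SiteWhitelist`) is taken from the GQE above; the `missing_sites` helper and the trimmed sample document are hypothetical and only meant for illustration.

```python
import json

# Key under which the element fields are stored in the GQE document above.
WQE_KEY = "WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement"

# Trimmed copy of the GQE from this issue (most fields dropped for brevity).
SAMPLE_GQE = json.loads("""
{"_id": "029bec0a615bdfa28e2a252e34e8cc71",
 "WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement": {
     "RequestName": "pdmvserv_HIN-pPb502Winter16GS-00003_00003_v2_MB_170309_001506_524",
     "StartPolicy": "MonteCarlo",
     "Status": "Available",
     "SiteWhitelist": [],
     "SiteBlacklist": []},
 "type": "WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement"}
""")


def missing_sites(element_doc, assigned_whitelist):
    """Return the assigned sites that are absent from the element's SiteWhitelist."""
    element = element_doc.get(WQE_KEY, {})
    element_sites = set(element.get("SiteWhitelist", []))
    return sorted(site for site in set(assigned_whitelist) if site not in element_sites)


if __name__ == "__main__":
    # Only two of the assigned sites are listed here to keep the example short.
    assigned = ["T1_IT_CNAF", "T2_DE_DESY"]
    print("sites missing from the element:", missing_sites(SAMPLE_GQE, assigned))
```

For the element above every assigned site comes back as missing, which matches the empty SiteWhitelist reported in this issue.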

vlimant commented 7 years ago

https://cmsweb.cern.ch/reqmgr2/fetch?rid=prebello_RVCMSSW_9_0_0NuGun__resub_170324_153902_8575 is another instance with https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/element/fe0f5c6d954f2e43daa5c37e88a73ef4

amaltaro commented 7 years ago

I was going to say it's related to the fact that this TaskChain starts with MC from scratch AND has the TrustSitelists flag enabled. However, the previous workflow reported was not assigned with any of those Trust flags enabled.

I'm looking at this issue and we should have it fixed for the next cmsweb production upgrade (next week).