dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

MSOutput: workflows are not marked as done if the service is in dry-run mode #9796

Closed amaltaro closed 4 years ago

amaltaro commented 4 years ago

Impact of the bug: ReqMgr2MS - MSOutput

Describe the bug: If workflows are not marked as done, they keep getting evaluated cycle after cycle. This means that, once the service is fully enabled in production, it will go through all the documents already available in the database (unless we drop the database during the deployment).

How to reproduce it: none

Expected behavior: Once MSOutputConsumer has successfully consumed a workflow output placement, whether the service is fully operational or in dry-run mode, we should update the transfer document and mark it as done, so that the workflow doesn't come up again in the next cycle. Of course, if there is any error while handling that workflow, it still needs to be kept for future cycles.

In addition, we should distinguish between dry-run mode and a broken transfer submission, so that we know what can and cannot be marked as done, instead of logging an ambiguous warning such as:

2020-07-04 03:41:54,127:WARNING:MSOutput: No data found in ddmResults for amaltaro_RVCMSSW_7_0_0_pre11QCD_Ht-100To250_TuneZ2star_8TeV_madgraph-tauola_140411_203127_5403. Either dry run mode or broken transfer submission to DDM. ddmResults: 
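
To make the expected behavior concrete, here is a minimal sketch of how the two cases could be told apart before updating the transfer document. It is not the MSOutput implementation: the names `Outcome`, `classify_outcome` and `update_transfer_doc` are hypothetical, the `"done"` status value is an assumption based on the wording above, and `ddm_results` is assumed to be a plain list of transfer IDs. Only the `transferStatus` and `transferIDs` fields come from the document shown in the logs.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for what happened to one workflow in a cycle."""
    DRY_RUN = "dry-run"
    SUBMITTED = "submitted"
    FAILED = "failed"


def classify_outcome(dry_run, ddm_results):
    """Tell a dry-run cycle apart from a broken transfer submission.

    Assumption: `ddm_results` is a list of transfer IDs, empty when
    nothing was submitted.
    """
    if dry_run:
        return Outcome.DRY_RUN
    if ddm_results:
        return Outcome.SUBMITTED
    # Not in dry-run mode and nothing came back: the submission is broken.
    return Outcome.FAILED


def update_transfer_doc(doc, dry_run, ddm_results):
    """Return an updated copy of the transfer document and the outcome."""
    outcome = classify_outcome(dry_run, ddm_results)
    updated = dict(doc)
    if outcome is Outcome.FAILED:
        # Leave the document untouched so it is retried in a future cycle.
        return updated, outcome
    # Assumption: "done" is the status value that excludes the workflow
    # from the next cycle.
    updated["transferStatus"] = "done"
    if outcome is Outcome.SUBMITTED:
        updated["transferIDs"] = list(ddm_results)
    return updated, outcome
```

With this split, the log message can name the actual cause (dry-run or failed submission) instead of listing both possibilities, and only failed workflows stay in the queue.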

Additional context and error message: One example among many others in the logs on vocms0731:

2020-07-04 03:41:54,123:INFO:MSOutput: Making transfer subscriptions for amaltaro_ACDC_SC_ReDigi_Harvest_HG2006_Val_200604_213413_8537
2020-07-04 03:41:54,124:WARNING:MSOutput: No data found in ddmResults for amaltaro_ACDC_SC_ReDigi_Harvest_HG2006_Val_200604_213413_8537. Either dry run mode or broken transfer submission to DDM. ddmResults: 
[{'cache': None,
  'group': 'DataOps',
  'item': [u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/MINIAODSIM',
           u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/DQMIO',
           u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-RECO',
           u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch2-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-DIGI-RAW',
           u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch1-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM'],
  'n': None,
  'site': [u'T2_*', u'T1_*_Disk']}]
2020-07-04 03:41:54,125:INFO:MSOutput: MSOutputConsumer:140230885041920@vocms0731.cern.ch: PipelineNonRelVal: Processed 'msOutDoc' with '_id': amaltaro_ACDC_SC_ReDigi_Harvest_HG2006_Val_200604_213413_8537.
2020-07-04 03:41:54,126:DEBUG:MSOutput: {u'Campaign': u'HG2006_Val',
 u'OutputDatasets': [u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch1-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM',
                     u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch2-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-DIGI-RAW',
                     u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-RECO',
                     u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/DQMIO',
                     u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/MINIAODSIM'],
 u'RequestName': u'amaltaro_ACDC_SC_ReDigi_Harvest_HG2006_Val_200604_213413_8537',
 u'_id': u'amaltaro_ACDC_SC_ReDigi_Harvest_HG2006_Val_200604_213413_8537',
 u'campaignOutputMap': [{u'campaignName': u'HG2006_Val',
                         u'datasets': [u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch1-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM',
                                       u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch2-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-DIGI-RAW',
                                       u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-RECO',
                                       u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/DQMIO',
                                       u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/MINIAODSIM']}],
 u'creationTime': 1593776502,
 u'destination': [u'T2_*', u'T1_*_Disk'],
 u'destinationOutputMap': [{u'datasets': [u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/MINIAODSIM',
                                          u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/DQMIO',
                                          u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch3-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-RECO',
                                          u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch2-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM-DIGI-RAW',
                                          u'/RelValQCD_FlatPt_15_3000HS_13/CMSSW_10_6_1_patch1-SC_ReDigi_Harvest_HG2006_Val_Alanv11-v11/GEN-SIM'],
                            u'destination': [u'T2_*', u'T1_*_Disk']}],
 u'isRelVal': False,
 u'isTaken': False,
 u'isTakenBy': None,
 u'lastUpdate': 1593826914,
 u'numberOfCopies': None,
 u'transferIDs': None,
 u'transferStatus': u'incomplete'}

todor-ivanov commented 4 years ago

@amaltaro I considered both options (marking them as done and not marking them), but in the end I deliberately left the workflows processed in dry-run mode unmarked, because we do not actually act on any of them. Also, if we marked them, we would have only a single log of no more than 30 workflows in testbed, and that would be our whole testing sample, which I considered not nearly enough. But yes, you are right about the historical records in the database. I will switch the behavior of the service.

amaltaro commented 4 years ago

Thanks Todor. There is no need to push it over the weekend though - I do not think it's a blocker for production - so we can target the August production deployment.

todor-ivanov commented 4 years ago

Totally agreed!