dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Jobs for ACDC wfs not being submitted to sites with RSEs ending with _Disk #12101

Closed hassan11196 closed 2 days ago

hassan11196 commented 1 week ago

Impact of the bug: WMAgent, WorkQueue

Describe the bug: Jobs for an ACDC workflow that reads pileup locally (i.e. TrustPUSitelists: false) are not submitted to a site if its phedex_name / storage name ends with _Disk.

Related to https://github.com/dmwm/WMCore/issues/12012. This issue is similar to the one discussed there, which was recently fixed via a Pull Request.

Now this issue is limited to ACDC workflows.

For example, this ACDC workflow was created after the fix was deployed to the agents: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240917_112116_3408

However, its WorkQueue Pileup locations still include the storage name for the site.

How to reproduce it:

  1. Create an ACDC workflow with TrustPUSitelists: false and assign it only to a site whose storage element name ends with _Disk.
  2. The workflow will remain stuck in the Acquired state and the agents will not create jobs for it.

Expected behavior: Jobs should be properly submitted to all sites in the site whitelist even if the SE and CE names differ, e.g. "T1_US_FNAL_Disk" vs. "T1_US_FNAL". For ACDC workflows, WorkQueue Pileup Locations should contain computing site names instead of storage names.
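To illustrate why the name mismatch matters, here is a minimal sketch of the site matching as I understand it from the description above. It is a simplified model, not the actual WorkQueue code: the function name `sites_able_to_run` and its parameters are hypothetical, and it assumes that with TrustPUSitelists: false a job can only run at whitelisted sites that also appear in the pileup locations.

```python
def sites_able_to_run(site_whitelist, pileup_locations, trust_pu_sitelists=False):
    """Return the sites where jobs could run (simplified, hypothetical model).

    If pileup is read locally (trust_pu_sitelists is False), a site must be
    both whitelisted and listed as a pileup location.
    """
    if trust_pu_sitelists:
        # Pileup is read remotely, so only the whitelist matters
        return sorted(site_whitelist)
    # Plain string comparison: "T1_US_FNAL" never matches "T1_US_FNAL_Disk"
    return sorted(set(site_whitelist) & set(pileup_locations))


# Pileup locations given as storage names: no site matches, nothing is submitted
print(sites_able_to_run(["T1_US_FNAL"], ["T1_US_FNAL_Disk"]))
# Pileup locations given as computing site names: the site matches
print(sites_able_to_run(["T1_US_FNAL"], ["T1_US_FNAL"]))
```

Under this model, a workflow assigned only to T1_US_FNAL has no eligible site as long as the pileup location is recorded as T1_US_FNAL_Disk.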

Additional context and error message These 2 non-ACDC workflows were created after the fix was deployed but also have incorrect WorkQueue Pileup Locations.

  1. https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_TSG-Phase2Spring24wmLHEGS-00006__v1_T_240915_225306_5625

  2. https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_TSG-Phase2Spring24wmLHEGS-00012__v1_T_240914_161424_8704

FYI @amaltaro @anpicci

amaltaro commented 6 days ago

Thank you for reporting this, Ahmed.

Looking into the ACDC workflow/workqueue element under this link https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/element/8d243a44de187913148d3c27a9efb3d4, I can extract the following relevant information:

    "Inputs": {
      "/acdc/cmsunified_ACDC0_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240917_112116_3408/:pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136:TSG-Phase2Spring24wmLHEGS-00001_0:TSG-Phase2Spring24DIGIRECOMiniAOD-00105_0:TSG-Phase2Spring24DIGIRECOMiniAOD-00105_1/0/8": [
        "T2_CH_CERN_HLT",
        "T2_CH_CERN_P5",
        "T2_CH_CERN"
      ]
    },
...
    "SiteWhitelist": [
      "T1_US_FNAL",
      "T2_CH_CERN",
      "T2_CH_CERN_P5"
    ],
    "SiteBlacklist": [],
...
    "PileupData": {
      "/MinBias_TuneCP5_14TeV-pythia8/Phase2Spring24GS-140X_mcRun4_realistic_v4-v1/GEN-SIM": [
        "T1_US_FNAL_Disk",
        "T2_CH_CERN"
      ]
    },

so it looks like the PileupData info above needs to be converted to PSN as well.
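As a rough sketch of the conversion meant here (not the code from the actual fix), the PileupData locations could be mapped from storage names to processing site names along these lines. The mapping dictionary and the function name `pileup_rses_to_psns` are assumptions for illustration; in production the mapping would come from a site information service rather than a hard-coded dict.

```python
# Hypothetical RSE -> PSN mapping, hard-coded only for this illustration
RSE_TO_PSNS = {
    "T1_US_FNAL_Disk": ["T1_US_FNAL"],
    "T2_CH_CERN": ["T2_CH_CERN"],
}


def pileup_rses_to_psns(pileup_data, rse_to_psns):
    """Return a copy of pileup_data with storage names replaced by site names."""
    converted = {}
    for dataset, rses in pileup_data.items():
        psns = set()
        for rse in rses:
            # Names missing from the mapping are kept unchanged,
            # so that no location is silently dropped
            psns.update(rse_to_psns.get(rse, [rse]))
        converted[dataset] = sorted(psns)
    return converted


pileup_data = {
    "/MinBias_TuneCP5_14TeV-pythia8/Phase2Spring24GS-140X_mcRun4_realistic_v4-v1/GEN-SIM": [
        "T1_US_FNAL_Disk",
        "T2_CH_CERN",
    ]
}
print(pileup_rses_to_psns(pileup_data, RSE_TO_PSNS))
```

With the mapping above, the pileup dataset ends up with T1_US_FNAL and T2_CH_CERN as locations, which can then be compared directly against the SiteWhitelist.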

In addition, if I look at the original ACDC collection under: https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136%22&include_docs=true&reduce=false

and search for that specific fileset named after:

  "InitialTaskPath": "/pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136/TSG-Phase2Spring24wmLHEGS-00001_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_1",

here is one full document:

{'id': '5ed3a93702dc6448baee3f494f3dde32',
 'key': 'pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136',
 'value': {'_rev': '1-c0daf77c40aef82ff969da57b2808d5b',
  '_id': '5ed3a93702dc6448baee3f494f3dde32'},
 'doc': {'_id': '5ed3a93702dc6448baee3f494f3dde32',
  '_rev': '1-c0daf77c40aef82ff969da57b2808d5b',
  'collection_name': 'pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136',
  'collection_type': 'ACDC.CollectionTypes.DataCollection',
  'fileset_name': '/pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136/TSG-Phase2Spring24wmLHEGS-00001_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_1',
  'files': {'/store/unmerged/Phase2Spring24DIGIRECOMiniAOD/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/PU200_Trk1GeV_140X_mcRun4_realistic_v4-v2/2560000/615ee026-93b2-458e-9b25-11ea9f1981ec.root': {'last_event': 0,
    'first_event': 0,
    'lfn': '/store/unmerged/Phase2Spring24DIGIRECOMiniAOD/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/PU200_Trk1GeV_140X_mcRun4_realistic_v4-v2/2560000/615ee026-93b2-458e-9b25-11ea9f1981ec.root',
    'locations': ['T2_CH_CERN'],
    'id': 18796449,
    'checksums': {},
    'events': 1000,
    'merged': '0',
    'size': 64881560795,
    'runs': [{'run_number': 1, 'lumis': [71]}],
    'parents': []}},
  'acdc_version': 2,
  'timestamp': 1725560053.592352}}

so we might also need to check the component that uploads these documents to the ACDCServer (I guess it is ErrorHandler), so that the locations are consistent (I suppose they are really meant to be RSE locations...)

amaltaro commented 6 days ago

Having another look into this, I just noticed that the bug-fix that Kenyi made 10 days ago: https://github.com/dmwm/WMCore/pull/12094

was not yet deployed in production. This is the reason why we still don't have site names defined for PileupData in the workqueue elements.

About ErrorHandler, here is how we deal with the file locations: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/ErrorHandler/ErrorHandlerPoller.py#L206

and I don't think there is anything to be changed on this component, as it is indeed expected to be a list of locations/RSEs.

Sorry for the miscommunication on this, we applied the relevant fix to the agents, but not to central services.

I am moving it over to Waiting right now and we might end up closing this as "not planned" (not actually an issue).

amaltaro commented 3 days ago

@hassan11196 we have pushed in a hot-fix today for Global WorkQueue, version 2.3.5.1. I scanned all of the Resubmission workflows in acquired status in production, but none of them are using pileup data that is available at FNAL, so I could not cross-check this fix.

If you create more ACDC workflows and/or know of any ACDC that requires pileup available at FNAL, can you please check that and/or let us know? Thanks.
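For anyone who wants to cross-check an element by hand, a small helper like the one below could flag PileupData locations that still look like storage names. This is only a sketch: the function name is hypothetical, it assumes the element is already loaded as a Python dict with the "PileupData" layout shown earlier in this thread, and checking for the _Disk suffix is a heuristic rather than a real RSE lookup.

```python
def find_storage_like_pileup_locations(element, suffixes=("_Disk",)):
    """Return {dataset: [locations]} for locations ending in a storage-like suffix."""
    flagged = {}
    for dataset, locations in element.get("PileupData", {}).items():
        # str.endswith accepts a tuple of suffixes
        suspicious = [loc for loc in locations if loc.endswith(suffixes)]
        if suspicious:
            flagged[dataset] = suspicious
    return flagged


element = {
    "SiteWhitelist": ["T1_US_FNAL", "T2_CH_CERN", "T2_CH_CERN_P5"],
    "PileupData": {
        "/MinBias_TuneCP5_14TeV-pythia8/Phase2Spring24GS-140X_mcRun4_realistic_v4-v1/GEN-SIM": [
            "T1_US_FNAL_Disk",
            "T2_CH_CERN",
        ]
    },
}
print(find_storage_like_pileup_locations(element))
```

An empty result means no location with that suffix was found. It does not prove that every location is a valid computing site name.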

hassan11196 commented 3 days ago

Hello @amaltaro, thank you for the hotfix. I can confirm that it's working for ACDCs. I created this ACDC request and its WorkQueue Pileup locations are as expected: cmsunified_ACDC0_task_TSG-Phase2Spring24GS-00152__v1_T_240926_211603_3410


Thank you.

amaltaro commented 2 days ago

Awesome! Thank you for promptly looking into and validating this. With that, I am closing this issue as "Not Planned", as it has actually been fixed by another Issue/PR. Thanks again!