dmwm / WMCore

Core workflow management components for CMS.

MSOutput failing to create Tape data placement - missing RSE attribute #12044

Closed amaltaro closed 1 month ago

amaltaro commented 1 month ago

Impact of the bug: MSOutput

Describe the bug: While further debugging the MSRuleCleaner issue https://github.com/dmwm/WMCore/issues/12042, in which the number of workflows waiting to be archived kept increasing, it turned out that MSOutput is potentially the root cause. For at least 5 days now it has been failing to consume the workflow documents and create the final Tape output data placement, which holds those workflows back from being archived. Logs from today can be found in [1].

How to reproduce it: Remove one of the Tape RSE endpoint attributes that is expected to be there(?)

Expected behavior: Tape data placement should be performed without any issues. That includes having Rucio and the relevant RSEs properly configured, consistent with what MSOutput is configured to use, such as:

data.rucioTapeExpression = "rse_type=TAPE&wmcore_output_tape=True\cms_type=test"
data.rucioRSEAttribute = "ddm_quota"  # UPDATE: it is meant to be dm_weight
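
To illustrate why the attribute name matters, here is a minimal sketch. It assumes the selection step keeps only the RSEs that carry the configured attribute and uses its value as a weight. The helper `collectWeights` and the RSE names and attribute values are hypothetical, made up for illustration; they are not taken from WMCore or from real Rucio output.

```python
def collectWeights(rseAttrs, attrName):
    """
    Hypothetical helper (not WMCore code): keep only the RSEs that
    define `attrName` and return their names and numeric weights.
    """
    names, weights = [], []
    for rse, attrs in rseAttrs.items():
        if attrName not in attrs:
            # RSE does not carry the configured attribute: skip it
            continue
        names.append(rse)
        weights.append(float(attrs[attrName]))
    return names, weights


# Made-up attribute dictionaries, for illustration only
sampleAttrs = {
    "T1_XX_Example_Tape": {"dm_weight": "100", "requires_approval": False},
    "T1_YY_Example_Tape": {"dm_weight": "50", "requires_approval": True},
}

# With an attribute name that no RSE defines, nothing survives the filter
print(collectWeights(sampleAttrs, "ddm_quota"))
# With the attribute that the RSEs do define, both are candidates
print(collectWeights(sampleAttrs, "dm_weight"))
```

Under these assumptions, a mismatch between the configured attribute name and what the RSEs actually expose would leave the candidate list empty.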

Additional context and error message: [1]

2024-07-17 00:18:11,124:INFO:MSOutput: All the disk requests succeeded for: cmsunified_task_EXO-RunIISummer20UL18wmLHEGEN-04997__v1_T_240226_151822_904. Marking it as 'done'
2024-07-17 00:18:11,131:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T1_ES_PIC_Tape/attr/ HTTP/1.1" 200 502
2024-07-17 00:18:11,137:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T1_DE_KIT_Tape/attr/ HTTP/1.1" 200 829
2024-07-17 00:18:11,144:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T1_UK_RAL_Tape/attr/ HTTP/1.1" 200 563
2024-07-17 00:18:11,150:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T1_US_FNAL_Tape/attr/ HTTP/1.1" 200 604
2024-07-17 00:18:11,157:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T1_IT_CNAF_Tape/attr/ HTTP/1.1" 200 549
2024-07-17 00:18:11,162:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T1_FR_CCIN2P3_Tape/attr/ HTTP/1.1" 200 537
2024-07-17 00:18:11,162:ERROR:MSOutput: MSOutputConsumer PipelineNonRelVal General error from pipeline. Err: list index out of range. Will retry again in the next cycle.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSOutput/MSOutput.py", line 527, in msOutputConsumer
    pipeLine.run(docOut)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 137, in run
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 137, in <lambda>
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 69, in __call__
    return self.run(obj)
  File "/usr/local/lib/python3.8/site-packages/Utils/Pipeline.py", line 72, in run
    return self.func(obj, *self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSOutput/MSOutput.py", line 391, in makeTapeSubscriptions
    tapeRSE, requiresApproval = self._getTapeDestination(dataBytesForTape)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSOutput/MSOutput.py", line 461, in _getTapeDestination
    return self.rucio.pickRSE(rseExpression=self.msConfig["rucioTapeExpression"],
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/Rucio/Rucio.py", line 762, in pickRSE
    return weightedChoice(rsesWithApproval, rsesWeight)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/Rucio/RucioUtils.py", line 46, in weightedChoice
    listChoice = random.choices(population=rses, weights=rseWeights, k=1)
  File "/usr/local/lib/python3.8/random.py", line 406, in choices
    total = cum_weights[-1] + 0.0   # convert to float
IndexError: list index out of range
2024-07-17 00:18:11,163:INFO:MSOutput: Processed 0 workflows from pipeline: MSOutputConsumer PipelineNonRelVal
2024-07-17 00:18:11,163:INFO:MSOutput: MSOutputConsumer: Total 5 requests processed. 
2024-07-17 00:18:11,163:INFO:MSManager: Total outputConsumer execution time: 13 secs
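
The traceback above ends inside `random.choices`, at `cum_weights[-1]`, which is consistent with the call receiving an empty list of weights. A possible guard is sketched below. The function name `safeWeightedChoice` is hypothetical and this is not the actual WMCore implementation; it only shows how an explicit check could turn the opaque `IndexError` into a descriptive error.

```python
import random


def safeWeightedChoice(rses, rseWeights):
    """
    Hypothetical variant of a weighted choice (not WMCore code) that
    validates its input before delegating to random.choices.
    """
    if not rses:
        # An empty population would otherwise surface as an IndexError
        raise ValueError("No RSE candidates to choose from; "
                         "check the RSE expression and the RSE attribute name")
    if len(rses) != len(rseWeights):
        raise ValueError("Number of weights does not match the number of RSEs")
    # random.choices returns a list of k items; take the single pick
    return random.choices(population=rses, weights=rseWeights, k=1)[0]


try:
    safeWeightedChoice([], [])
except ValueError as exc:
    print(f"Refusing to pick an RSE: {exc}")
```

Whether the caller should raise, or skip the workflow and retry in the next cycle, is a design choice left open here.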

amaltaro commented 1 month ago

Something is very wrong with CERN GitLab! According to the HEAD of the prod branch, the RSE attribute name is ddm_quota: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/blob/prod/reqmgr2ms-output/config-output.py

However, I remember changing this a few months ago. Here is the merge request: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/244

Somehow it does not even show up in the history of that file: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/commits/prod/reqmgr2ms-output/config-output.py?ref_type=heads