Open DickyChant opened 3 days ago
I think handling this properly would require adding error state(s) to the request state machine. However, touching such a fundamental piece is very risky given the state of disarray the code is in.
What kind of requests would you manually send to submit-done? The meaning of this state is that the events are available in DAS. Checking it should be automatic.
Looking at your monitoring page, the only category that might be sent to submit-done is "Dead", not "StatusNew", reqmgr seems fine with "announced" or "normal-archived". The first request in this category is marked done on the computing side but the dataset status in DAS is still PRODUCTION. So at least for this one there is a CompOps problem and McM is right to not mark it as submit-done.
What we are missing, at a minimum, is a way to toggle requests to "done", either "by hand" or from a list; a (one-day-old) list is in toggle_done.json
To add, "announced" or "normal-archived", although being intermediate stages, actually mean that the samples can be used, so from our point of view it would be beneficial to have a way of doing such...
This could likely be done by (re)enabling /restapi/requests/inspect_submitted
inspect_submitted is not doing what it is fully expected to do
one subclass: concerning all those https://cms-pdmv-prod.web.cern.ch/mcm/requests?flown_with=flowRunIISummer20UL16RECOWmassALCA someone went and edited the sequence somehow, leaving 'datatier': ['GEN-SIM-RECO,ALCARECO'] instead of 'datatier': ['GEN-SIM-RECO','ALCARECO'],
https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L1119
that one : https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=GEN-Run3Summer23BPixwmLHEGS-00456 got resubmitted, but the reqmgr_name was not reset properly, leaving the content of https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_GEN-Run3Summer23BPixwmLHEGS-00456__v1_T_240311_122338_5567 in the way of checking and setting this done https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2263
Describing specific problems is usually more useful than asking for a specific solution. A solution might already exist that you don't know about and is not the one you came up with.
I'm reluctant to introduce an API endpoint to set requests to done because I think this would be a quite error-prone action with no going back. Nevertheless we have this:
https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/main.py#L309
inspect_submitted is not doing what it is fully expected to do
Then describe your expectations and let's fix the bug instead of piling up another hack on top of the existing stack.
To add, "announced" or "normal-archived", although being intermediate stages, actually mean that the samples can be used, so from our point of view it would be beneficial to have a way of doing such...
McM also requires the dataset to be VALID:
To me datasets that stay in PRODUCTION status while the wf is in announced or normal-archived is a CompOps issue. Like this one: DAS, CompOps. Whenever this is fixed by CompOps McM will move forward (it may take a manual Stats refresh but this we can add).
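The condition described in this comment can be sketched as below. This is only an illustration and not McM's actual code: the function name, the constant and the way the statuses are passed in are assumptions; only the status strings themselves come from this thread.

```python
# Hypothetical helper, not taken from McM. Only the status strings
# ("announced", "normal-archived", "VALID") come from the discussion above.
READY_WORKFLOW_STATUSES = {"announced", "normal-archived"}


def can_move_to_done(workflow_status, dataset_statuses):
    """Return True when the workflow is finished and every output dataset is VALID."""
    if workflow_status not in READY_WORKFLOW_STATUSES:
        return False
    # An empty list means no output dataset was found, so the request stays as it is.
    return bool(dataset_statuses) and all(s == "VALID" for s in dataset_statuses)
```

With such a check, a workflow in normal-archived whose dataset is still in PRODUCTION stays blocked until the dataset status is fixed upstream.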
one subclass: concerning all those https://cms-pdmv-prod.web.cern.ch/mcm/requests?flown_with=flowRunIISummer20UL16RECOWmassALCA someone went and edited the sequence somehow, leaving 'datatier': ['GEN-SIM-RECO,ALCARECO'] instead of 'datatier': ['GEN-SIM-RECO','ALCARECO'],
That's quite common, you'll see this in campaigns and flows for datatier, eventcontent, and step. You'll even find some mixed cases like ["A,B", "C"]. As long as the cmsDrivers work people don't seem to care.
that one : https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=GEN-Run3Summer23BPixwmLHEGS-00456 got resubmitted, but the reqmgr_name was not reset properly, leaving the content of https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_GEN-Run3Summer23BPixwmLHEGS-00456__v1_T_240311_122338_5567 in the way of checking and setting this done
Any idea how this happened? Is there a bug somewhere that needs fixing? IMHO understanding and fixing the bug would help us more than setting individual requests to done.
Many thanks to all for the highlights. I see several cases. I would like to include some details:
Related to:
one subclass: concerning all those https://cms-pdmv-prod.web.cern.ch/mcm/requests?flown_with=flowRunIISummer20UL16RECOWmassALCA someone went and edited the sequence somehow, leaving 'datatier': ['GEN-SIM-RECO,ALCARECO'] instead of 'datatier': ['GEN-SIM-RECO','ALCARECO'],
That's quite common, you'll see this in campaigns and flows for datatier, eventcontent, and step. You'll even find some mixed cases like ["A,B", "C"]. As long as the cmsDrivers work people don't seem to care.
The format is relevant for the datatier attribute, which is used by the collect_output function. If it is not properly formatted and does not match the expected tiers:
the output is not properly retrieved and the transition to done will fail by:
I patched an example of this in the past, PPD-Phase2Spring24DIGIRECOMiniAOD-00020, and after inspecting it again the request was set to done. We could explore retrieving the subset of Dead requests with this kind of behavior, patching the attribute and inspecting them again; ideally this should solve the problem.
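Selecting the requests that need this patch could look roughly like the sketch below. It assumes each request is available as a dict with a "prepid" and a "sequences" list whose entries carry a "datatier" list; that layout and both function names are assumptions made for illustration and are not taken from McM.

```python
def has_joined_datatiers(request):
    """True if any sequence stores several tiers in one comma-separated string."""
    # Assumed layout: request["sequences"] is a list of dicts with a "datatier" list.
    for sequence in request.get("sequences", []):
        for tier in sequence.get("datatier", []):
            if "," in tier:
                return True
    return False


def select_for_patching(requests):
    """Return the prepids of requests whose datatier entries would need patching."""
    return [r["prepid"] for r in requests if has_joined_datatiers(r)]
```

The selected prepids could then be patched and inspected again one at a time, which keeps each change easy to review.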
Related to:
that one : https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=GEN-Run3Summer23BPixwmLHEGS-00456 got resubmitted, but the reqmgr_name was not reset properly, leaving the content of https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_GEN-Run3Summer23BPixwmLHEGS-00456__v1_T_240311_122338_5567 in the way of checking and setting this done
Any idea how this happened? Is there a bug somewhere that needs fixing? IMHO understanding and fixing the bug would help us more than setting individual requests to done.
I agree with @lmoureaux. @hassan11196, could you provide more details about why the ReqMgr2 request cmsunified_task_GEN-Run3Summer23BPixwmLHEGS-00444__v1_T_240221_073331_395 is in normal-archived but its output datasets are invalid, for instance: /TbarWplusto4Q_MT-171p5_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer23BPixDRPremix-130X_mcRun3_2023_realistic_postBPix_v5-v2/AODSIM? Shouldn't this ReqMgr2 request be in the rejected-archived status?
To conclude, related to:
To me datasets that stay in PRODUCTION status while the wf is in announced or normal-archived is a CompOps issue. Like this one: DAS, CompOps. Whenever this is fixed by CompOps McM will move forward (it may take a manual Stats refresh but this we can add).
I agree again. I see the Unified status in away, @hassan11196 shouldn't it be in the closed status?
Regarding data tiers, this is the logic I use to sanitize them in my database imports.
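A minimal sketch of what such sanitizing could look like, not the commenter's actual import code: it flattens comma-joined entries, including mixed cases like ["A,B", "C"], into one tier per list element. The function name is hypothetical.

```python
def normalize_tiers(value):
    """Flatten entries such as ["A,B", "C"] into ["A", "B", "C"]."""
    # Accept a bare string as well as a list of strings.
    if isinstance(value, str):
        value = [value]
    tiers = []
    for entry in value:
        for tier in entry.split(","):
            tier = tier.strip()
            if tier:  # skip empty pieces left by stray commas
                tiers.append(tier)
    return tiers
```

Applied to the example from this thread, `normalize_tiers(['GEN-SIM-RECO,ALCARECO'])` gives `['GEN-SIM-RECO', 'ALCARECO']`.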
Is your feature related to a problem?
On McM, some requests are tagged as Dead. The tagging happens when a request has its corresponding reqmgr status stalled at an intermediate stage of production for a period of time or without a reqmgr name. The reason for a Dead tag would get even more complicated. Currently, there is no feature on McM to tackle those requests, leading to maybe years of delay for certain productions.
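The tagging rule described above can be sketched roughly as follows. This is not the monitoring page's actual logic: the function name, its arguments and the 30-day threshold are all assumptions chosen for illustration, and the caller is assumed to pass only requests that have not yet reached done.

```python
from datetime import datetime, timedelta

# Assumed threshold; the issue only says "a period of time".
MAX_STALL = timedelta(days=30)


def is_dead(reqmgr_name, last_status_change, now, max_stall=MAX_STALL):
    """Tag a request as Dead if it has no reqmgr name or its status has stalled."""
    if not reqmgr_name:
        return True
    # Stalled: the reqmgr status has not changed for longer than the threshold.
    return now - last_status_change > max_stall
```

Keeping the two conditions separate in the output, rather than returning a single flag, would be a natural next step, since the issue notes that different kinds of Dead requests need different reactions.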
Describe the solution you'd like
As described above, the solution would have two aspects:
Dead requests, which would need different reactions
A good reference can be found at /afs/cern.ch/user/v/vlimant/public/ppd/investigate.py.
Current behavior
Currently, only labeling/tagging exists.
Expected behavior
Let's discuss how exactly we want these to work, but a minimal feature would be to be able to toggle the needed requests to done status.
Several bonus points could be:
StatusNew if we go beyond toggling a set of requests to done.
@vlimant and @hassan11196