cms-PdmV / cmsPdmV

CERN CMS McM repository

Tackling `Dead` tagged requests #1144

Open DickyChant opened 3 days ago

DickyChant commented 3 days ago

Is your feature related to a problem?

On McM, some requests are tagged as Dead.

A request is tagged when its corresponding reqmgr status has stalled at an intermediate stage of production for some period of time, or when it has no reqmgr name. The actual reasons behind a Dead tag can be more complicated still.

Currently, McM has no feature to tackle those requests, which can delay certain productions by years.

Describe the solution you'd like

As described above, the solution would have two aspects:

  1. A detailed breakdown of different categories of Dead requests, which would need different reactions
  2. Functionality that handles each category accordingly.

A good reference can be found at /afs/cern.ch/user/v/vlimant/public/ppd/investigate.py.

Current behavior

Currently, only labeling/tagging exists

Expected behavior

Let's discuss how exactly we want these to work, but a minimal feature would be the ability to toggle the needed requests to done status.

Several bonus points could be:

  1. Further integrate the checks into the frontend, or implement them as a periodically running inspection (at least a monitoring page similar to this one?)
  2. This could also be done for StatusNew if we go beyond toggling a set of requests to done.

@vlimant and @hassan11196

lmoureaux commented 3 days ago

I think handling this properly would require adding error state(s) to the request state machine. However, touching such a fundamental piece is very risky given the state of disarray the code is in.

What kind of requests would you manually send to submit-done? The meaning of this state is that the events are available in DAS. Checking it should be automatic.

Looking at your monitoring page, the only category that might be sent to submit-done is "Dead", not "StatusNew"; reqmgr seems fine with "announced" or "normal-archived". The first request in this category is marked done on the computing side, but the dataset status in DAS is still PRODUCTION. So at least for this one there is a CompOps problem, and McM is right not to mark it as submit-done.

vlimant commented 3 days ago

what we miss, minimally, is a way to toggle to "done", either "by hand" or from a list; a (one-day-old) list is toggle_done.json

DickyChant commented 3 days ago

To add, "announced" or "normal-archived", although being intermediate stages, actually mean that the samples can be used, so from our point of view it would be beneficial to have a way of marking such requests as done...

vlimant commented 3 days ago

(re)enabling /restapi/requests//force could likely be sufficient, although, as seen at https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2219, inspect_submitted is not doing what it is fully expected to do

vlimant commented 3 days ago

one subclass: concerning all those https://cms-pdmv-prod.web.cern.ch/mcm/requests?flown_with=flowRunIISummer20UL16RECOWmassALCA someone went and edited the sequence somehow, so it has 'datatier': ['GEN-SIM-RECO,ALCARECO'] instead of 'datatier': ['GEN-SIM-RECO','ALCARECO'], see https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L1119

vlimant commented 3 days ago

that one : https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=GEN-Run3Summer23BPixwmLHEGS-00456 got resubmitted, but the reqmgr_name was not reset properly, leaving the content of https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_GEN-Run3Summer23BPixwmLHEGS-00456__v1_T_240311_122338_5567 in the way of checking and setting this done https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2263

lmoureaux commented 3 days ago

Describing specific problems is usually more useful than asking for a specific solution. A solution might already exist that you don't know about and is not the one you came up with.

I'm reluctant to introduce an API endpoint to set requests to done because I think this would be quite an error-prone action with no going back. Nevertheless, we have this:

https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/main.py#L309

inspect_submitted is not doing what it is fully expected to do

Then describe your expectations and let's fix the bug instead of piling up another hack on top of the existing stack.

To add, "announced" or "normal-archived", although being intermediate stages, actually mean that the samples can be used, so from our point of view it would be beneficial to have a way of marking such requests as done...

McM also requires the dataset to be VALID:

https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2211

To me, datasets that stay in PRODUCTION status while the wf is in announced or normal-archived are a CompOps issue. Like this one: DAS, CompOps. Whenever this is fixed by CompOps, McM will move forward (it may take a manual Stats refresh, but this we can add).
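As a rough illustration of the condition discussed here, the sketch below combines the two requirements mentioned in this thread: the workflow is in "announced" or "normal-archived", and the output datasets are VALID. The function name `ready_for_done`, the constant, and the plain-string inputs are assumptions made for illustration only; this is not McM's code, and the actual check is the one in request.py linked above.

```python
# Workflow statuses that, per the discussion above, mean the samples are usable.
USABLE_WORKFLOW_STATUSES = {"announced", "normal-archived"}


def ready_for_done(workflow_status, dataset_statuses):
    """Return True only if the workflow is usable and every output dataset is VALID.

    Hypothetical helper: both arguments are plain strings / lists of strings,
    not McM or ReqMgr2 objects.
    """
    if workflow_status not in USABLE_WORKFLOW_STATUSES:
        return False
    if not dataset_statuses:
        # No known output datasets: nothing to validate, so do not move forward.
        return False
    return all(status == "VALID" for status in dataset_statuses)
```

With this reading, a workflow in normal-archived whose dataset is still in PRODUCTION is held back, which matches the behavior described for the first request in the "Dead" category.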

one subclass: concerning all those https://cms-pdmv-prod.web.cern.ch/mcm/requests?flown_with=flowRunIISummer20UL16RECOWmassALCA someone went and edited the sequence somehow, so it has 'datatier': ['GEN-SIM-RECO,ALCARECO'] instead of 'datatier': ['GEN-SIM-RECO','ALCARECO'],

That's quite common, you'll see this in campaigns and flows for datatier, eventcontent, and step. You'll even find some mixed cases like ["A,B", "C"]. As long as the cmsDrivers work people don't seem to care.

that one : https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=GEN-Run3Summer23BPixwmLHEGS-00456 got resubmitted, but the reqmgr_name was not reset properly, leaving the content of https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_GEN-Run3Summer23BPixwmLHEGS-00456__v1_T_240311_122338_5567 in the way of checking and setting this done

Any idea how this happened? Is there a bug somewhere that needs fixing? IMHO understanding and fixing the bug would help us more than setting individual requests to done.

ggonzr commented 2 days ago

Many thanks to all for the highlights. I see several cases. I would like to include some details:

Related to:

one subclass: concerning all those https://cms-pdmv-prod.web.cern.ch/mcm/requests?flown_with=flowRunIISummer20UL16RECOWmassALCA someone went and edited the sequence somehow, so it has 'datatier': ['GEN-SIM-RECO,ALCARECO'] instead of 'datatier': ['GEN-SIM-RECO','ALCARECO'],

That's quite common, you'll see this in campaigns and flows for datatier, eventcontent, and step. You'll even find some mixed cases like ["A,B", "C"]. As long as the cmsDrivers work people don't seem to care.

The format matters for the datatier attribute because it is used by the collect_output function. If it is not properly formatted and does not match the expected tiers:

https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2155-L2166

the output is not properly retrieved and the transition to done will fail at:

https://github.com/cms-PdmV/cmsPdmV/blob/018739d475bc33509f0b64be877445f4fb103d14/mcm/json_layer/request.py#L2246-L2251

I patched an example of this in the past, PPD-Phase2Spring24DIGIRECOMiniAOD-00020, and after inspecting it again the request was set to done. We could explore retrieving the subset of Dead requests with this kind of behavior, patching the attribute, and inspecting them again; ideally this should solve the problem.
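A minimal sketch of what such a patch could look like, assuming the datatier value is a plain list of strings as in the example quoted above. The helper names `normalize_tiers` and `needs_patch` are hypothetical and are not part of McM; selecting the affected requests and saving them back is left out.

```python
def normalize_tiers(tiers):
    """Split comma-joined entries so that each data tier is its own list item."""
    normalized = []
    for entry in tiers:
        for tier in entry.split(","):
            tier = tier.strip()
            if tier:  # skip empty pieces left by stray commas
                normalized.append(tier)
    return normalized


def needs_patch(tiers):
    """Return True when normalizing would change the list."""
    return normalize_tiers(tiers) != list(tiers)
```

For instance, `normalize_tiers(['GEN-SIM-RECO,ALCARECO'])` gives `['GEN-SIM-RECO', 'ALCARECO']`, and a mixed case such as `["A,B", "C"]` becomes `["A", "B", "C"]`. Only requests for which `needs_patch` is true would need to be updated and inspected again.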

Related to:

that one : https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=GEN-Run3Summer23BPixwmLHEGS-00456 got resubmitted, but the reqmgr_name was not reset properly, leaving the content of https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_GEN-Run3Summer23BPixwmLHEGS-00456__v1_T_240311_122338_5567 in the way of checking and setting this done

Any idea how this happened? Is there a bug somewhere that needs fixing? IMHO understanding and fixing the bug would help us more than setting individual requests to done.

I agree with @lmoureaux. @hassan11196, could you provide more details about why the ReqMgr2 request cmsunified_task_GEN-Run3Summer23BPixwmLHEGS-00444__v1_T_240221_073331_395 is in normal-archived but its output datasets are invalid, for instance: /TbarWplusto4Q_MT-171p5_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer23BPixDRPremix-130X_mcRun3_2023_realistic_postBPix_v5-v2/AODSIM? Shouldn't this ReqMgr2 request be in the rejected-archived status?

To conclude, related to:

To me, datasets that stay in PRODUCTION status while the wf is in announced or normal-archived are a CompOps issue. Like this one: DAS, CompOps. Whenever this is fixed by CompOps, McM will move forward (it may take a manual Stats refresh, but this we can add).

I agree again. I see the Unified status is away; @hassan11196, shouldn't it be in the closed status?

lmoureaux commented 1 day ago

Regarding data tiers, this is the logic I use to sanitize them in my database imports.