cms-PdmV / cmsPdmV

CERN CMS McM repository
4 stars 10 forks source link

Allow `production_manager` to force complete below 50% #1114

Closed tvami closed 4 months ago

tvami commented 10 months ago

Is your feature related to a problem?

For example if I want to force complete the root request https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=EXO-RunIISummer20UL18wmLHEGEN-04185&page=0&shown=127

I get the following msg "Request is below 50 percent completion"

Screenshot 2023-12-11 at 09 05 32

It is not actually true, going to https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIISummer20UL18wmLHEGEN-04185 we are at like 92%. I understand this is because McM is checking the output of the GEN campaign and since the GEN is not stored we get the "Request is below 50 percent completion" msg.

Describe the solution you'd like

(1) Not sure how complicated it would be to reroute this check so it looks at the stats in the MiniAOD...

(2) For now I'd be happy with the following solution:

--> currently only Sunil can do this step and allowing the production_manager-s to do it will make Sunil's life easier.

Current behavior

Pinging Sunil to force-complete such requests

Expected behavior

Pinging production_manager-s to do it. Or if the solution (1) is implemented, no need to ping anyone

@ggonzr @sunilUIET @DickyChant

tvami commented 5 months ago

Hi @ggonzr @lmoureaux I was wondering if you could have a look at this issue, this comes up quite often

lmoureaux commented 5 months ago

After checking the Unified code, it seems it will force-complete all requests in a workflow whenever any request in the workflow is on the force-complete list. Thus force-completing some request whose output is kept should in principle be enough. Why do you need to force-complete the LHEGEN to begin with? Does McM somehow fail to notice that the root request sharing the same workflow was force-completed?

tvami commented 5 months ago

Exactly, I did not see any evidence that force-completing let's say the MiniAOD step would force-complete the whole workflow. If that was the case, that's all good. But it doesnt seem to work, I'll try to find an example and let you know

tvami commented 5 months ago

Ok so I force-completed the Nano step here: https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=EXO-Run3Summer22NanoAODv13-00006&page=0&shown=127 let's see if it will ever propagate to unified https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-Run3Summer22GS-00348

lmoureaux commented 5 months ago

Unified saw the force-complete but has a bug: image

tvami commented 5 months ago

Thanks @lmoureaux will you report this to O&C or should I?

lmoureaux commented 5 months ago

It's not a critical error, it only affects email sending. You can see that the workflow is now completed in ReqMgr. Next question is whether McM will see it :sweat_smile:

tvami commented 5 months ago

It's not that McM doesnt see (maybe just yet) the full request to be force-completed, but the single step of the NANO is still insubmitted status: https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=EXO-Run3Summer22NanoAODv13-00006&page=0&shown=127

lmoureaux commented 5 months ago

Yes, that's what I meant. First Stats must notice the change, then McM must notice the change in Stats (this we can speed up by clicking the 🗘 Update button).

tvami commented 5 months ago

ok fingers crossed! but I think in the past this was the part that didnt work (and I never looked at Unified logs so concluded that it doesnt work --> made this issue in the first place)

lmoureaux commented 5 months ago

Performed a manual update of Stats. From the script log:

[2024-04-23 21:35:38,913][INFO] Will update event count for 11361/15024
[2024-04-23 21:35:38,913][INFO] Updating event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426
[2024-04-23 21:35:38,927][INFO] POST request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/datasetlist with data 
{
  "dataset": [
    "/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM",
    "/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM",
    "/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM"
  ],
  "dataset_access_type": "*",
  "detail": 1
}
 took 0.009s
[2024-04-23 21:35:38,933][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM&validFileOnly=1 took 0.006s
[2024-04-23 21:35:38,933][INFO] Setting 100000 events, 262420717 size and PRODUCTION type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 21:35:38,940][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM&validFileOnly=1 took 0.007s
[2024-04-23 21:35:38,941][INFO] Setting 100000 events, 40499534558 size and PRODUCTION type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 21:35:38,947][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM&validFileOnly=1 took 0.006s
[2024-04-23 21:35:38,947][INFO] Setting 97963 events, 7765182295 size and PRODUCTION type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 21:35:38,947][INFO] Did not update event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 because it did not change. Time: 0.034s
tvami commented 5 months ago

I'm not sure I understand what

11361/15024

means...

But the rest of the log, i.e. that there was supposed to be 100000 events produced from which according to Dima webpage 97% was done seems all good!

And I still see McM in submit status

lmoureaux commented 5 months ago

11361/15024 is the progress, the update script checked 15k requests. Stats still sees the dataset as PRODUCTION, investigating...

lmoureaux commented 5 months ago

I can see dataset_access_type = "VALID" in WMAgent. This is what the update script uses and should've picked up. We see in the log that it got PRODUCTION instead, maybe DBS just hadn't noticed yet. Let's give it some more time.

lmoureaux commented 5 months ago

It's now VALID in Stats. Update script log:

[2024-04-23 22:50:07,601][INFO] Updating event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426
[2024-04-23 22:50:07,611][INFO] POST request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/datasetlist with data 
{
  "dataset": [
    "/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM",
    "/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM",
    "/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM"
  ],
  "dataset_access_type": "*",
  "detail": 1
}
 took 0.007s
[2024-04-23 22:50:07,619][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM&validFileOnly=1 took 0.007s
[2024-04-23 22:50:07,619][INFO] Setting 100000 events, 262420717 size and VALID type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 22:50:07,626][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM&validFileOnly=1 took 0.007s
[2024-04-23 22:50:07,626][INFO] Setting 100000 events, 40499534558 size and VALID type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 22:50:07,632][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM&validFileOnly=1 took 0.006s
[2024-04-23 22:50:07,632][INFO] Setting 97963 events, 7765182295 size and VALID type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 22:50:07,646][INFO] Updated event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 in 0.031s
[2024-04-23 22:50:07,646][INFO] Trigger outside for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 (StepChain)
[2024-04-23 22:50:07,646][INFO] Returning OAuth2 credentials for requesting a token
[2024-04-23 22:50:07,646][INFO] Triggering outside for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426
[2024-04-23 22:50:07,646][ERROR] Exception while doing a GET to /mcm/restapi/requests/fetch_stats_by_wf/cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426: Remote end closed connection without response
[2024-04-23 22:50:08,646][INFO] Creating HTTPS connection for cms-pdmv-prod.web.cern.ch
[2024-04-23 22:50:08,985][INFO] GET request to https://cms-pdmv-prod.web.cern.ch/mcm/restapi/requests/fetch_stats_by_wf/cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 took 1.339s

The requests weren't moved to done automatically but doing Inspect status moved them. I think we run Inspect status automatically from time to time (@ggonzr can you confirm?).

I think there are two items for this issue:

tvami commented 5 months ago

but doing Inspect status moved them

is this something that a generator_contact can do?

ggonzr commented 5 months ago

Hi @lmoureaux, @tvami, indeed, this process is automatically performed by a batch job every 12h by this script. Only users with production_manager permissions or higher can execute it

lmoureaux commented 5 months ago

Opened #1135 with the improved error message.

but doing Inspect status moved them

is this something that a generator_contact can do?

Looks like anyone can do it (it's in no way a destructive action so no reason to restrict it).

tvami commented 5 months ago

Thanks @lmoureaux !