Closed tvami closed 6 months ago
Hi @ggonzr @lmoureaux I was wondering if you could have a look at this issue, this comes up quite often
After checking the Unified code, it seems it will force-complete all requests in a workflow whenever any request in the workflow is on the force-complete list. Thus force-completing some request whose output is kept should in principle be enough. Why do you need to force-complete the LHEGEN to begin with? Does McM somehow fail to notice that the root request sharing the same workflow was force-completed?
Exactly, I did not see any evidence that force-completing let's say the MiniAOD step would force-complete the whole workflow. If that was the case, that's all good. But it doesnt seem to work, I'll try to find an example and let you know
Ok so I force-completed the Nano step here: https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=EXO-Run3Summer22NanoAODv13-00006&page=0&shown=127 let's see if it will ever propagate to unified https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-Run3Summer22GS-00348
Unified saw the force-complete but has a bug:
Thanks @lmoureaux will you report this to O&C or should I?
It's not a critical error, it only affects email sending. You can see that the workflow is now completed in ReqMgr. Next question is whether McM will see it :sweat_smile:
It's not that McM doesnt see (maybe just yet) the full request to be force-completed, but the single step of the NANO is still insubmitted
status: https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=EXO-Run3Summer22NanoAODv13-00006&page=0&shown=127
Yes, that's what I meant. First Stats must notice the change, then McM must notice the change in Stats (this we can speed up by clicking the 🗘 Update button).
ok fingers crossed! but I think in the past this was the part that didnt work (and I never looked at Unified logs so concluded that it doesnt work --> made this issue in the first place)
Performed a manual update of Stats. From the script log:
[2024-04-23 21:35:38,913][INFO] Will update event count for 11361/15024
[2024-04-23 21:35:38,913][INFO] Updating event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426
[2024-04-23 21:35:38,927][INFO] POST request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/datasetlist with data
{
"dataset": [
"/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM",
"/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM",
"/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM"
],
"dataset_access_type": "*",
"detail": 1
}
took 0.009s
[2024-04-23 21:35:38,933][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM&validFileOnly=1 took 0.006s
[2024-04-23 21:35:38,933][INFO] Setting 100000 events, 262420717 size and PRODUCTION type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 21:35:38,940][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM&validFileOnly=1 took 0.007s
[2024-04-23 21:35:38,941][INFO] Setting 100000 events, 40499534558 size and PRODUCTION type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 21:35:38,947][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM&validFileOnly=1 took 0.006s
[2024-04-23 21:35:38,947][INFO] Setting 97963 events, 7765182295 size and PRODUCTION type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 21:35:38,947][INFO] Did not update event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 because it did not change. Time: 0.034s
I'm not sure I understand what
11361/15024
means...
But the rest of the log, i.e. that there was supposed to be 100000 events produced from which according to Dima webpage 97% was done seems all good!
And I still see McM in submit status
11361/15024
is the progress, the update script checked 15k requests. Stats still sees the dataset as PRODUCTION
, investigating...
I can see dataset_access_type = "VALID"
in WMAgent. This is what the update script uses and should've picked up. We see in the log that it got PRODUCTION
instead, maybe DBS just hadn't noticed yet. Let's give it some more time.
It's now VALID
in Stats. Update script log:
[2024-04-23 22:50:07,601][INFO] Updating event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426
[2024-04-23 22:50:07,611][INFO] POST request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/datasetlist with data
{
"dataset": [
"/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM",
"/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM",
"/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM"
],
"dataset_access_type": "*",
"detail": 1
}
took 0.007s
[2024-04-23 22:50:07,619][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM&validFileOnly=1 took 0.007s
[2024-04-23 22:50:07,619][INFO] Setting 100000 events, 262420717 size and VALID type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22NanoAODv13-133X_mcRun3_2022_realistic_ForNanov13_v1-v2/NANOAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 22:50:07,626][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM&validFileOnly=1 took 0.007s
[2024-04-23 22:50:07,626][INFO] Setting 100000 events, 40499534558 size and VALID type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v2/AODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 22:50:07,632][INFO] GET request to https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=/GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM&validFileOnly=1 took 0.006s
[2024-04-23 22:50:07,632][INFO] Setting 97963 events, 7765182295 size and VALID type for /GluGluSpin0To2G_W-0p014_M-2250_TuneCP5_13p6TeV_pythia8/Run3Summer22MiniAODv4-130X_mcRun3_2022_realistic_v5-v2/MINIAODSIM (cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426)
[2024-04-23 22:50:07,646][INFO] Updated event count for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 in 0.031s
[2024-04-23 22:50:07,646][INFO] Trigger outside for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 (StepChain)
[2024-04-23 22:50:07,646][INFO] Returning OAuth2 credentials for requesting a token
[2024-04-23 22:50:07,646][INFO] Triggering outside for cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426
[2024-04-23 22:50:07,646][ERROR] Exception while doing a GET to /mcm/restapi/requests/fetch_stats_by_wf/cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426: Remote end closed connection without response
[2024-04-23 22:50:08,646][INFO] Creating HTTPS connection for cms-pdmv-prod.web.cern.ch
[2024-04-23 22:50:08,985][INFO] GET request to https://cms-pdmv-prod.web.cern.ch/mcm/restapi/requests/fetch_stats_by_wf/cmsunified_task_EXO-Run3Summer22GS-00348__v1_T_240208_123725_4426 took 1.339s
The requests weren't moved to done
automatically but doing Inspect status
moved them. I think we run Inspect status
automatically from time to time (@ggonzr can you confirm?).
I think there are two items for this issue:
fetch_stats_by_wf
worksforce complete
when the output is not kept to hint at the solutionbut doing
Inspect status
moved them
is this something that a generator_contact
can do?
Hi @lmoureaux, @tvami, indeed, this process is automatically performed by a batch job every 12h by this script. Only users with production_manager
permissions or higher can execute it
Opened #1135 with the improved error message.
but doing
Inspect status
moved themis this something that a
generator_contact
can do?
Looks like anyone can do it (it's in no way a destructive action so no reason to restrict it).
Thanks @lmoureaux !
Is your feature related to a problem?
For example if I want to force complete the root request https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=EXO-RunIISummer20UL18wmLHEGEN-04185&page=0&shown=127
I get the following msg "Request is below 50 percent completion"
It is not actually true, going to https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIISummer20UL18wmLHEGEN-04185 we are at like 92%. I understand this is because McM is checking the output of the GEN campaign and since the GEN is not stored we get the "Request is below 50 percent completion" msg.
Describe the solution you'd like
(1) Not sure how complicated it would be to reroute this check so it looks at the stats in the MiniAOD...
(2) For now I'd be happy with the following solution:
production_manager
to force complete below 50%--> currently only Sunil can do this step and allowing the
production_manager
-s to do it will make Sunil's life easier.Current behavior
Pinging Sunil to force-complete such requests
Expected behavior
Pinging
production_manager
-s to do it. Or if the solution (1) is implemented, no need to ping anyone@ggonzr @sunilUIET @DickyChant