Open jang00777 opened 2 years ago
A new workflow affected by this issue: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer20UL16NanoAODAPVv9-06836
It would be good to finally get these workflows out of the system. They haven't been touched since June
Workflows with this issue for now : https://its.cern.ch/jira/browse/CMSCOMPPR-31118
Thanks @jang00777 , indeed the impact of the issue is ramping up. This link dynamically shows the list of affected workflows: https://its.cern.ch/jira/issues/?jql=labels%20%3D%20TotalInputEventsMissing
As the issue is growing, we need to figure out
Just for logging purposes.
After going back 3 months of GlobalWorkQueue logs (`reqmanagerInteractionTask-wrokqueue-*`), here is the only type of error I find regarding those workflows:
2022-10-12 12:49:44,477:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/spec"
2022-10-12 12:49:44,713:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
2022-10-12 12:49:44,791:INFO:WorkQueue:Splitting /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/HIG-RunIISummer20UL16wmLHEGENAPV-07316_0 with policy name MonteCarlo and policy params {'name': 'MonteCarlo', 'args': {}}
2022-10-12 12:49:44,841:INFO:Rucio:Container: /Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX with container-based location at: {'T2_CH_CERN', 'T1_US_FNAL_Disk'}
2022-10-12 12:49:44,843:INFO:WorkQueue:Work splitting completed with 1 units, 0 rejectedWork and 0 badWork
2022-10-12 12:49:44,843:INFO:WorkQueue:Queuing element 035a98b66e90dfdcf71826d001a12ad2 for /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/HIG-RunIISummer20UL16wmLHEGENAPV-07316_0 with policy MonteCarlo, with 217 job(s) and 217 lumis on events 1-216021
2022-10-12 12:49:45,182:ERROR:WorkQueue:Exception splitting wqe cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646 for cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646: url=https://cmsweb.cern.ch:8443/reqmgr2/data/request/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646, code=403, reason=Forbidden, headers={'Date': 'Wed, 12 Oct 2022 10:49:45 GMT', 'Server': 'Apache', 'Content-Type': 'text/html;charset=utf-8', 'Content-Length': '750', 'X-Rest-Status': '200', 'X-Error-Http': '403', 'X-Error-Id': 'dfc9450c0d1a0103d975e1c4382f469c', 'X-Error-Detail': 'You are not allowed to access this resource.', 'X-Rest-Time': '1700.640 us', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=9358 t=1665571785172282'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>403 Forbidden</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>403 Forbidden</h2>\n <p>You are not allowed to access this resource.</p>\n <pre id="traceback"></pre>\n <div id="powered_by">\n <span>\n Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n </span>\n </div>\n </body>\n</html>\n'
Traceback (most recent call last):
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1150, in processInboundWork
self.reqmgrSvc.updateRequestStats(inbound['WMSpec'].name(), totalStats)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 206, in updateRequestStats
self.updateRequestProperty(request, stats)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 223, in updateRequestProperty
return self["requests"].put('request/%s' % request, propDict)[0]['result']
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 150, in put
return self.makeRequest(uri, data, 'PUT', incoming_headers,
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 173, in makeRequest
result, response = self.makeRequest_pycurl(uri, data, verb, headers)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 190, in makeRequest_pycurl
response, result = self.reqmgr.request(uri, data, headers, verb=verb,
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
return callFunc(callObj, newUrl, *args, **kwargs)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 351, in request
raise exc
http.client.HTTPException: url=https://cmsweb.cern.ch:8443/reqmgr2/data/request/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646, code=403, reason=Forbidden, headers={'Date': 'Wed, 12 Oct 2022 10:49:45 GMT', 'Server': 'Apache', 'Content-Type': 'text/html;charset=utf-8', 'Content-Length': '750', 'X-Rest-Status': '200', 'X-Error-Http': '403', 'X-Error-Id': 'dfc9450c0d1a0103d975e1c4382f469c', 'X-Error-Detail': 'You are not allowed to access this resource.', 'X-Rest-Time': '1700.640 us', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=9358 t=1665571785172282'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>403 Forbidden</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>403 Forbidden</h2>\n <p>You are not allowed to access this resource.</p>\n <pre id="traceback"></pre>\n <div id="powered_by">\n <span>\n Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n </span>\n </div>\n </body>\n</html>\n'
2022-10-12 12:49:45,228:ERROR:WorkQueueReqMgrInterface:Unknown error processing cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646
Traceback (most recent call last):
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueueReqMgrInterface.py", line 108, in queueNewRequests
units = queue.queueWork(workLoadUrl, request=reqName, team=team)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 641, in queueWork
work = self.processInboundWork(inbound, throw=True)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1150, in processInboundWork
self.reqmgrSvc.updateRequestStats(inbound['WMSpec'].name(), totalStats)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 206, in updateRequestStats
self.updateRequestProperty(request, stats)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 223, in updateRequestProperty
return self["requests"].put('request/%s' % request, propDict)[0]['result']
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 150, in put
return self.makeRequest(uri, data, 'PUT', incoming_headers,
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 173, in makeRequest
result, response = self.makeRequest_pycurl(uri, data, verb, headers)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 190, in makeRequest_pycurl
response, result = self.reqmgr.request(uri, data, headers, verb=verb,
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
return callFunc(callObj, newUrl, *args, **kwargs)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 351, in request
raise exc
http.client.HTTPException: url=https://cmsweb.cern.ch:8443/reqmgr2/data/request/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646, code=403, reason=Forbidden, headers={'Date': 'Wed, 12 Oct 2022 10:49:45 GMT', 'Server': 'Apache', 'Content-Type': 'text/html;charset=utf-8', 'Content-Length': '750', 'X-Rest-Status': '200', 'X-Error-Http': '403', 'X-Error-Id': 'dfc9450c0d1a0103d975e1c4382f469c', 'X-Error-Detail': 'You are not allowed to access this resource.', 'X-Rest-Time': '1700.640 us', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=9358 t=1665571785172282'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>403 Forbidden</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>403 Forbidden</h2>\n <p>You are not allowed to access this resource.</p>\n <pre id="traceback"></pre>\n <div id="powered_by">\n <span>\n Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n </span>\n </div>\n </body>\n</html>\n'
And later, the splitting has been resumed:
...
2022-10-12 12:55:06,814:INFO:WorkQueue:Workflow cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646 has no OpenRunningTimeout. Queuing to be closed.
...
2022-10-12 12:55:41,209:INFO:WorkQueueReqMgrInterface:Processing request cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646 at https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/spec
2022-10-12 12:55:41,209:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/spec"
2022-10-12 12:55:41,288:INFO:WorkQueue:Resume splitting of "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646"
2022-10-12 12:55:41,288:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
2022-10-12 12:55:41,311:INFO:WorkQueue:Request "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646" already split - Resuming
2022-10-12 12:55:41,312:INFO:WorkQueue:Split work for request(s): "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646"
2022-10-12 12:55:41,333:INFO:WorkQueueReqMgrInterface:1 units(s) queued for "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646"
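For anyone repeating this log search, here is a minimal sketch of how the failing split attempts could be picked out of the log files. It is not part of WMCore: the function name `find_forbidden_splits` and the regular expression are assumptions, written only against the single ERROR line format quoted above, and the sample lines use a made-up workflow name and URL.

```python
import re

# Pattern derived (as an assumption) from the ERROR line quoted above:
# "<timestamp>:ERROR:WorkQueue:Exception splitting wqe <wf> for <wf>: url=<url>, code=<code>, ..."
SPLIT_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}):ERROR:WorkQueue:"
    r"Exception splitting wqe (?P<wf>\S+) for \S+: url=\S+, code=(?P<code>\d{3}),"
)


def find_forbidden_splits(lines, code="403"):
    """Return {workflow name: [timestamps]} for split failures with the given HTTP code."""
    hits = {}
    for line in lines:
        match = SPLIT_ERROR.match(line)
        if match and match.group("code") == code:
            hits.setdefault(match.group("wf"), []).append(match.group("ts"))
    return hits


# Made-up sample lines in the same shape as the real log.
sample = [
    "2022-10-12 12:49:44,843:INFO:WorkQueue:Work splitting completed with 1 units, 0 rejectedWork and 0 badWork",
    "2022-10-12 12:49:45,182:ERROR:WorkQueue:Exception splitting wqe wf_A for wf_A: "
    "url=https://example.invalid/reqmgr2/data/request/wf_A, code=403, reason=Forbidden",
]
found = find_forbidden_splits(sample)
print(found)
```

In practice the `lines` argument would be an open log file, so several months of logs can be streamed without loading them into memory.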
And those calls that are failing are exactly related to updating the documents in `couchDB` through the `Reqmgr2` interface and adding the splitting information to it:
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 150, in put
return self.makeRequest(uri, data, 'PUT', incoming_headers,
And looking closer it proves to be exactly as @amaltaro mentioned could be the case. There are two separate steps that are happening (and in this case failing):
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1150, in processInboundWork
self.reqmgrSvc.updateRequestStats(inbound['WMSpec'].name(), totalStats)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 206, in updateRequestStats
self.updateRequestProperty(request, stats)
File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 223, in updateRequestProperty
return self["requests"].put('request/%s' % request, propDict)[0]['result']
And this happens only for workflows (or elements, I am still not sure to which exactly this flag relates) with the `continuous` flag set to `False`: `if not continuous: ...`, which is also visible from the logs:
2022-10-12 12:49:44,713:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
This leads to transitioning the workflow without updating its statistics from the GlobalWorkQueue. And since GWQ does not reiterate through workflows in `acquired` state, the end result is the missing workflow parameters.
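To make that sequence concrete, here is a toy model of one possible reading of the logs above: the first cycle stores the split elements and then fails on the statistics update, and the next cycle resumes without repeating the update. The class `ToyQueue`, the `Forbidden` exception, the control flow and the event count are all assumptions for illustration; none of this is WMCore code.

```python
class Forbidden(Exception):
    """Stand-in for the HTTP 403 raised by the statistics update call."""


class ToyQueue:
    """Toy model only: names and control flow are assumptions, not WMCore code."""

    def __init__(self, update_stats):
        self.update_stats = update_stats  # callable(request_name, stats_dict)
        self.already_split = set()        # requests whose elements were already stored

    def queue_work(self, name, stats, continuous=False):
        if name in self.already_split:
            # Mirrors the "already split - Resuming" log line: nothing is recomputed.
            return 1
        self.already_split.add(name)        # elements are stored before the update
        if not continuous:
            self.update_stats(name, stats)  # the PUT to ReqMgr2, which may raise
        return 1


reqmgr_doc = {}   # what ends up in the request document
calls = {"n": 0}  # number of update attempts


def flaky_update(name, stats):
    """Fails on the first call, the way the 403 did, and succeeds afterwards."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise Forbidden("403 Forbidden")
    reqmgr_doc.update(stats)


queue = ToyQueue(flaky_update)
try:
    queue.queue_work("wf_A", {"TotalInputEvents": 1000})  # made-up number
except Forbidden:
    pass  # first cycle fails after the elements were stored

units = queue.queue_work("wf_A", {"TotalInputEvents": 1000})  # next cycle resumes
print(units, reqmgr_doc, calls["n"])
```

In this model the second cycle reports the unit as queued, yet the update was attempted only once and the request document stays empty, which is the symptom described in this issue.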
Now we can try two approaches here:
Even though this is now better understood, we also need to find out why we had those `code=403, reason=Forbidden, 'X-Error-Detail': 'You are not allowed to access this resource.'` errors in the first place. @haozturk do you remember noticing something in that regard in the past?
@todor-ivanov sorry for not leaving some ideas on how to tackle this. This is what I would do for the short-term solution:
Once you cover these, we can touch base again on how to deal with the other workflows:
Please let me know if you have any question; and please let's review this script before running it for real.
Thanks for the hint @amaltaro.
But those workflows that are suffering from the issue seem to be missing any `InputDataSet`. Here I started working on a script to be used for the manual fix of the already broken workflows, and here [1] is the full list of workflows found together with their type and input data. So we are not able to use any information from DBS, and we should maybe think of the pieces from GW that need to be rerun in order to get the proper results for those missing parameters.
[1] badWfList.txt
Okay, then it will be even easier because we are now certain that these do not require any input sub-set (runs, blocks, lumis).
I had to check a couple of workflows to make sure that we perform the correct calculation, which has to use the following formula:
totalEvents = int(RequestNumEvents / FilterEfficiency)
totalLumis = math.ceil(totalEvents / EventsPerJob) # we must round it up!
The final dictionary that we will have to post to ReqMgr2 will then be:
{"TotalInputEvents": totalEvents, # must be integer
"TotalInputLumis": totalLumis, # must be integer
"TotalEstimatedJobs": 1, # let us just hard-code it to 1. Hasan doesn't care anyways
"TotalInputFiles": 0} # and this we keep hard-coded to 0, which is about right for MC anyways
Thanks for the hint @amaltaro. I found two more things while implementing your suggestion:
* `TotalInputLumis` should be calculated as the ratio:
totalLumis = math.ceil(totalEvents / EventsPerLumis)
Here is an example:
{'EventsPerJob': 3000,
'EventsPerLumi': 1000,
'FilterEfficiency': 1,
'InputDataset': None,
'RequestName': 'cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-05963__v1_T_220707_152315_1096',
'RequestNumEvents': 400000,
'RequestType': 'StepChain',
'SplittingAlgo': 'EventBased',
'SubRequestType': 'ReDigi',
'TotalEstimatedJobs': 134,
'TotalInputEvents': 400000,
'TotalInputFiles': 0,
'TotalInputLumis': 400}
* There are also two workflows that are actually TaskChains with `SplittingAlgo: EventAwareLumiBased`, and these do have an `InputDataset`. So it seems that for those I'll have to implement the queries to DBS after all. Here they are:
2022-11-21 13:53:18,151:INFO:fetchFromReqmgr: Found a workflow with NonEventBased Splitting algorithm:
2022-11-21 13:53:18,151:INFO:fetchFromReqmgr: {'EventsPerJob': 66976, 'EventsPerLumi': 66976, 'FilterEfficiency': 1, 'InputDataset': '/TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM', 'RequestName': 'pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319', 'RequestNumEvents': None, 'RequestType': 'TaskChain', 'SplittingAlgo': 'EventAwareLumiBased', 'SubRequestType': 'ReDigi', 'TotalEstimatedJobs': None, 'TotalInputEvents': None, 'TotalInputFiles': None, 'TotalInputLumis': None}
...
2022-11-21 13:53:18,158:INFO:fetchFromReqmgr: Found a workflow with NonEventBased Splitting algorithm:
2022-11-21 13:53:18,158:INFO:fetchFromReqmgr: {'EventsPerJob': 68899, 'EventsPerLumi': 68899, 'FilterEfficiency': 1, 'InputDataset': '/NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-1000_MY-80_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM', 'RequestName': 'pdmvserv_task_HIG-RunIISummer20UL16NanoAODAPVv9-06906__v1_T_221008_072906_6717', 'RequestNumEvents': None, 'RequestType': 'TaskChain', 'SplittingAlgo': 'EventAwareLumiBased', 'SubRequestType': 'ReDigi', 'TotalEstimatedJobs': None, 'TotalInputEvents': None, 'TotalInputFiles': None, 'TotalInputLumis': None}
Thank you for this correction, Todor. This:
totalEvents = int(RequestNumEvents / FilterEfficiency)
totalLumis = math.ceil(totalEvents / EventsPerLumis)
is indeed the correct formula.
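As a worked check of the agreed formula, here is a small sketch that builds the dictionary to be posted for a workflow without input dataset. The function name `estimate_mc_stats` and its argument names are made up for illustration; the two hard-coded values follow the proposal earlier in this thread, and the numbers are taken from the example request quoted above.

```python
import math


def estimate_mc_stats(request_num_events, filter_efficiency, events_per_lumi):
    """Estimate the missing statistics for a workflow that has no input dataset."""
    total_events = int(request_num_events / filter_efficiency)
    total_lumis = math.ceil(total_events / events_per_lumi)  # must be rounded up
    return {
        "TotalInputEvents": total_events,  # integer
        "TotalInputLumis": total_lumis,    # integer
        "TotalEstimatedJobs": 1,           # hard-coded, as proposed earlier in the thread
        "TotalInputFiles": 0,              # hard-coded, no input files for these workflows
    }


# Values from the example request: RequestNumEvents=400000, FilterEfficiency=1, EventsPerLumi=1000
stats = estimate_mc_stats(400000, 1, 1000)
print(stats)
```

For the example request this gives 400000 events and 400 lumis, the same values as in its `TotalInputEvents` and `TotalInputLumis` fields. The 134 reported there for `TotalEstimatedJobs` equals `math.ceil(400000 / 3000)`, whereas the sketch keeps the hard-coded 1.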
hi @amaltaro ,
Please take a look at the latest commit in https://github.com/dmwm/WMCore/pull/11366/ where I implemented the actual estimation of the missing statistics for workflows with `EventAwareLumiBased` splitting and `InputDataset`.
And here are the results for the two workflows I was mentioning before:
2022-11-22 14:46:28,454:INFO:fetchFromReqmgr: Found a workflow with EventAwareLumiBased Splitting algorithm:
2022-11-22 14:46:28,608:INFO:fetchFromReqmgr:
{'BlockList': None,
'EventsPerJob': 66976,
'EventsPerLumi': 66976,
'FilterEfficiency': 1,
'InputDataset': '/TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM',
'LumiList': {},
'RequestName': 'pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319',
'RequestNumEvents': None,
'RequestType': 'TaskChain',
'RunList': None,
'SplittingAlgo': 'EventAwareLumiBased',
'SubRequestType': 'ReDigi',
'TotalEstimatedJobs': 992,
'TotalInputEvents': 397395,
'TotalInputFiles': 23,
'TotalInputLumis': 401}
2022-11-22 14:46:28,609:INFO:fetchFromReqmgr: Found a workflow with EventAwareLumiBased Splitting algorithm:
2022-11-22 14:46:28,854:INFO:fetchFromReqmgr:
{'BlockList': None,
'EventsPerJob': 68899,
'EventsPerLumi': 68899,
'FilterEfficiency': 1,
'InputDataset': '/NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-1000_MY-80_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM',
'LumiList': {},
'RequestName': 'pdmvserv_task_HIG-RunIISummer20UL16NanoAODAPVv9-06906__v1_T_221008_072906_6717',
'RequestNumEvents': None,
'RequestType': 'TaskChain',
'RunList': None,
'SplittingAlgo': 'EventAwareLumiBased',
'SubRequestType': 'ReDigi',
'TotalEstimatedJobs': 928,
'TotalInputEvents': 215995,
'TotalInputFiles': 31,
'TotalInputLumis': 233}
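For readers who do not want to open the PR, the aggregation step for workflows with an input dataset could look roughly like the sketch below. This is not the code from the PR: the function name `sum_input_stats` and the per-block record shape (`block_name`, `num_event`, `num_lumi`, `num_file`) are assumptions, the numbers are made up, and the job estimate is deliberately left out.

```python
def sum_input_stats(block_summaries, block_list=None):
    """Sum per-block counts into the three TotalInput* statistics.

    block_summaries: iterable of dicts in a hypothetical shape, one per block.
    block_list: optional collection of block names to restrict to; None means
    the whole dataset, as for the two workflows above (BlockList is None).
    """
    totals = {"TotalInputEvents": 0, "TotalInputLumis": 0, "TotalInputFiles": 0}
    for summary in block_summaries:
        if block_list is not None and summary["block_name"] not in block_list:
            continue
        totals["TotalInputEvents"] += summary["num_event"]
        # Plain summing assumes that no lumi section is shared between blocks.
        totals["TotalInputLumis"] += summary["num_lumi"]
        totals["TotalInputFiles"] += summary["num_file"]
    return totals


# Made-up per-block numbers, only to show the call.
blocks = [
    {"block_name": "/A/B/C#1", "num_event": 1000, "num_lumi": 2, "num_file": 1},
    {"block_name": "/A/B/C#2", "num_event": 2500, "num_lumi": 3, "num_file": 2},
]
totals = sum_input_stats(blocks)
print(totals)
```

A run or lumi mask would need a finer-grained query than per-block sums, which is not attempted here.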
@todor-ivanov the calculation is looking okay and I left some comments in your PR.
However, I would like to ask you to stick to the KISS principle. I see tons of over-complication and things being done in a likely unnecessary manner. For instance:
* what is the `filter` input argument? Do we actually need it to fix these workflows?
* what is the `mask` input argument? Do we actually need it to fix these workflows?

Thanks for the input @amaltaro
About:
what is the filter input argument? Do we actually need it to fix these workflows? what is the mask input argument? Do we actually need it to fix these workflows?
I just started the script with a different idea and approach on how to execute it, but then I switched to a more generic construction, with the WMCore services instantiated directly in the script, which makes it possible to use them through an interactive shell. Anyway, those arguments just became obsolete, but I did not bother to remove them because this script is so far not supposed to be merged. If you want, I can remove those. The rest of your comments were addressed in my latest commit.
Hi @amaltaro, I finally did the last bit of it, and I am now able to run the script in `dryRun` mode to go through the system and find all workflows which need to be fixed. Here is a log from such a run: fixMissingStats.log.
@haozturk if you could take a quick look and confirm that this is all that Unified thinks needs to be fixed, I am going to upload the workflows' statistics in a single push. But that will be a one-time action, so once we upload the numbers it will be much harder to find and fix them in case we have done something wrong. That said, I'd suggest checking a workflow or two just to see if things look reasonable. Thanks in advance!
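The dry-run idea can be sketched as below: workflows with any missing statistic are collected together with the values that would be uploaded, and the update call is made only when `dry_run` is switched off. This is an illustration under assumptions, not the actual script; `fix_missing_stats`, `estimate` and `update` are hypothetical names, and the workflow records are made up.

```python
NEEDED = ("TotalInputEvents", "TotalInputLumis", "TotalInputFiles")


def fix_missing_stats(workflows, estimate, update, dry_run=True):
    """Return the planned (request name, stats) pairs; upload only if not dry_run.

    workflows: iterable of request dicts
    estimate:  callable(workflow dict) -> stats dict   (hypothetical helper)
    update:    callable(request name, stats dict)      (hypothetical upload call)
    """
    planned = []
    for wf in workflows:
        if all(wf.get(key) is not None for key in NEEDED):
            continue  # nothing is missing for this workflow
        stats = estimate(wf)
        planned.append((wf["RequestName"], stats))
        if not dry_run:
            update(wf["RequestName"], stats)
    return planned


# Made-up records, only to show the two modes.
workflows = [
    {"RequestName": "wf_ok", "TotalInputEvents": 10, "TotalInputLumis": 1, "TotalInputFiles": 0},
    {"RequestName": "wf_broken", "TotalInputEvents": None, "TotalInputLumis": None, "TotalInputFiles": None},
]
uploaded = []
fake_estimate = lambda wf: {"TotalInputEvents": 10, "TotalInputLumis": 1, "TotalInputFiles": 0}
fake_update = lambda name, stats: uploaded.append(name)

plan = fix_missing_stats(workflows, fake_estimate, fake_update, dry_run=True)
print(plan, uploaded)
```

Reviewing the returned plan before rerunning with `dry_run=False` is what allows a workflow or two to be cross-checked by hand ahead of the one-time upload.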
@todor-ivanov thanks Todor. I looked into a couple of stats and they look good to me. Your update call seems to have collected the correct properties as well. From my side, you can go ahead and fix those stats.
Thanks @amaltaro, I just did so!
I confirm that we don't see workflows w/ missing TotalInput[Events, Lumis, Files] anymore. Thanks a lot @todor-ivanov @amaltaro !
Hi @amaltaro,
I found this old issue for the missing params problem, and I am attaching the recently affected workflows here.
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_B2G-Run3Summer22EEwmLHEGS-03985
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_B2G-Run3Summer22EEwmLHEGS-03740
@hassan11196 Hi Ahmed, thank you for finding this one out and reviving it with fresh cases. Throughout the week, someone will go through the logs and try to collect more insight.
Hi @amaltaro, here are a few more affected workflows, grouped by Unified status.
already posted in the above comment.
Thanks
Impact of the bug
Workflows (even those fulfilling the number of events requested) aren't complete

Describe the bug
task_HIG-RunIISummer20UL17wmLHEGEN-03631
task_HIG-RunIISummer20UL16wmLHEGENAPV-03387
task_HIG-RunIISummer20UL16wmLHEGENAPV-03384
When checking these 3 WFs, the numbers of events requested are 8M, 5M and 270K respectively, but there is no TotalInputEvents on their ReqMgr2 page
task_HIG-RunIISummer20UL16wmLHEGENAPV-05608
One more example in which the production is complete but the workflow is not moving on to the next status due to the issue

How to reproduce it
You can check the logs here and the JIRA tickets for WFs having this issue here

Expected behavior
The parameter should be seen
@todor-ivanov Can you please have a look?