dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Missing TotalInput[Events, Lumis, Files] params in production workflows #11183

Open jang00777 opened 2 years ago

jang00777 commented 2 years ago

Impact of the bug
Workflows (even those fulfilling the number of events requested) don't complete.

Describe the bug
task_HIG-RunIISummer20UL17wmLHEGEN-03631

task_HIG-RunIISummer20UL16wmLHEGENAPV-03387

task_HIG-RunIISummer20UL16wmLHEGENAPV-03384

When checking these 3 WFs, the numbers of events requested are 8M, 5M and 270K respectively, but there is no TotalInputEvents on their ReqMgr2 pages.

task_HIG-RunIISummer20UL16wmLHEGENAPV-05608

One more example in which the production is complete but the workflow is not moving on to the next status due to this issue.

How to reproduce it
You can check the logs here and the JIRA tickets for WFs having this issue here.

Expected behavior
The parameter should be visible.

@todor-ivanov Can you please have a look?

haozturk commented 1 year ago

A new workflow affected by this issue: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer20UL16NanoAODAPVv9-06836

jenimal commented 1 year ago

It would be good to finally get these workflows out of the system. They haven't been touched since June

jang00777 commented 1 year ago

Workflows with this issue for now : https://its.cern.ch/jira/browse/CMSCOMPPR-31118

haozturk commented 1 year ago

Thanks @jang00777 , indeed the impact of the issue is ramping up. This link dynamically shows the list of affected workflows: https://its.cern.ch/jira/issues/?jql=labels%20%3D%20TotalInputEventsMissing

haozturk commented 1 year ago

As the issue is growing, we need to figure out:

  1. How to handle the affected workflows? We have no means to evaluate their output completion w/o these params. Any ideas?
  2. What's the root cause?

todor-ivanov commented 1 year ago

Just for logging purposes. After going back 3 months of GlobalWorkQueue logs (reqmanagerInteractionTask-wrokqueue-*) here is the only type of error I find regarding those workflows:

2022-10-12 12:49:44,477:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/spec"
2022-10-12 12:49:44,713:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
2022-10-12 12:49:44,791:INFO:WorkQueue:Splitting /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/HIG-RunIISummer20UL16wmLHEGENAPV-07316_0 with policy name MonteCarlo and policy params {'name': 'MonteCarlo', 'args': {}}
2022-10-12 12:49:44,841:INFO:Rucio:Container: /Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX with container-based location at: {'T2_CH_CERN', 'T1_US_FNAL_Disk'}
2022-10-12 12:49:44,843:INFO:WorkQueue:Work splitting completed with 1 units, 0 rejectedWork and 0 badWork
2022-10-12 12:49:44,843:INFO:WorkQueue:Queuing element 035a98b66e90dfdcf71826d001a12ad2 for /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/HIG-RunIISummer20UL16wmLHEGENAPV-07316_0 with policy MonteCarlo, with 217 job(s) and 217 lumis on events 1-216021
2022-10-12 12:49:45,182:ERROR:WorkQueue:Exception splitting wqe cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646 for cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646: url=https://cmsweb.cern.ch:8443/reqmgr2/data/request/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646, code=403, reason=Forbidden, headers={'Date': 'Wed, 12 Oct 2022 10:49:45 GMT', 'Server': 'Apache', 'Content-Type': 'text/html;charset=utf-8', 'Content-Length': '750', 'X-Rest-Status': '200', 'X-Error-Http': '403', 'X-Error-Id': 'dfc9450c0d1a0103d975e1c4382f469c', 'X-Error-Detail': 'You are not allowed to access this resource.', 'X-Rest-Time': '1700.640 us', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=9358 t=1665571785172282'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>403 Forbidden</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>403 Forbidden</h2>\n        <p>You are not allowed to access this resource.</p>\n        <pre id="traceback"></pre>\n    <div id="powered_by">\n      <span>\n        Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n      </span>\n    </div>\n    </body>\n</html>\n'
Traceback (most recent call last):
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1150, in processInboundWork
    self.reqmgrSvc.updateRequestStats(inbound['WMSpec'].name(), totalStats)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 206, in updateRequestStats
    self.updateRequestProperty(request, stats)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 223, in updateRequestProperty
    return self["requests"].put('request/%s' % request, propDict)[0]['result']
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 150, in put
    return self.makeRequest(uri, data, 'PUT', incoming_headers,
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 173, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 190, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
    return callFunc(callObj, newUrl, *args, **kwargs)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 351, in request
    raise exc
http.client.HTTPException: url=https://cmsweb.cern.ch:8443/reqmgr2/data/request/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646, code=403, reason=Forbidden, headers={'Date': 'Wed, 12 Oct 2022 10:49:45 GMT', 'Server': 'Apache', 'Content-Type': 'text/html;charset=utf-8', 'Content-Length': '750', 'X-Rest-Status': '200', 'X-Error-Http': '403', 'X-Error-Id': 'dfc9450c0d1a0103d975e1c4382f469c', 'X-Error-Detail': 'You are not allowed to access this resource.', 'X-Rest-Time': '1700.640 us', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=9358 t=1665571785172282'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>403 Forbidden</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>403 Forbidden</h2>\n        <p>You are not allowed to access this resource.</p>\n        <pre id="traceback"></pre>\n    <div id="powered_by">\n      <span>\n        Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n      </span>\n    </div>\n    </body>\n</html>\n'
2022-10-12 12:49:45,228:ERROR:WorkQueueReqMgrInterface:Unknown error processing cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646
Traceback (most recent call last):
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueueReqMgrInterface.py", line 108, in queueNewRequests
    units = queue.queueWork(workLoadUrl, request=reqName, team=team)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 641, in queueWork
    work = self.processInboundWork(inbound, throw=True)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1150, in processInboundWork
    self.reqmgrSvc.updateRequestStats(inbound['WMSpec'].name(), totalStats)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 206, in updateRequestStats
    self.updateRequestProperty(request, stats)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 223, in updateRequestProperty
    return self["requests"].put('request/%s' % request, propDict)[0]['result']
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 150, in put
    return self.makeRequest(uri, data, 'PUT', incoming_headers,
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 173, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 190, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
    return callFunc(callObj, newUrl, *args, **kwargs)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 351, in request
    raise exc
http.client.HTTPException: url=https://cmsweb.cern.ch:8443/reqmgr2/data/request/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646, code=403, reason=Forbidden, headers={'Date': 'Wed, 12 Oct 2022 10:49:45 GMT', 'Server': 'Apache', 'Content-Type': 'text/html;charset=utf-8', 'Content-Length': '750', 'X-Rest-Status': '200', 'X-Error-Http': '403', 'X-Error-Id': 'dfc9450c0d1a0103d975e1c4382f469c', 'X-Error-Detail': 'You are not allowed to access this resource.', 'X-Rest-Time': '1700.640 us', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=9358 t=1665571785172282'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>403 Forbidden</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>403 Forbidden</h2>\n        <p>You are not allowed to access this resource.</p>\n        <pre id="traceback"></pre>\n    <div id="powered_by">\n      <span>\n        Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n      </span>\n    </div>\n    </body>\n</html>\n'

And later, the splitting has been resumed:

...
2022-10-12 12:55:06,814:INFO:WorkQueue:Workflow cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646 has no OpenRunningTimeout. Queuing to be closed.
...

2022-10-12 12:55:41,209:INFO:WorkQueueReqMgrInterface:Processing request cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646 at https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/spec
2022-10-12 12:55:41,209:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646/spec"
2022-10-12 12:55:41,288:INFO:WorkQueue:Resume splitting of "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646"
2022-10-12 12:55:41,288:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
2022-10-12 12:55:41,311:INFO:WorkQueue:Request "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646" already split - Resuming
2022-10-12 12:55:41,312:INFO:WorkQueue:Split work for request(s): "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646"
2022-10-12 12:55:41,333:INFO:WorkQueueReqMgrInterface:1 units(s) queued for "cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-07316__v1_T_220707_151735_7646"

todor-ivanov commented 1 year ago

And the calls that are failing are exactly the ones that update the documents in CouchDB through the ReqMgr2 interface and add the splitting information to them:

  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 150, in put
    return self.makeRequest(uri, data, 'PUT', incoming_headers,

And looking closer it proves to be exactly as @amaltaro mentioned could be the case. There are two separate steps that are happening (and in this case failing):

https://github.com/dmwm/WMCore/blob/0f73a08146ae4195f707e21d49d12f5670061f06/src/python/WMCore/WorkQueue/WorkQueue.py#L1150


  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1150, in processInboundWork
    self.reqmgrSvc.updateRequestStats(inbound['WMSpec'].name(), totalStats)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 206, in updateRequestStats
    self.updateRequestProperty(request, stats)
  File "/data/srv/HG2210b/sw/slc7_amd64_gcc630/cms/workqueue/2.1.3.pre1/lib/python3.8/site-packages/WMCore/Services/ReqMgr/ReqMgr.py", line 223, in updateRequestProperty
    return self["requests"].put('request/%s' % request, propDict)[0]['result']

https://github.com/dmwm/WMCore/blob/0f73a08146ae4195f707e21d49d12f5670061f06/src/python/WMCore/WorkQueue/WorkQueue.py#L1136-L1138

And this happens only for workflows (or elements; I am still not sure which exactly this flag relates to) with the `continuous` flag set to False (`if not continuous: ...`), which is also visible in the logs:

2022-10-12 12:49:44,713:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False

This leads to the workflow transitioning without its statistics being updated from the GlobalWorkQueue. And since GWQ does not reiterate through workflows in the acquired state, the end result is the missing workflow parameters.

Now we can try two approaches here:

Even though this is now better understood, we also need to find out why we had those `code=403, reason=Forbidden`, `'X-Error-Detail': 'You are not allowed to access this resource.'` errors in the first place. @haozturk do you remember noticing something in that regard in the past?

amaltaro commented 1 year ago

@todor-ivanov sorry for not leaving some ideas on how to tackle this. This is what I would do for the short-term solution:

  1. get a list of workflows provided by Hasan
  2. retrieve the workflow description from ReqMgr2
  3. if the workflow has InputDataset AND not (LumiList, block lists, run lists) then proceed with this logic, otherwise log the workflow name and skip it for a future fix
  4. get the input dataset name and query DBS filesummaries API (which will give you a total number of lumis)
  5. using an API like this one: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/Service/Request.py#L420, update the 4 Total* properties, with correct values for total lumis and total events, leaving 0 for jobs and files.
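The selection in step 3 could be sketched roughly as follows. This is only an illustration: the function name `qualifies_for_dbs_fix` is hypothetical, and the keys `LumiList`, `BlockList` and `RunList` are assumed to match the summary dictionaries printed by the fix script later in this thread, which may differ from the actual ReqMgr2 field names.

```python
def qualifies_for_dbs_fix(request):
    """Return True if the workflow reads a whole input dataset, with no sub-selection.

    `request` is assumed to be a plain dict describing one workflow.
    """
    has_input = bool(request.get("InputDataset"))
    # Any non-empty lumi, block or run selection means that only a sub-set
    # of the dataset is processed, so DBS dataset totals would be wrong.
    has_subset = any(request.get(key) for key in ("LumiList", "BlockList", "RunList"))
    return has_input and not has_subset


def split_workflows(requests):
    """Separate workflows to fix now from those to log and skip for a future fix."""
    to_fix, skipped = [], []
    for request in requests:
        (to_fix if qualifies_for_dbs_fix(request) else skipped).append(request)
    return to_fix, skipped
```

Workflows ending up in `skipped` would only be logged by name, as described in step 3.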

Once you cover these, we can touch base again on how to deal with the other workflows:

Please let me know if you have any questions; and please let's review this script before running it for real.

todor-ivanov commented 1 year ago

Thanks for the hint @amaltaro. But the workflows that are suffering from the issue seem to be missing any InputDataset. Here I started working on a script to be used for the manual fix of the already broken workflows, and here [1] is the full list of workflows found, together with their type and input data. So we are not able to use any information from DBS, and we should maybe think about which pieces from GW need to be rerun in order to get the proper results for those missing parameters.

[1] badWfList.txt

amaltaro commented 1 year ago

Okay, then it will be even easier because we are now certain that these do not require any input sub-set (runs, blocks, lumis).

I had to check a couple of workflows to make sure that we perform the correct calculation, which has to use the following formula:

totalEvents = int(RequestNumEvents / FilterEfficiency)
totalLumis = math.ceil(totalEvents / EventsPerJob)  # we must round it up!

The final dictionary that we will have to post to ReqMgr2 will then be:

{"TotalInputEvents": totalEvents,  # must be integer
 "TotalInputLumis": totalLumis,  # must be integer
 "TotalEstimatedJobs": 1,  # let us just hard-code it to 1. Hasan doesn't care anyways
 "TotalInputFiles": 0}  # and this we keep hard-coded to 0, which is about right for MC anyways

todor-ivanov commented 1 year ago

Thanks for the hint @amaltaro : I found two more things while implementing your suggestion:


* There are also two workflows that are actually TaskChains with `SplittingAlgo: EventAwareLumiBased`, and which have an `InputDataset`. So it seems that for those I'll have to implement the queries to DBS after all. Here they are:

2022-11-21 13:53:18,151:INFO:fetchFromReqmgr: Found a workflow with NonEventBased Splitting algorithm:
2022-11-21 13:53:18,151:INFO:fetchFromReqmgr: {'EventsPerJob': 66976, 'EventsPerLumi': 66976, 'FilterEfficiency': 1, 'InputDataset': '/TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM', 'RequestName': 'pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319', 'RequestNumEvents': None, 'RequestType': 'TaskChain', 'SplittingAlgo': 'EventAwareLumiBased', 'SubRequestType': 'ReDigi', 'TotalEstimatedJobs': None, 'TotalInputEvents': None, 'TotalInputFiles': None, 'TotalInputLumis': None}
...
2022-11-21 13:53:18,158:INFO:fetchFromReqmgr: Found a workflow with NonEventBased Splitting algorithm:
2022-11-21 13:53:18,158:INFO:fetchFromReqmgr: {'EventsPerJob': 68899, 'EventsPerLumi': 68899, 'FilterEfficiency': 1, 'InputDataset': '/NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-1000_MY-80_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM', 'RequestName': 'pdmvserv_task_HIG-RunIISummer20UL16NanoAODAPVv9-06906__v1_T_221008_072906_6717', 'RequestNumEvents': None, 'RequestType': 'TaskChain', 'SplittingAlgo': 'EventAwareLumiBased', 'SubRequestType': 'ReDigi', 'TotalEstimatedJobs': None, 'TotalInputEvents': None, 'TotalInputFiles': None, 'TotalInputLumis': None}
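For reference, spotting an affected workflow from a summary dictionary like the ones logged here can be sketched as below. The helper name `missing_total_params` is hypothetical, and the input is assumed to be a plain dict with the same keys as in the log output above; this is not the code of the actual fix script.

```python
# The four statistics that are expected to be filled in for every workflow.
TOTAL_PARAMS = ("TotalEstimatedJobs", "TotalInputEvents", "TotalInputFiles", "TotalInputLumis")


def missing_total_params(request):
    """Return the names of the Total* parameters that are absent or None."""
    return [key for key in TOTAL_PARAMS if request.get(key) is None]
```

A workflow is considered affected whenever the returned list is not empty. A value of 0 is deliberately not treated as missing, since 0 is a legitimate value for files in MC workflows.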

amaltaro commented 1 year ago

Thank you for this correction, Todor. This:

totalEvents = int(RequestNumEvents / FilterEfficiency)
totalLumis = math.ceil(totalEvents / EventsPerLumi)

is indeed the correct formula.
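Put together with the dictionary to be posted to ReqMgr2, the calculation for workflows without an input dataset could look like the minimal sketch below. The function name `estimate_total_stats` and its argument names are made up for illustration; the formula and the hard-coded values for jobs and files are the ones discussed in this thread.

```python
import math


def estimate_total_stats(request_num_events, filter_efficiency, events_per_lumi):
    """Estimate the Total* statistics of a workflow that generates events from scratch."""
    total_events = int(request_num_events / filter_efficiency)
    # Round up: a partially filled last lumi section still counts as a lumi.
    total_lumis = math.ceil(total_events / events_per_lumi)
    return {"TotalInputEvents": total_events,
            "TotalInputLumis": total_lumis,
            "TotalEstimatedJobs": 1,   # hard-coded, as agreed above
            "TotalInputFiles": 0}      # no input files for MC from scratch


# Example with made-up numbers: 1000 requested events at 50% filter efficiency
# need 2000 generated events, which fill 7 lumis of 300 events each.
stats = estimate_total_stats(1000, 0.5, 300)
```

Note that this only applies to workflows with `RequestNumEvents` set; for workflows with an `InputDataset` the totals have to come from DBS instead.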

todor-ivanov commented 1 year ago

hi @amaltaro , Please take a look at the latest commit in https://github.com/dmwm/WMCore/pull/11366/ where I implemented the actual estimation of the missing statistics for workflows with EventAwareLumiBased splitting and InputDataset.

And here are the results for the two workflows I was mentioning before:

2022-11-22 14:46:28,454:INFO:fetchFromReqmgr: Found a workflow with EventAwareLumiBased Splitting algorithm:
2022-11-22 14:46:28,608:INFO:fetchFromReqmgr: 
{'BlockList': None,
 'EventsPerJob': 66976,
 'EventsPerLumi': 66976,
 'FilterEfficiency': 1,
 'InputDataset': '/TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM',
 'LumiList': {},
 'RequestName': 'pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319',
 'RequestNumEvents': None,
 'RequestType': 'TaskChain',
 'RunList': None,
 'SplittingAlgo': 'EventAwareLumiBased',
 'SubRequestType': 'ReDigi',
 'TotalEstimatedJobs': 992,
 'TotalInputEvents': 397395,
 'TotalInputFiles': 23,
 'TotalInputLumis': 401}
2022-11-22 14:46:28,609:INFO:fetchFromReqmgr: Found a workflow with EventAwareLumiBased Splitting algorithm:
2022-11-22 14:46:28,854:INFO:fetchFromReqmgr: 
{'BlockList': None,
 'EventsPerJob': 68899,
 'EventsPerLumi': 68899,
 'FilterEfficiency': 1,
 'InputDataset': '/NMSSM_XToYHTo2W2BTo2Q1L1Nu2B_MX-1000_MY-80_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODAPVv2-106X_mcRun2_asymptotic_preVFP_v11-v2/MINIAODSIM',
 'LumiList': {},
 'RequestName': 'pdmvserv_task_HIG-RunIISummer20UL16NanoAODAPVv9-06906__v1_T_221008_072906_6717',
 'RequestNumEvents': None,
 'RequestType': 'TaskChain',
 'RunList': None,
 'SplittingAlgo': 'EventAwareLumiBased',
 'SubRequestType': 'ReDigi',
 'TotalEstimatedJobs': 928,
 'TotalInputEvents': 215995,
 'TotalInputFiles': 31,
 'TotalInputLumis': 233}

amaltaro commented 1 year ago

@todor-ivanov the calculation is looking okay and I left some comments in your PR.

However, I would like to ask you to stick to the KISS principle. I see tons of over-complication and things being done in a likely unnecessary manner. For instance:

todor-ivanov commented 1 year ago

Thanks for the input @amaltaro

About:

* what is the filter input argument? Do we actually need it to fix these workflows?
* what is the mask input argument? Do we actually need it to fix these workflows?

I just started the script with a different idea and approach on how to execute it, but then I switched to a more generic construction, with the WMCore services instantiated directly in the script, which makes it possible to use them through an interactive shell. Anyway, those arguments became obsolete, but I did not bother to remove them because this script so far is not supposed to be merged. If you want, I can remove them. The rest of your comments were addressed in my latest commit.

todor-ivanov commented 1 year ago

Hi @amaltaro, I finally did the last bit of it, and I am now able to run the script in dryRun mode to go through the system and find all workflows which need to be fixed. I uploaded a log from such a run here: fixMissingStats.log

@haozturk could you take a quick look and see if this is all that Unified thinks needs to be fixed? I am going to upload the workflows' statistics in a single push, but that will be a one-time action. Once we upload the numbers, it will be much harder to find and fix them in case we have done something wrong. That said, I'd suggest checking a workflow or two just to see if things look reasonable. Thanks in advance!

amaltaro commented 1 year ago

@todor-ivanov thanks Todor. I looked into a couple of stats and they look good to me. Your update call seems to have collected the correct properties as well. From my side, you can go ahead and fix those stats.

todor-ivanov commented 1 year ago

Thanks @amaltaro, I just did so!

haozturk commented 1 year ago

I confirm that we don't see workflows w/ missing TotalInput[Events, Lumis, Files] anymore. Thanks a lot @todor-ivanov @amaltaro !

hassan11196 commented 3 months ago

Hi @amaltaro,

I found this old issue for the missing params problem; I am attaching the recently affected workflows here.

https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_B2G-Run3Summer22EEwmLHEGS-03985
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_B2G-Run3Summer22EEwmLHEGS-03740

amaltaro commented 3 months ago

@hassan11196 Hi Ahmed, thank you for finding this one out and reviving it with fresh cases. Throughout the week, someone will go through the logs and try to collect more insight.

hassan11196 commented 2 months ago

Hi @amaltaro, here are a few more affected workflows, grouped by Unified status.

assistance-manual-missingParam

assistance-missingParam-noRecoveryDoc

already posted in the above comment.

Thanks