Successful (no job failure) workflows w/ missing output on DBS

haozturk commented 1 year ago

Impact of the bug Workflow announcement

Describe the bug We see lots of successful workflows w/ missing output on DBS. I'm not sure whether this is due to delay or failure in DBS injection or a problem in job failure accounting.

How to reproduce it Here are some affected workflows:

If you need a complete list, please let me know. I can provide it.

Expected behavior If there is no job failure, we expect to see 100% output on DBS.

Additional context and error message None

vkuznet commented 1 year ago

And, I adjusted my tool to report input/output dataset stats, so for this workflow pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352 I have now full report:

[
   {
      "Workflow": "pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352",
      "TotalInputLumis": 31372,
      "InputDataset": "/LowEGJet/Run2017G-v1/RAW",
      "OutputDataset": "/LowEGJet/Run2017G-09Aug2019_UL2017-v2/AOD",
      "InputStats": {
         "num_lumi": 31372,
         "num_file": 23666,
         "num_event": 967230225,
         "num_block": 52
      },
      "OutputStats": {
         "num_lumi": 31372,
         "num_file": 12327,
         "num_event": 967230225,
         "num_block": 36
      }
   },
   {
      "Workflow": "pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352",
      "TotalInputLumis": 31372,
      "InputDataset": "/LowEGJet/Run2017G-v1/RAW",
      "OutputDataset": "/LowEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD",
      "InputStats": {
         "num_lumi": 31372,
         "num_file": 23666,
         "num_event": 967230225,
         "num_block": 52
      },
      "OutputStats": {
         "num_lumi": 31372,
         "num_file": 1904,
         "num_event": 967230225,
         "num_block": 23
      }
   }
]

So, the difference is obvious in terms of files and blocks,

amaltaro commented 1 year ago

@vkuznet Valentin, the difference of number of files and blocks between input and output is totally expected. There is no way that we can guarantee that to be the same (work distribution, event size, etc). What really matters here is:

number of lumis in the input vs output
(not strictly necessary) number of events in the input vs output
(helpful for debugging) number of valid and invalid files (but this is not as cheap as a single "filesummaries" call)

vkuznet commented 1 year ago

ok, and I pointed out that in workflow which I checked certainly there are some number of invalid files which I checked via files DBS API.

vkuznet commented 1 year ago

and, I adjusted the tool I wrote to report number of invalid files, e.g.

   {
      "Workflow": "pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352",
      "TotalInputLumis": 31372,
      "InputDataset": "/LowEGJet/Run2017G-v1/RAW",
      "OutputDataset": "/LowEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD",
      "InputStats": {
         "num_lumi": 31372,
         "num_file": 23666,
         "num_event": 967230225,
         "num_block": 52,
         "num_invalid_files": 0
      },
      "OutputStats": {
         "num_lumi": 31372,
         "num_file": 1904,
         "num_event": 967230225,
         "num_block": 23,
         "num_invalid_files": 1
      },
      "Status": "OK"
   }

The report now compares stats based on lumis/events info, but also reports if given dataset has some invalid files which I obtained via DBS files API calls. So, the above workflow contains in output dataset some invalid files.

haozturk commented 1 year ago

I checked one affected workflow [1] with @vkuznet 's tool. I see that nLumis in the output and the input match (25332) and there is no invalid files in either datasets. The problem is that TotalInputLumis doesn't match: 25542. Do we understand why?

Another example [2]. nLumis in the output and the input is 100 whereas it's reported as 179

[1] pdmvserv_Run2022E_DisplacedJet_PromptNanoAODv10_v1_221017_083709_9356 [2] pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960

amaltaro commented 1 year ago

Reopening as there are still ongoing discussions.

In order to answer this question, the best we can do is to check the Global WorkQueue logs (reqmgrInteraction CP thread) and try to match all the cases where extra work has been added to the same workflow.

I don't think it's impossible that we count the same block stats multiple times (when iterating over a workflow that has been already splitted in GQ), of course it should not happen and it's just a wild guess of what could have gone wrong in these workflows.

@vkuznet can you please try to scan the global workqueue logs for one of these workflows?

vkuznet commented 1 year ago

@amaltaro , you put too much faith in my abilities. At least I need to know where GWQ log is, is it on production server (cmsweb.cern.ch)? Is it available on vocms0750? Does it called workqueu*.log? Then, it would be more useful to know which WMCore generates specific log entries, how nlumis are calculated which which log entries are generated.

I'm asking because I assume that GWQ is on cmsweb, and its log on vocms0750, and its log is called workqueue*.log, if so than tehre is nothing in there:

# from vocms0750
grep pdmvserv_Run2022E_DisplacedJet_PromptNanoAODv10_v1_221017_083709_9356 /cephfs/product/dmwm-logs/workqueue*.log

return nothing.

At least I need to know more information about GWQ and which patterns to look at and at which log(s).

vkuznet commented 1 year ago

I made an effort to look-up given workflow in workflow log. Here I made several assumptions which may or may not be true:

I assumed that global workqueue logs are called workflow*.log and they can be located on vocms0750
I look up pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 workflow in reqmgr2 and found that it was processed from 2022-10-03 till 2022-10-06, see here
I found workflow*.logs for these days in vocms070 in /cephfs/product/dmwm-logs/old-logs-20221001-0359.zip log
I unpacked the zip archive to extract workflow logs and scanned those dates

what I found is the following entries:

grep pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 wq/*.log
wq/workqueue-20221006-workqueue-545bc88678-6j7hh.log:INFO:cleanUpTask:Going to delete 1 documents in *workqueue* db for workflow: pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960. Doc IDs: ['a3e6145deeaaf42fb300af39ac515a3c']
wq/workqueue-20221006-workqueue-545bc88678-6j7hh.log:INFO:cleanUpTask:Going to delete 2 documents in *workqueue_inbox* db for workflow: pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960. Doc IDs: ['a3e6145deeaaf42fb300af39ac515a3c', 'pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960']

There is no any other information about given workflow and I have no other clue where to look it up.

@amaltaro , please provide further instructions how to deal with it. Better, as I requested, it would be nice to look at specific codebase which does the initial estimate of nlumis numbers and see the logic as well as if it logs anything into the log.

amaltaro commented 1 year ago

Valentin, see further comments and instructions below:

I assumed that global workqueue logs are called workflow*.log and they can be located on vocms0750

yes, logs are available in vocms0750:/ceph/production/dmwm-logs. Given that we only keep the last 4 or 5 days worth of logs, and older logs get zipped, you will actually find these logs in the non-deterministic zipped files, e.g.: old-logs-20221001-0359.zip --> reqmgrInteractionTask-workqueue-545bc88678-6j7hh-20221005.log

I look up pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666v1_T_221003_184201_4960 workflow in reqmgr2 and found that it was processed from 2022-10-03 till 2022-10-06, see [here](https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666v1_T_221003_184201_4960)

this is the way to move forward, checking when the statuses transition happened. Given that this one is about global workqueue, we don't really care about the whole lifetime of the workflow, but solely the transition from staging to staged, which is the moment that this workflow is executed in global workqueue (exception to growing workflows, that can keep acquiring data for longer period of time). In short, for this workflow we only care about date 20221005.

Expanding on the content of that reqmgrInteractionTask log file (mentioned above):

2022-10-05 21:13:43,230:INFO:WorkQueueReqMgrInterface:Processing request pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 at https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/spec
2022-10-05 21:13:43,230:INFO:WorkQueue:queueWork() begin queueing "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache/pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/spec"
2022-10-05 21:13:43,450:INFO:WorkQueue:Executing processInboundWork with 1 inbound_work, throw: True and continuous: False
2022-10-05 21:13:43,542:INFO:WorkQueue:Splitting /pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/EXO-RunIISummer20UL17NanoAODv9-02666_0 with policy name Dataset and policy params {'name': 'Dataset', 'args': {}}
2022-10-05 21:13:43,962:INFO:WorkQueue:Work splitting completed with 1 units, 0 rejectedWork and 0 badWork
2022-10-05 21:13:43,962:INFO:WorkQueue:Queuing element a3e6145deeaaf42fb300af39ac515a3c for /pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960/EXO-RunIISummer20UL17NanoAODv9-02666_0 with policy Dataset, with 2 job(s) and 179 lumis on /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM
2022-10-05 21:13:49,576:INFO:WorkQueue:Split work for request(s): "pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960"
2022-10-05 21:13:49,598:INFO:WorkQueueReqMgrInterface:1 units(s) queued for "pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960"
...
2022-10-05 23:43:31,471:INFO:WorkQueue:Workflow pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 has no OpenRunningTimeout. Queuing to be closed.

Last line of this log represents the moment the workflow gets closed for further input data, so no more blocks/stats can be added to it.

In these logs, we can see that indeed 179 lumis was found/calculated. I had a quick look into the DBS entries, and my script reports 0 files marked as invalid (and filesummaries indeed says 100 lumis), so something is very wrong with this workflow.

In terms of source code, this code is quite complex, but it's performed by a Global Workqueue cherrypy app that starts up from this module: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueReqMgrInterface.py and after loading the workflow spec and figuring some parameters like the WorkQueue start policy, it will execute one of these modules (again, it depends on the workflow construction: no input, with input, with input MINIAOD, harvesting workflow, ACDC): https://github.com/dmwm/WMCore/tree/master/src/python/WMCore/WorkQueue/Policy/Start

Maybe the next step we can do is, to clone this workflow into one of our dev setup and see if global workqueue would again find the 179 lumis.

vkuznet commented 1 year ago

@amaltaro, I looked up the code and can reproduce 179 nlumis number. The code Start/Policy/Dataset.py calls validateBlocks function which by itself calls getDBSSummaryInfo function for provided block name. So, our dataset is

/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM

it has 4 blocks:

./dasgoclient -query="block dataset=$d"
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#471e5596-af04-4423-a850-5ef9091f154f
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#6eb03689-167a-472f-8b09-f4bfadad6a8a
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b8cdec8f-b664-49a6-ab2d-bb2a89893581
/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#ff78bb73-0e8c-41cb-9e51-381cfbdf15e2

and getDBSSummaryInfo calls filesummaries DBS API:

vk@vkair(10:30:56)$ blk1=/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM%23471e5596-af04-4423-a850-5ef9091f154f
[~/CMS/DMWM/GIT/wflow-dbs, main+1]
vk@vkair(10:30:59)$ blk3=/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM%23b8cdec8f-b664-49a6-ab2d-bb2a89893581
[~/CMS/DMWM/GIT/wflow-dbs, main+1]
vk@vkair(10:31:19)$ blk4=/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM%23ff78bb73-0e8c-41cb-9e51-381cfbdf15e2
[~/CMS/DMWM/GIT/wflow-dbs, main+1]
vk@vkair(10:31:37)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk1"
[
{"file_size":9407933969,"max_ldate":1664655121,"median_cdate":1664655121,"median_ldate":1664655121,"num_block":1,"num_event":111000,"num_file":4,"num_lumi":97}
]
[~/CMS/DMWM/GIT/wflow-dbs, main+1, 1s]
vk@vkair(10:31:42)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk2"
[
{"file_size":5638876233,"max_ldate":1662652610,"median_cdate":1662652610,"median_ldate":1662652610,"num_block":1,"num_event":72000,"num_file":1,"num_lumi":72}
]
[~/CMS/DMWM/GIT/wflow-dbs, main+1, 1s]
vk@vkair(10:31:44)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk3"
[
{"file_size":726640358,"max_ldate":1664151901,"median_cdate":1664151901,"median_ldate":1664151901,"num_block":1,"num_event":9000,"num_file":1,"num_lumi":9}
]
[~/CMS/DMWM/GIT/wflow-dbs, main+1, 1s]
vk@vkair(10:31:47)$ scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?block_name=$blk4"
[
{"file_size":105745614,"max_ldate":1664655121,"median_cdate":1664655121,"median_ldate":1664655121,"num_block":1,"num_event":1000,"num_file":1,"num_lumi":1}
]

If you sum-up num_limis across all blocks you'll get 179 :)

>>> 97+72+9+1
179

While, the filesummaries API for dataset returns 100:

scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/filesummaries?dataset=$d"
[
{"file_size":15879196174,"max_ldate":1664655121,"median_cdate":1664655121,"median_ldate":1664655121,"num_block":4,"num_event":193000,"num_file":7,"num_lumi":100}
]

So, the issue here is that filesummaries API for dataset and blocks provides different results. I looked up DBS queries and they differ as following:

dataset nlumis lookup

(select count(*) from (select distinct l.lumi_section_num, l.run_num from {{.Owner}}.files f
join {{.Owner}}.file_lumis l on l.file_id=f.file_id
join {{.Owner}}.datasets d on d.DATASET_ID = f.dataset_id
{{if .Valid}}
JOIN {{.Owner}}.DATASET_ACCESS_TYPES DT ON  DT.DATASET_ACCESS_TYPE_ID = D.DATASET_ACCESS_TYPE_ID
{{end}}
where d.dataset=:dataset wheresql_isFileValid)
) as num_lumi

block nlumis lookup

(select count(*) from (select distinct l.lumi_section_num, l.run_num from {{.Owner}}.files f
join {{.Owner}}.file_lumis l on l.file_id=f.file_id
join {{.Owner}}.blocks b on b.BLOCK_ID = f.block_id
{{if .Valid}}
JOIN {{.Owner}}.DATASETS D ON  D.DATASET_ID = F.DATASET_ID JOIN {{.Owner}}.DATASET_ACCESS_TYPES DT ON  DT.DATASET_ACCESS_TYPE_ID = D.DATASET_ACCESS_TYPE_ID
{{end}}
where b.BLOCK_NAME=:block_name wheresql_isFileValid)
) as num_lumi

The difference is additional join for DATASET_ACCESS_TYPES table in dataset query. I need time to investigate further DBS queries as I inherited them from Python based code. I'll report later with more results on how DBS queries differ for data and blocks, but obviously it is the root of the problem reported here.

amaltaro commented 1 year ago

Updating the comment above that tagged the "wrong" Alan (sorry about that!)

amaltaro commented 1 year ago

@vkuznet without looking carefully into this. My hypothesis is that filesummaries is actually returning UNIQUE tuples of run/lumi:

distinct l.lumi_section_num, l.run_num

which means that there are files in different blocks (and maybe even in the same block) that have exactly the same run/lumi tuple. Hence DBS returns a smaller amount of lumis when queried by dataset.

Given that this dataset is pretty small, I would suggest to retrieve all the run/lumis for all files in this dataset, order them and look how many duplicates we have (if any).

vkuznet commented 1 year ago

yes, this is the case different blocks have same run/lumi tuples. And, sorting them and taking unique set gives me 100 unit lumis. So the mystery is solved.

Said that, the remedy in WMCore should be the following:

fix validBlocks function call in Start/Policy/Dataset.py to call instead of filesummaries DBS API in getDBSSummaryInfo function to use filelumis DBS API which will provide list of files and run/lumis.
collect all results, and extract unit number of run/lumi pairs

For example:

# for block b1, get this output
scurl "https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader/filelumis?block_name=$b1"
# it will provide this JSON
[
{"event_count":1000,"logical_file_name":"/store/mc/RunIISummer20UL17MiniAODv2/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/60000/99CB45DF-2F92-4249-B57C-81E777C33EEB.root","lumi_section_num":34,"run_num":1}
,{"event_count":1000,"logical_file_name":"/store/mc/RunIISummer20UL17MiniAODv2/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/60000/F2C290E2-1D56-4F46-A590-2453E052FF52.root","lumi_section_num":1,"run_num":1}
...
]

Now, repeat this for all blocks, and extract run/lumis pairs. Then make a set of this list and take its size.

Here is a simple python code which does exactly that and it returns 100 as expected:

#!/usr/bin/env python3
import os
from WMCore.Services.pycurl_manager import RequestHandler

def blockLumis(blocks):
    mgr = RequestHandler()
    pairs = set()
    for blk in blocks:
        blk = blk.replace('#', '%23')
        url = 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader/filelumis?block_name={}'.format(blk)
        ckey = os.getenv('X509_USER_KEY')
        cert = os.getenv('X509_USER_CERT')
        data = mgr.getdata(url, params={}, headers={'Accept': 'application/json'}, ckey=ckey, cert=cert, decode=True)
        for row in data:
            pair = (row['lumi_section_num'], row['run_num'])
            pairs.add(pair)
    return len(pairs)

blocks = [
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#471e5596-af04-4423-a850-5ef9091f154f',
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#6eb03689-167a-472f-8b09-f4bfadad6a8a',
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b8cdec8f-b664-49a6-ab2d-bb2a89893581',
    '/LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#ff78bb73-0e8c-41cb-9e51-381cfbdf15e2'
]

res = blockLumis(blocks)
print(res)

amaltaro commented 1 year ago

Thank you for this investigation, Valentin.

Could you please check if the output dataset: /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM

also has those 179 lumis? If a merge job has multiple files with the same run/lumi, then the output would carry the unique information. But if the unmerged run/lumi is scattered in different merge jobs, then I think it's possible that the output dataset would have duplicate run/lumis. In that case, using the filesummaries for input discovery isn't wrong.

vkuznet commented 1 year ago

In this issue https://github.com/dmwm/WMCore/issues/11403#issue-1500301903 I provided two examples of python functions, blockLumis and concurrentBlockLumis, which can be used to check number of lumis and avoid duplicates. I tested both functions and concurrentBlockLumis only requires time for single block call and therefore will be much more efficient with large number of blocks in a dataset. If necessary both function can be added to Start/Policy/Dataset.py but it will require fix in pycurl_manager I provided over here https://github.com/dmwm/WMCore/pull/11404

vkuznet commented 1 year ago

@amaltaro , regarding NANOAODSIM we also have discrepancy. The nlumis for dataset is 100, the dataset has two blocks, and their sum of nlumis is 32+99=131. But using my code from https://github.com/dmwm/WMCore/issues/11403#issue-1500301903 both functions blockLumis and concurrentBlockLumis properly reports 100 lumis for provided blocks.

How would you like to move forward with this? My suggestion to add both blockLumis and concurrentBlockLumis functions to Start/Policy/Dataset.py and either switch to either of them or add UniqueNumLumis attribute to the outgoing JSON to avoid mess with counting unique lumis for list of blocks.

vkuznet commented 1 year ago

@amaltaro , is there anything left for this issue? My understanding that we fully debugged the issue and now understand its cause. We provided tools (either wmcore python script or wflow-dbs service) to data-ops, and I wonder if we need to keep open this issue. If so, it would be nice to list actions items required to move forward with this issue. Thanks.

amaltaro commented 1 year ago

Yes, I think we can declare this issue as resolved. In the future, we still have to think in a more sustainable way to find how many total and how many unique lumis were expected to be processed (which might be different between the beginning and end of the workflow lifetime as well).

@haozturk please reopen it in case there is anything else missing.

amaltaro commented 1 year ago

Qier was asking about the following workflow: pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871

which keeps coming back in operations as "noRecoveryDoc", thus without any ACDC documents to be recovered.

I decided to run it over the service that Valentin deployed in cmsweb-testbed and here is the output:

$ curl -k --cert $X509_USER_CERT --key $X509_USER_KEY --cacert $X509_USER_CERT "https://cmsweb-testbed.cern.ch/wflow-dbs/stats?workflow=pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871"
[
   {
      "Workflow": "pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871",
      "TotalInputLumis": 27653,
      "InputDataset": "/MET/Run2018C-15Feb2022_UL2018-v1/AOD",
      "OutputDataset": "/MET/Run2018C-UL2018_MiniAODv2_GT36-v1/MINIAOD",
      "InputStats": {
         "num_lumi": 27653,
         "num_file": 1188,
         "num_event": 31219922,
         "num_block": 29,
         "num_file_lumis": 20957,
         "unique_file_lumis": 20957,
         "filesummaries_lumis": 27653,
         "num_invalid_files": 1187
      },
      "OutputStats": {
         "num_lumi": 27605,
         "num_file": 432,
         "num_event": 31144738,
         "num_block": 14,
         "num_file_lumis": 27503,
         "unique_file_lumis": 27503,
         "filesummaries_lumis": 27605,
         "num_invalid_files": 431
      },
      "Status": "WARNING: number of lumis differ 27653 != 27605, number of events differ 31219922 != 31144738",
      "ElapsedTime": 6.685214328
   }

from the report above, it looks like the input data contains thousands of duplicate lumis (based on unique_file_lumis).

Actually, having a second look at these input metrics:

         "num_file": 1188,
         "num_invalid_files": 1187

it looks like the input dataset only has 1 valid file(!). I did a spot check and I think this is actually wrong, given that all 6 files in this block are actually valid: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files?block_name=/MET/Run2018C-15Feb2022_UL2018-v1/AOD%23130e6704-0fa9-4675-848a-e80345d94640&detail=true

@vkuznet could you please review how you count those (output stats seem to be miscounted as well)?

vkuznet commented 1 year ago

Alan, yes there was an error (mis-match valid vs invalid in DBS API query). Now server is fixed and reports:

scurl "https://cmsweb-testbed.cern.ch/wflow-dbs/stats?workflow=pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871"
[
   {
      "Workflow": "pdmvserv_Run2018C_MET_UL2018_MiniAODv2_GT36_220415_083746_3871",
      "TotalInputLumis": 27653,
      "InputDataset": "/MET/Run2018C-15Feb2022_UL2018-v1/AOD",
      "OutputDataset": "/MET/Run2018C-UL2018_MiniAODv2_GT36-v1/MINIAOD",
      "InputStats": {
         "num_lumi": 27653,
         "num_file": 1188,
         "num_event": 31219922,
         "num_block": 29,
         "num_file_lumis": 27118,
         "unique_file_lumis": 27118,
         "filesummaries_lumis": 27653,
         "num_invalid_files": 0
      },
      "OutputStats": {
         "num_lumi": 27605,
         "num_file": 432,
         "num_event": 31144738,
         "num_block": 14,
         "num_file_lumis": 27541,
         "unique_file_lumis": 27541,
         "filesummaries_lumis": 27605,
         "num_invalid_files": 0
      },
      "Status": "WARNING: number of lumis differ 27653 != 27605, number of events differ 31219922 != 31144738",
      "ElapsedTime": 8.060713354
   }
]

z4027163 commented 1 year ago

Do you know what's the problem about this workflow? The num_invalid_files is 0, so it doesn't look like the invalidate issue.

amaltaro commented 1 year ago

@z4027163 issue with that workflow is not related to invalid files, but to amount of unique (or duplicate) run/lumis in the input, see:

         "num_lumi": 27653,
         "unique_file_lumis": 27118,

from the report above. Does it answer the remaining question that was reported at the CompOps meeting?

z4027163 commented 1 year ago

@vkuznet Can you give more details of what's the meaning of "unique_file_lumis"? I am a bit supprised that the output has a higher value than the input dataset.

vkuznet commented 1 year ago

@z4027163 , it sets over here: https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L37 and calculated in this function https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L194 But in plain English it is number of lumis returned by filelumis DBS API for a given block name (I resolved a dataset into block names, then query filelumis for every block and calculate unique number of run-lumi pairs).

amaltaro commented 1 year ago

@vkuznet according to the code, is it correct to say that unique_file_lumis and num_file_lumis contain the same information?

vkuznet commented 1 year ago

@amaltaro , the num_file_lumis represents total number of run-lumis pairs from filelumis API for all blocks in a dataset. While unique_file_lumis represents unique number of run-lumis from filelumis API for all blocks in a dataset. They may be the same or may differ. But, yes they contain similar information. Please see their assignments in a code:

https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L216
where uniqRunLumis function is defined here https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L222

dmwm / WMCore

Successful (no job failure) workflows w/ missing output on DBS #11358