dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Single workflow with duplicated lumi sections #9432

Closed amaltaro closed 4 years ago

amaltaro commented 5 years ago

Impact of the bug WMAgent

Describe the bug @vlimant reported via Slack that the workflow pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301 has duplicate lumi sections, even though there are no other workflows (ACDC and the like) writing to the same output datasets. We need to investigate it and find out whether:
a) the agent really assigned the same lumi section to multiple jobs;
b) or it was another problem (maybe something wrong with a merge);
c) and find the jobs that produced those lumis, check the logs and the FJR.

How to reproduce it No clue!

Expected behavior No duplicate lumi sections should ever be created!

Additional context and error message Report from Unified: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-RunIISummer19UL17wmLHEGEN-00493&module=checkor&limit=50&size=1000

Dima's page: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer19UL17wmLHEGEN-00493

amaltaro commented 5 years ago

Adding some extra information...

According to unified logs, these were the 2 invalidated files (and their respective lumi sections):
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root
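Those lumi lists come from the DBS filelumis API; a minimal python sketch of the query (assuming X509 cert auth and DBS3's run_num/lumi_section_num response fields; the cert paths are placeholders):

import requests

DBS_URL = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader"

def file_lumis(lfn, cert=("usercert.pem", "userkey.pem")):
    """Return sorted (run, lumi) pairs for a file from DBS filelumis."""
    resp = requests.get(DBS_URL + "/filelumis",
                        params={"logical_file_name": lfn},
                        cert=cert)
    resp.raise_for_status()
    return sorted((row["run_num"], row["lumi_section_num"])
                  for row in resp.json())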

wmstats claims that this workflow was processed by the following 3 agents (again, no ACDCs or other workflows writing to the same GEN dataset): vocms0252.cern.ch, vocms0254.cern.ch, vocms0253.cern.ch

Checking the dbsbuffer tables on those 3 agents, querying for any files matching that LFN pattern that produced any of those 7 lumi sections:

select df.lfn, dfrm.* from dbsbuffer_file df
  INNER JOIN dbsbuffer_file_runlumi_map dfrm ON df.id = dfrm.filename
  WHERE df.lfn like '/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/%'
  AND dfrm.lumi in (3226, 3231, 351, 249, 1381, 3199, 917) ORDER BY df.lfn, dfrm.lumi;

the result is:

vocms0252:
LFN                                                     |  FILEID  |  RUN  |  LUMI  | NUM_EVENTS
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root  500282  1   249 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root  500282  1   351 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root  500282  1   1381    300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root  500282  1   3199    300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root  500282  1   3226    300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root  500282  1   3231    300

vocms0253:
LFN                                                     |  FILEID  |  RUN  |  LUMI  | NUM_EVENTS
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/239E7C08-053C-3544-AD9E-DF30AB97E5C4.root  514778  1   917 96
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/567BF783-3D2D-CD46-A2B5-83821C2C5B89.root  517550  1   249 96
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/5DC32B3F-F244-D540-B640-0508B20908FB.root  514183  1   1381    97
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/97972246-0E36-5E43-9F52-4368FFD47DE5.root  513983  1   351 109
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/AAB0B18B-BAE0-4243-8B2F-1834280C925D.root  514764  1   3226    97
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/AAB0B18B-BAE0-4243-8B2F-1834280C925D.root  514764  1   3231    98
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/EA04CEEB-C101-3E41-967D-2A35A81E6D68.root  514262  1   3199    101

vocms0254:
LFN                                                     |  FILEID  |  RUN  |  LUMI  | NUM_EVENTS
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root  428389  1   917 118

In short, vocms0253 produced all 7 lumis, vocms0254 produced 1 lumi, and vocms0252 produced 6 lumis.

The next step is to investigate what the input WQE lumi masks were and which jobs produced those lumis.

amaltaro commented 5 years ago

Ok, now checking what the input file (MCFakeFile) was and what its lumi mask contained, with this query:

select wfd.lfn, wfrm.* from wmbs_file_details wfd
  INNER JOIN wmbs_file_runlumi_map wfrm ON wfd.id = wfrm.fileid
  where wfd.lfn like 'MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301%'
  AND wfrm.lumi IN (3226, 3231, 351, 249, 1381, 3199, 917) ORDER by wfrm.lumi;

and surprise! There are no results in vocms0252 and vocms0254. Only vocms0253 was supposed to have processed those lumis (all from the same MCFakeFile, thus the same WQE):

LFN  |  FILEID  |  RUN  |  LUMI  |  NUM_EVENTS
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   249 
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   351 
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   917 
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   1381    
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   3199    
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   3226    
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1   3231    

amaltaro commented 5 years ago

Okay! Now that I managed to resume investigating it... If I hand-pick one of those duplicate lumi sections, e.g. either 249 or 917, and check for jobs that had those lumi sections assigned, I cannot find any results except in vocms0253, which is the agent that really acquired WQEs including those lumi sections.

So running a query like

select wjm.job, wjm.firstrun, wjm.lastrun, wjm.firstlumi, wjm.lastlumi from wmbs_job_mask wjm
  INNER JOIN wmbs_job wj ON wjm.job = wj.id
  WHERE wjm.firstlumi>900 and wjm.lastlumi<1000 
  and wj.cache_dir like '%/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/HIG-RunIISummer19UL17wmLHEGEN-00493_0/JobCollection_%';

yields these results in vocms0253 (nothing on the other two agents):

JOB  |  FIRSTRUN  | LASTRUN  | FIRSTLUMI  | LASTLUMI
1980252 1   1   913 928
1980253 1   1   929 944
1980254 1   1   945 960
1980255 1   1   961 976
1980256 1   1   977 992

which sort of makes me believe we've also got a lumi section (from the FJR) that we didn't expect.

Checking the least painful way to find logs for those files now...

amaltaro commented 5 years ago

Okay, looking at one of the files that got invalidated (with lumis (3226, 3231, 351, 249, 1381, 3199)):

/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root

it was part of this block:

/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/RunIISummer19UL17wmLHEGEN-106X_mc2017_realistic_v6-v1/GEN#2b1c32bd-87cb-46d6-ba3e-66b8aa90a465

which had a single file, the one invalidated above, and was produced at T2_PK_NCP.

With this information, and checking when the file was injected into PhEDEx (against PhEDExInjector logs), I could reduce the time window to around Nov 3rd to 5th. Now using the following query in es-cms.cern.ch:

ScheddName:vocms0252.cern.ch AND CMS_JobType:Merge AND Site:T2_PK_NCP AND WMAgent_RequestName:pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301

I found a few jobs, and finally the one that created the file with 6 duplicate lumis!!!!

This was the Merge job that produced the duplicate lumis in vocms0252:

/data/srv/wmagent/current/install/wmagent/JobArchiver/logDir/p/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/JobCluster_2124/Job_2124060.tar.bz2

and I checked EVERY single input unmerged file against the wmbs_file_runlumi_map table; none of their lumi sections corresponds to the list of lumi sections in the output file (from the WMAgent Report file).

The logArchive for this merge job was put at:

/store/unmerged/logs/prod/2019/11/4/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/HIG-RunIISummer19UL17wmLHEGEN-00493_0/HIG-RunIISummer19UL17wmLHEGEN-00493_0MergeRAWSIMoutput/0000/4/2807d08a-612d-4fb0-952f-f642fba819f0-0-4-logArchive.tar.gz

so we should look into that and see what the CMSSW framework job report says. Another option would be to actually open this merged file and look at its content (but it got invalidated already, so that's not an option anymore).

A dump of the job report can be found here: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/reportDump_2165285.txt

amaltaro commented 5 years ago

Trying to put my hands on that logArchive, to see what the FJR says. If it reports the same lumis that got reported in the Report.pkl file, then our last option would be to open the (invalidated) root file itself. Which, AFAIK, is no longer available, so nothing else can be done (nor the root cause understood).

FYI @vlimant

vlimant commented 5 years ago

the file is still out there; xrdcp got it to /afs/cern.ch/user/v/vlimant/public/ops/50C82023-B12A-794A-9638-116C5FD179C6.root

vlimant commented 5 years ago

the interesting part is with

/store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root

that indeed contains non-consecutive lumi sections, with an atypical number of events (300) for a configuration that has a filter efficiency.

What you need is the log of the job that produced that file

vlimant commented 5 years ago

there are no logs of the job that produced that file under /eos/cms/store/unmerged/logs/prod/2019/11/4/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/HIG-RunIISummer19UL17wmLHEGEN-00493_0/00//*

amaltaro commented 5 years ago

And here is the WMCore log: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/Job_2046469.tar.bz2

for the job that produced the unmerged file /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root

Here is a dump of the Report.pkl parsed by the agent: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/reportDump_2046469.txt

and from the wmagentJob.log, this was the job mask:

'mask': {'LastRun': 1, 'LastLumi': 747584, 'FirstRun': 1, 'inclusivemask': True, 'runAndLumis': {}, 'LastEvent': 458268992, 'FirstEvent': 458259185, 'jobID': 2046469, 'FirstLumi': 747569}

which matches with what gets reported to both the LHE and the GEN files.
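As a quick sanity check, treating the boundaries as inclusive (per 'inclusivemask': True above), that mask spans exactly 16 lumi sections and 9808 events; a sketch, not WMCore's actual accounting code:

mask = {'FirstRun': 1, 'LastRun': 1,
        'FirstLumi': 747569, 'LastLumi': 747584,
        'FirstEvent': 458259185, 'LastEvent': 458268992}

# inclusive boundaries on both ends
n_lumis = mask['LastLumi'] - mask['FirstLumi'] + 1      # 16
n_events = mask['LastEvent'] - mask['FirstEvent'] + 1   # 9808
print(n_lumis, n_events)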

@vlimant how did you check it has non-consecutive lumi sections?

At this point, I think something funky happened in the framework... I'm still trying to put my hands on that log.

amaltaro commented 5 years ago

Apparently the CMSSW log tarball for the job that created the unmerged file will never be found. The LogCollect job failed to access the logArchive files, but the Cleanup didn't fail to delete them!

I'm done looking for logs. My previous comment makes it clear that the agent passed the right lumi mask to the PSet. IF it hadn't, then we should have seen duplicates in the LHE dataset as well (both LHE and GEN were produced by the same job).

amaltaro commented 5 years ago

Ok, I managed to fetch all the logs we might need and they are all placed here: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/vocms0254/

In short, the production job was set up to produce 16 lumis, from 538273 to 538288, and that's exactly what I see in the output WMCore report, in the CMSSW FJR and in the database (for the unmerged file).

Then this file went to a merge job (with a few other files that got skipped); the merge job has no lumi mask, just a list of unmerged files as input. The output WMCore report reports lumi 917, the CMSSW FJR also reports 917, and so does the database. The question is: where does that lumi come from?

I also managed to download that unmerged file and here are its lumi sections

amaltaro@lxplus731:~/CMSSW_10_6_4/src $ edmLumisInFiles.py 81CB9803-B71C-464B-AD3F-5B53605EF41E.root
{"1": [[538273, 538288]]}

I guess we need to hand it over to the Framework team.

vlimant commented 5 years ago

"non consecutive lumisections" from http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/reportDump_2165285.txt


                       'input': {'source': [{'catalog': '',
                                             'events': 1800,
                                             'guid': '1C1EB1B2-DE76-E711-8E08-0025905A60B6',
                                             'input_source_class': 'PoolSource',
                                             'input_type': 'primaryFiles',
                                             'lfn': '/store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root',
                                             'module_label': 'source',
                                             'pfn': 'root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root',
                                             'runs': {'1': {249: None,
                                                            351: None,
                                                            1381: None,
                                                            3199: None,
                                                            3226: None,
                                                            3231: None}}}]},

amaltaro commented 5 years ago

I think it's time to label you here, Chris @Dr15Jones. The short story was posted here: https://github.com/dmwm/WMCore/issues/9432#issuecomment-557911245

Could you please have a look at those logs (or tag someone else who can) to see whether it's some nasty bug in CMSSW, or whether I'm missing some detail in WMCore?

Honestly, I don't see that lumi section coming from anything managed by WMCore. Please let me know if you want a specific issue created in the cmssw repo too.

Dr15Jones commented 5 years ago

FYI @bbockelm

So looking at the log from 6420a2d1-6f71-4aea-bbb7-955e67a1db05-0-2-logArchive.tar.gz we see that the first 6 files passed to the job all fail to open and are skipped, e.g.

06-Nov-2019 17:55:12 UTC  Initiating request to open file root://se01.grid.nchc.org.tw//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root
[2019-11-06 17:55:12.281465 +0000][Error  ][XRootD            ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_query (code: kXR_Qconfig, arg length: 9): [FATAL] Connection error
[2019-11-06 17:55:12.281561 +0000][Error  ][XRootD            ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Connection error
06-Nov-2019 17:55:12 UTC  Fallback request to file root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root
[2019-11-06 17:56:27.285088 +0000][Warning][XRootD            ] [xrootd-cms.infn.it:1094] Redirect limit has been reached for message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=+1413xrootd.ba.infn.it,se01.grid.nchc.org.tw,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it&triedrc=enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ), the last known error is: [ERROR] Error response: No such file or directory
%MSG-w XrdAdaptorInternal:  file_open 06-Nov-2019 17:56:27 UTC pre-events
Failed to open file at URL root://xrootd-cms.infn.it:1094//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=+1413xrootd.ba.infn.it,se01.grid.nchc.org.tw.
%MSG
[2019-11-06 17:57:40.294654 +0000][Warning][XRootD            ] [xrootd-cms.infn.it:1094] Redirect limit has been reached for message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=+1413xrootd.ba.infn.it,se01.grid.nchc.org.tw,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it&triedrc=enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ), the last known error is: [ERROR] Error response: No such file or directory
%MSG-w XrdAdaptorInternal:  file_open 06-Nov-2019 17:57:40 UTC pre-events
Failed to open file at URL root://xrootd-cms.infn.it:1094//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=.
%MSG
%MSG-w :  AfterFile 06-Nov-2019 17:57:40 UTC pre-events
Input file: root://se01.grid.nchc.org.tw//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root was not found or could not be opened, and will be skipped.

Then the 7th one fails to open at se01.grid.nchc.org.tw, gets redirected to root://xrootd-cms.infn.it but then reports

%MSG-w XrdAdaptor:  file_open 06-Nov-2019 17:57:52 UTC pre-events
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
%MSG

and then starts processing Run 1, Event 1605760, LumiSection 917

The full part of the log related to that is

%MSG
06-Nov-2019 17:57:40 UTC  Initiating request to open file root://se01.grid.nchc.org.tw//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
[2019-11-06 17:57:40.301526 +0000][Error  ][XRootD            ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_query (code: kXR_Qconfig, arg length: 9): [FATAL] Connection error
[2019-11-06 17:57:40.301604 +0000][Error  ][XRootD            ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Connection error
06-Nov-2019 17:57:40 UTC  Fallback request to file root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
%MSG-w XrdAdaptor:  file_open 06-Nov-2019 17:57:52 UTC pre-events
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
%MSG
06-Nov-2019 17:57:55 UTC  Successfully opened file root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
Begin processing the 1st record. Run 1, Event 1605760, LumiSection 917 on stream 0 at 06-Nov-2019 18:04:18.938 UTC
Begin processing the 2nd record. Run 1, Event 1605768, LumiSection 917 on stream 0 at 06-Nov-2019 18:04:19.542 UTC

If I now do

 edmFileUtil -e root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root

I find that the file exists, but contains neither LuminosityBlock 917 nor event number 1605760.

The only conclusion I can draw is that, when the redirection happened, the system served data from the wrong file. Is it possible that around this time, a file with a similar name was sitting at brunel.ac.uk and the redirection picked up that file?

Dr15Jones commented 5 years ago

@amaltaro Is it possible to get access to the merged file from the job? That job report says the GUID of the file is 3B506A4B-B36A-374A-B87A-CB2FA045AAE4 with an LFN of

<LFN>/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/Merged.root</LFN>

Looking at the internal provenance of the file might give us a clue as to the actual origin of the file which was read during the merge job.

amaltaro commented 5 years ago

Thanks for looking into this, Chris. Yes, I managed to retrieve that file and it's available here: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/vocms0254/

or if you want to go straight to the source, you should also be able to retrieve it via

xrdcp root://cmsxrootd.fnal.gov//store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root .

Dr15Jones commented 5 years ago

Using edmProvDump, I found that the file read during the merge was generated using CMSSW_10_2_16_patch2. It used externalLHEProducer configured as

Module: externalLHEProducer SIM
 PSet id:2496cdda1332a2dab7b18d789f779ff2
 products: {
  LHEEventProduct_externalLHEProducer__SIM.
  LHERunInfoProduct_externalLHEProducer__SIM.
 }
 parameters: {
  @module_edm_type: string tracked  = 'EDProducer'
  @module_label: string tracked  = 'externalLHEProducer'
  @module_type: string tracked  = 'ExternalLHEProducer'
  args: vstring tracked  = {'/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc630/13TeV/madgraph/V5_2.6.0/GluGluToHToTauTau/v1/GluGluToHToTauTauPlusTwoJets_M125_13TeV_amcatnloFXFX_pythia8_slc6_amd64_gcc630_CMSSW_9_3_8_tarball.tar.xz'}
  numberOfParameters: uint32 tracked  = 1
  outputFile: string tracked  = 'cmsgrid_final.lhe'
  scriptName: FileInPath tracked  = V001 GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh 2 /src/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh
 }

It also has generator set to Pythia8HadronizerFilter, configured as

Module: generator SIM
 PSet id:4aeb52791412d998a0497f75dd7dd209
 products: {
  GenEventInfoProduct_generator__SIM.
  GenFilterInfo_generator__SIM.
  GenLumiInfoHeader_generator__SIM.
  GenLumiInfoProduct_generator__SIM.
  GenRunInfoProduct_generator__SIM.
  edmHepMCProduct_generator_unsmeared_SIM.
 }
 parameters: {
  @module_edm_type: string tracked  = 'EDFilter'
  @module_label: string tracked  = 'generator'
  @module_type: string tracked  = 'Pythia8HadronizerFilter'
  comEnergy: double tracked  = 13000
  HepMCFilter: PSet tracked = ({
   filterName: string tracked  = 'EmbeddingHepMCFilter'
   filterParameters: PSet tracked = ({
    BosonPDGID: int32 tracked  = 25
    ElElCut: string tracked  = 'El1.Pt > 22 && El2.Pt > 10 && El1.Eta < 2.6 && El2.Eta < 2.6'
    ElHadCut: string tracked  = 'El.Pt > 22 && Had.Pt > 16 && El.Eta < 2.6 && Had.Eta < 2.7'
    ElMuCut: string tracked  = 'Mu.Pt > 9 && El.Pt > 11 && El.Eta < 2.6 && Mu.Eta < 2.5'
    Final_States: vstring tracked  = {'ElHad','ElMu','HadHad','MuHad'}
    HadHadCut: string tracked  = 'Had1.Pt > 33 && Had2.Pt > 33 && Had1.Eta < 2.5 && Had2.Eta < 2.5'
    MuHadCut: string tracked  = 'Mu.Pt > 19 && Had.Pt > 16 && Mu.Eta < 2.5 && Had.Eta < 2.7'
    MuMuCut: string tracked  = 'Mu1.Pt > 17 && Mu2.Pt > 8 && Mu1.Eta < 2.5 && Mu2.Eta < 2.5'
   })
  })
  PythiaParameters: PSet tracked = ({
   parameterSets: vstring tracked  = {'pythia8CommonSettings','pythia8CP5Settings','pythia8aMCatNLOSettings','pythia8PSweightsSettings','processParameters'}
   processParameters: vstring tracked  = {'JetMatching:setMad = off','JetMatching:scheme = 1','JetMatching:merge = on','JetMatching:jetAlgorithm = 2','JetMatching:etaJetMax = 999.','JetMatching:coneRadius = 1.','JetMatching:slowJetPower = 1','JetMatching:qCut = 30.','JetMatching:doFxFx = on','JetMatching:qCutME = 10.','JetMatching:nQmatch = 5','JetMatching:nJetMax = 2','SLHA:useDecayTable = off','25:m0 = 125.0','25:onMode = off','25:onIfMatch = 15 -15','Higgs:useBSM = on','HiggsH1:parity = 1'}
   pythia8CP5Settings: vstring tracked  = {'Tune:pp 14','Tune:ee 7','MultipartonInteractions:ecmPow=0.03344','PDF:pSet=20','MultipartonInteractions:bProfile=2','MultipartonInteractions:pT0Ref=1.41','MultipartonInteractions:coreRadius=0.7634','MultipartonInteractions:coreFraction=0.63','ColourReconnection:range=5.176','SigmaTotal:zeroAXB=off','SpaceShower:alphaSorder=2','SpaceShower:alphaSvalue=0.118','SigmaProcess:alphaSvalue=0.118','SigmaProcess:alphaSorder=2','MultipartonInteractions:alphaSvalue=0.118','MultipartonInteractions:alphaSorder=2','TimeShower:alphaSorder=2','TimeShower:alphaSvalue=0.118'}
   pythia8CommonSettings: vstring tracked  = {'Tune:preferLHAPDF = 2','Main:timesAllowErrors = 10000','Check:epTolErr = 0.01','Beams:setProductionScalesFromLHEF = off','SLHA:keepSM = on','SLHA:minMassSM = 1000.','ParticleDecays:limitTau0 = on','ParticleDecays:tau0Max = 10','ParticleDecays:allowPhotonRadiation = on'}
   pythia8PSweightsSettings: vstring tracked  = {'UncertaintyBands:doVariations = on','UncertaintyBands:List = {isrRedHi isr:muRfac=0.707,fsrRedHi fsr:muRfac=0.707,isrRedLo isr:muRfac=1.414,fsrRedLo fsr:muRfac=1.414,isrDefHi isr:muRfac=0.5,fsrDefHi fsr:muRfac=0.5,isrDefLo isr:muRfac=2.0,fsrDefLo fsr:muRfac=2.0,isrConHi isr:muRfac=0.25,fsrConHi fsr:muRfac=0.25,isrConLo isr:muRfac=4.0,fsrConLo fsr:muRfac=4.0,fsr_G2GG_muR_dn fsr:G2GG:muRfac=0.5,fsr_G2GG_muR_up fsr:G2GG:muRfac=2.0,fsr_G2QQ_muR_dn fsr:G2QQ:muRfac=0.5,fsr_G2QQ_muR_up fsr:G2QQ:muRfac=2.0,fsr_Q2QG_muR_dn fsr:Q2QG:muRfac=0.5,fsr_Q2QG_muR_up fsr:Q2QG:muRfac=2.0,fsr_X2XG_muR_dn fsr:X2XG:muRfac=0.5,fsr_X2XG_muR_up fsr:X2XG:muRfac=2.0,fsr_G2GG_cNS_dn fsr:G2GG:cNS=-2.0,fsr_G2GG_cNS_up fsr:G2GG:cNS=2.0,fsr_G2QQ_cNS_dn fsr:G2QQ:cNS=-2.0,fsr_G2QQ_cNS_up fsr:G2QQ:cNS=2.0,fsr_Q2QG_cNS_dn fsr:Q2QG:cNS=-2.0,fsr_Q2QG_cNS_up fsr:Q2QG:cNS=2.0,fsr_X2XG_cNS_dn fsr:X2XG:cNS=-2.0,fsr_X2XG_cNS_up fsr:X2XG:cNS=2.0,isr_G2GG_muR_dn isr:G2GG:muRfac=0.5,isr_G2GG_muR_up isr:G2GG:muRfac=2.0,isr_G2QQ_muR_dn isr:G2QQ:muRfac=0.5,isr_G2QQ_muR_up isr:G2QQ:muRfac=2.0,isr_Q2QG_muR_dn isr:Q2QG:muRfac=0.5,isr_Q2QG_muR_up isr:Q2QG:muRfac=2.0,isr_X2XG_muR_dn isr:X2XG:muRfac=0.5,isr_X2XG_muR_up isr:X2XG:muRfac=2.0,isr_G2GG_cNS_dn isr:G2GG:cNS=-2.0,isr_G2GG_cNS_up isr:G2GG:cNS=2.0,isr_G2QQ_cNS_dn isr:G2QQ:cNS=-2.0,isr_G2QQ_cNS_up isr:G2QQ:cNS=2.0,isr_Q2QG_cNS_dn isr:Q2QG:cNS=-2.0,isr_Q2QG_cNS_up isr:Q2QG:cNS=2.0,isr_X2XG_cNS_dn isr:X2XG:cNS=-2.0,isr_X2XG_cNS_up isr:X2XG:cNS=2.0}','UncertaintyBands:nFlavQ = 4','UncertaintyBands:MPIshowers = on','UncertaintyBands:overSampleFSR = 10.0','UncertaintyBands:overSampleISR = 10.0','UncertaintyBands:FSRpTmin2Fac = 20','UncertaintyBands:ISRpTmin2Fac = 1'}
   pythia8aMCatNLOSettings: vstring tracked  = {'SpaceShower:pTmaxMatch = 1','SpaceShower:pTmaxFudge = 1','SpaceShower:MEcorrections = off','TimeShower:pTmaxMatch = 1','TimeShower:pTmaxFudge = 1','TimeShower:MEcorrections = off','TimeShower:globalRecoil = on','TimeShower:limitPTmaxGlobal = on','TimeShower:nMaxGlobalRecoil = 1','TimeShower:globalRecoilMode = 2','TimeShower:nMaxGlobalBranch = 1','TimeShower:weightGluonToQuark = 1'}
  })
 }

The file that was supposed to be read was created in CMSSW_10_6_4 and has an externalLHEProducer configured as

Module: externalLHEProducer GEN
 PSet id:3de64faceacf866730598c905d987c98
 products: {
  LHEEventProduct_externalLHEProducer__GEN.
  LHERunInfoProduct_externalLHEProducer__GEN.
 }
 parameters: {
  @module_edm_type: string tracked  = 'EDProducer'
  @module_label: string tracked  = 'externalLHEProducer'
  @module_type: string tracked  = 'ExternalLHEProducer'
  args: vstring tracked  = {'/cvmfs/cms.cern.ch/phys_generator/gridpacks/UL/13TeV/madgraph/V5_2.6.1/QCD_HT_Binned/QCD_HT100to200/v1/QCD_HT100to200_MG261_NNPDF31_slc7_amd64_gcc700_CMSSW_10_6_0_tarball.tar.xz'}
  numberOfParameters: uint32 tracked  = 1
  outputFile: string tracked  = 'cmsgrid_final.lhe'
  scriptName: FileInPath tracked  = V001 GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh 2 /src/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh
 }
 dependencies: {
 }

and a generator defined as

Module: generator GEN
 PSet id:d1f5dc9fcf2ebcb5c9017bef5836a70a
 products: {
  GenEventInfoProduct_generator__GEN.
  GenLumiInfoHeader_generator__GEN.
  GenLumiInfoProduct_generator__GEN.
  GenRunInfoProduct_generator__GEN.
  edmHepMCProduct_generator_unsmeared_GEN.
 }
 parameters: {
  @module_edm_type: string tracked  = 'EDFilter'
  @module_label: string tracked  = 'generator'
  @module_type: string tracked  = 'Pythia8HadronizerFilter'
  comEnergy: double tracked  = 13000
  PythiaParameters: PSet tracked = ({
   parameterSets: vstring tracked  = {'pythia8CommonSettings','pythia8CP5Settings','pythia8PSweightsSettings','processParameters'}
   processParameters: vstring tracked  = {'JetMatching:setMad = off','JetMatching:scheme = 1','JetMatching:merge = on','JetMatching:jetAlgorithm = 2','JetMatching:etaJetMax = 5.','JetMatching:coneRadius = 1.','JetMatching:slowJetPower = 1','JetMatching:qCut = 14.','JetMatching:nQmatch = 5','JetMatching:nJetMax = 4','JetMatching:doShowerKt = off'}
   pythia8CP5Settings: vstring tracked  = {'Tune:pp 14','Tune:ee 7','MultipartonInteractions:ecmPow=0.03344','MultipartonInteractions:bProfile=2','MultipartonInteractions:pT0Ref=1.41','MultipartonInteractions:coreRadius=0.7634','MultipartonInteractions:coreFraction=0.63','ColourReconnection:range=5.176','SigmaTotal:zeroAXB=off','SpaceShower:alphaSorder=2','SpaceShower:alphaSvalue=0.118','SigmaProcess:alphaSvalue=0.118','SigmaProcess:alphaSorder=2','MultipartonInteractions:alphaSvalue=0.118','MultipartonInteractions:alphaSorder=2','TimeShower:alphaSorder=2','TimeShower:alphaSvalue=0.118','SigmaTotal:mode = 0','SigmaTotal:sigmaEl = 21.89','SigmaTotal:sigmaTot = 100.309','PDF:pSet=LHAPDF6:NNPDF31_nnlo_as_0118'}
   pythia8CommonSettings: vstring tracked  = {'Tune:preferLHAPDF = 2','Main:timesAllowErrors = 10000','Check:epTolErr = 0.01','Beams:setProductionScalesFromLHEF = off','SLHA:keepSM = on','SLHA:minMassSM = 1000.','ParticleDecays:limitTau0 = on','ParticleDecays:tau0Max = 10','ParticleDecays:allowPhotonRadiation = on'}
   pythia8PSweightsSettings: vstring tracked  = {'UncertaintyBands:doVariations = on','UncertaintyBands:List = {isrRedHi isr:muRfac=0.707,fsrRedHi fsr:muRfac=0.707,isrRedLo isr:muRfac=1.414,fsrRedLo fsr:muRfac=1.414,isrDefHi isr:muRfac=0.5,fsrDefHi fsr:muRfac=0.5,isrDefLo isr:muRfac=2.0,fsrDefLo fsr:muRfac=2.0,isrConHi isr:muRfac=0.25,fsrConHi fsr:muRfac=0.25,isrConLo isr:muRfac=4.0,fsrConLo fsr:muRfac=4.0,fsr_G2GG_muR_dn fsr:G2GG:muRfac=0.5,fsr_G2GG_muR_up fsr:G2GG:muRfac=2.0,fsr_G2QQ_muR_dn fsr:G2QQ:muRfac=0.5,fsr_G2QQ_muR_up fsr:G2QQ:muRfac=2.0,fsr_Q2QG_muR_dn fsr:Q2QG:muRfac=0.5,fsr_Q2QG_muR_up fsr:Q2QG:muRfac=2.0,fsr_X2XG_muR_dn fsr:X2XG:muRfac=0.5,fsr_X2XG_muR_up fsr:X2XG:muRfac=2.0,fsr_G2GG_cNS_dn fsr:G2GG:cNS=-2.0,fsr_G2GG_cNS_up fsr:G2GG:cNS=2.0,fsr_G2QQ_cNS_dn fsr:G2QQ:cNS=-2.0,fsr_G2QQ_cNS_up fsr:G2QQ:cNS=2.0,fsr_Q2QG_cNS_dn fsr:Q2QG:cNS=-2.0,fsr_Q2QG_cNS_up fsr:Q2QG:cNS=2.0,fsr_X2XG_cNS_dn fsr:X2XG:cNS=-2.0,fsr_X2XG_cNS_up fsr:X2XG:cNS=2.0,isr_G2GG_muR_dn isr:G2GG:muRfac=0.5,isr_G2GG_muR_up isr:G2GG:muRfac=2.0,isr_G2QQ_muR_dn isr:G2QQ:muRfac=0.5,isr_G2QQ_muR_up isr:G2QQ:muRfac=2.0,isr_Q2QG_muR_dn isr:Q2QG:muRfac=0.5,isr_Q2QG_muR_up isr:Q2QG:muRfac=2.0,isr_X2XG_muR_dn isr:X2XG:muRfac=0.5,isr_X2XG_muR_up isr:X2XG:muRfac=2.0,isr_G2GG_cNS_dn isr:G2GG:cNS=-2.0,isr_G2GG_cNS_up isr:G2GG:cNS=2.0,isr_G2QQ_cNS_dn isr:G2QQ:cNS=-2.0,isr_G2QQ_cNS_up isr:G2QQ:cNS=2.0,isr_Q2QG_cNS_dn isr:Q2QG:cNS=-2.0,isr_Q2QG_cNS_up isr:Q2QG:cNS=2.0,isr_X2XG_cNS_dn isr:X2XG:cNS=-2.0,isr_X2XG_cNS_up isr:X2XG:cNS=2.0}','UncertaintyBands:nFlavQ = 4','UncertaintyBands:MPIshowers = on','UncertaintyBands:overSampleFSR = 10.0','UncertaintyBands:overSampleISR = 10.0','UncertaintyBands:FSRpTmin2Fac = 20','UncertaintyBands:ISRpTmin2Fac = 1'}
  })
 }
 dependencies: {
  LHEEventProduct_externalLHEProducer__GEN.
 }

Clearly, the merge did NOT read the correct file. However, the file was chosen by AAA, not by the framework.

bbockelm commented 5 years ago

Well that's horrifying. AAA basically takes the site at its word that it is serving up the correct file. Can we involve the Brunel sysadmins?

amaltaro commented 5 years ago

From what Chris posted above (from the logs too):

Data is served from brunel.ac.uk instead of original site T2_IT_Pisa

how come it says the original site is T2_IT_Pisa when the job was running at T2_TW_NCHC (thus, unmerged files are meant to be available there as well)?

It just came to my mind - it's very very unlikely but not impossible - that there could have been another job that ran at Brunel, staged that same file name out but failed at a later stage. If the cleanup job didn't manage to clean that file up, then well, it can be served through xrootd...

amaltaro commented 5 years ago

It just came to my mind - it's very very unlikely but not impossible - that there could have been another job that ran at Brunel, staged that same file name out but failed at a later stage. If the cleanup job didn't manage to clean that file up, then well, it can be served through xrootd...

It looks like it didn't happen. The job responsible for producing that file with lumi 917 (in vocms0253) succeeded in its first attempt. So there is no way a job could have left a sick file behind!

Dr15Jones commented 5 years ago

@amaltaro wrote

The job responsible for producing that file with lumi 917 (in vocms0253) succeeded in its first attempt.

Did you mean lumi 917 for this workflow? As far as I can tell, the file read containing lumi 917 was NOT from this workflow since its provenance is completely different (not even the same CMSSW release). My suggestion would be to try to find which file was actually read based on the configuration of the ExternalLHEProducer and Pythia8.

amaltaro commented 5 years ago

Yes, there were 2 files with the lumi section 917. One from vocms0253 (the correct/valid one), the other from vocms0254 (which has already been invalidated).

Yes, I made a ggus ticket involving the Brunel site admins: https://ggus.eu/index.php?mode=ticket_info&ticket_id=144246

let's see if we can track down where that file comes from and why it got served under that path.

At least we can conclude there is nothing wrong with WMCore/WMAgent here.

vlimant commented 5 years ago

The conclusions so far are a little scary. If one needs more examples, just look under https://cms-unified.web.cern.ch/cms-unified/assistance.html#assistance-filemismatch; all of those have had recent file invalidations because of duplicates.

vlimant commented 5 years ago

@Dr15Jones are you sure it's CMSSW_10_2_16_patch2 and not CMSSW_10_2_16_patch1? I found https://cms-pdmv.cern.ch/mcm/requests?page=0&dataset_name=GluGluToHToTauTau*&cmssw_release=CMSSW_10_2_16_patch1 but nothing for patch2. The path /cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc630/13TeV/madgraph/V5_2.6.0/GluGluToHToTauTau/v1/GluGluToHToTauTauPlusTwoJets_M125_13TeV_amcatnloFXFX_pythia8_slc6_amd64_gcc630_CMSSW_9_3_8_tarball.tar.xz does not seem to follow the agreed naming convention. It almost looks like the file that was served to the merge job was NOT from central production ...

Dr15Jones commented 5 years ago

@vlimant Doing

edmProvDump -t root://cmsxrootd.fnal.gov//store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root

gives

-------------------------
Processing History:
  SIM '' '"CMSSW_10_2_16_patch2"' [1]  (17f0e17ab95b7fbef8bfd448939f3e9d)
---------Producers with data in file---------

amaltaro commented 5 years ago

I'd gladly hand the rest of this debugging (xrootd/site related, file chasing) to someone else... In any case, here are the GGUS tickets created for Pisa and for NCHC, hoping we can get some useful information:
https://ggus.eu/?mode=ticket_info&ticket_id=144296&come_from=submit
https://ggus.eu/?mode=ticket_info&ticket_id=144297&come_from=submit

I still find it hard to debug when I can't even make sense of this log:

Data is served from brunel.ac.uk instead of original site T2_IT_Pisa

which to me seems pretty misleading! What does original site mean here? How does XRootD have this information? Is brunel.ac.uk the storage that the data comes from? Or is it another component of the XRootD ecosystem (a redirector or something like that)?

nsmith- commented 5 years ago

From the FrameworkJobReport.xml of 6420a2d1-6f71-4aea-bbb7-955e67a1db05-0-2-logArchive.tar.gz you can see the GUID of the opened file, and already it doesn't match the LFN (which I think it should based on looking at other FrameworkJobReport examples):

<InputFile>
<State  Value="closed"/>
<LFN>/store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root</LFN>
<PFN>root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027</GUID>

So I went and searched for B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027, and found it in the WMArchive records on hadoop! There was a production job running at T2_UK_London_Brunel that produced an output file with PFN davs://dc2-grid-64.brunel.ac.uk/dpm/brunel.ac.uk/home/cms/store/unmerged/RunIIFall18wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCP5_13TeV-mcatnloFXFX-pythia8/GEN-SIM/102X_upgrade2018_realistic_v11-v1/230000/B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.root. It has one lumi section, 917, and from what it looks like, all the metadata matches up with the provenance that Chris found. Stage-out finished at 2019-11-05 01:43:10 (not sure of the timezone). For comparison, the job in Taiwan started reading the file at 06-Nov-2019 17:57:55 UTC. Later, there was a reconstruction job that also ran at Brunel and successfully read in this file using PFN root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/unmerged/RunIIFall18wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCP5_13TeV-mcatnloFXFX-pythia8/GEN-SIM/102X_upgrade2018_realistic_v11-v1/230000/B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.root. I put both records here: http://ncsmith.web.cern.ch/ncsmith/who_made_B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.json
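In code, that check boils down to comparing the GUID element against the LFN file-name stem; a minimal sketch (element names taken from the FrameworkJobReport snippet above; an illustration, not the actual script used):

import os
import xml.etree.ElementTree as ET

def guid_mismatches(fjr_path):
    """Yield (lfn, guid) for input files whose LFN stem != recorded GUID."""
    for inp in ET.parse(fjr_path).getroot().iter("InputFile"):
        lfn = (inp.findtext("LFN") or "").strip()
        guid = (inp.findtext("GUID") or "").strip()
        # production file names are expected to be "<GUID>.root"
        stem = os.path.splitext(os.path.basename(lfn))[0]
        if lfn and guid and stem != guid:
            yield lfn, guid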

nsmith- commented 4 years ago

I think the next step is to understand with Brunel and xrootd-cms.infn.it (Pisa?) whether there are logs of a request for B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027 or 81CB9803-B71C-464B-AD3F-5B53605EF41E to see where the filename-file mismatch happened.

bbockelm commented 4 years ago

@amaltaro - to decode this message:

Data is served from brunel.ac.uk instead of original site T2_IT_Pisa

This means that CMSSW tried to open the file at xrootd-cms.infn.it (the European redirector) which identified itself as named T2_IT_Pisa. Then it was redirected to a server at brunel.ac.uk (which must not have a site name set; DNS domain name is used in such a case).

I don't think there's much use in asking NCHC what's wrong. Either DPM at Brunel or the cmsd at the Pisa redirector failed massively here.

bbockelm commented 4 years ago

@vlimant - there are some interesting other files in the Unified link you sent

particularly, from here:

https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_task_HIG-RunIISummer19UL18wmLHEGEN-00233__v1_T_191030_134634_8195

can you pull the logs corresponding to HIG-RunIISummer19UL18wmLHEGEN-00233_0MergeRAWSIMoutput:8023?

bbockelm commented 4 years ago

Particularly, that job caught my eye because its error says the product named int_addPileupInfo_bunchSpacing_HLT. is in file /store/unmerged/RunIISummer19UL18wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/240000/468512E6-88D8-5C44-8A7E-344184CC62F3.root.

Why are various HLT, RECO, and SIM products in a GEN file?!?

vlimant commented 4 years ago

there are no further logs handy for the 8023 error code indeed; one has to look in WMArchive

bbockelm commented 4 years ago

@nsmith- is making progress on a script for automatically detecting these. In November, he found 2,000 files that were incorrectly delivered; 100 resulted in successful job output.

One successfully merged file is /store/mc/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30000/0A4DEA82-E369-2E4A-902B-E762A4B3BACA.root -- which still exists in Xrootd if anyone wants to take a look.

bbockelm commented 4 years ago

(I should note that the incorrect, unmerged file in the case above came from Brunel as well)

nsmith- commented 4 years ago

This is all performed on the November FrameworkJobReport logs in hadoop, /cms/wmarchive/avro/fwjr/2019/11/*/*.avro. Since the WMArchive on hadoop also records the GUID read from the input file upon opening, I wrote something to compare that field to the value extracted from PFN/LFN. At first there were many mismatches, but a large class of them turned out to be a bug in the way cmssw handles DQM files, see https://github.com/cms-sw/cmssw/issues/28534

So after removing those, we are left with about 3000 files, where the PFN always starts with the root protocol (no local files, etc.)

prefix
root://cms-xrootd.gridpp.ac.uk/                                 83
root://cms-xrootd.gridpp.ac.uk:1094/                            25
root://cmsxrootd.fnal.gov/                                     384
root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms      54
root://heplnv139.pp.rl.ac.uk:1094/                             131
root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms          1
root://node12.datagrid.cea.fr/                                  26
root://polgrid4.in2p3.fr/                                       37
root://se01.indiacms.res.in/                                     1
root://xrootd-cms.infn.it/                                    1712
root://xrootd-cms.infn.it:1194/                                307
root://xrootd-es.pic.es:1096/                                  151
root://xrootd.t2.ucsd.edu:2040/                                 12
Name: pfn, dtype: int64

There is some probability that the same collection of job reports also includes the record of producing the input file which was actually read by the process. This can be found by searching for the output guid. So I did an inner join on that, still only considering November, and found about 600 entries. Now we know what site produced the file. If we filter out any files that are not /store/unmerged, we can be sure that the site is also the only possible site the file could be read from. This leaves:

fwjr_site
T2_FR_GRIF_IRFU          2
T2_FR_GRIF_LLR           4
T2_PL_Swierk             9
T2_UK_London_Brunel    670
T2_UK_SGrid_RALPP        1
Name: pfn, dtype: int64
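Sketched in pandas, the inner join described above looks roughly like this (the real analysis ran as Spark queries over the avro records, and the column names here are illustrative rather than the actual WMArchive schema):

import pandas as pd

# reads: one row per input file opened by a job; writes: one row per output file
reads = pd.DataFrame(columns=["pfn", "lfn", "guid_read"])
writes = pd.DataFrame(columns=["out_guid", "out_lfn", "fwjr_site"])

# a mismatch: the GUID actually read differs from the one encoded in the LFN
reads["guid_expected"] = reads["lfn"].str.extract(
    r"([0-9A-Fa-f-]{36})\.root$", expand=False)
mismatches = reads[reads["guid_read"] != reads["guid_expected"]]

# join to the report that PRODUCED the file we actually read: that job's
# site is the one that hosted (and wrongly served) it
joined = mismatches.merge(writes, left_on="guid_read", right_on="out_guid")

# only /store/unmerged outputs pin the serving site unambiguously
unmerged = joined[joined["out_lfn"].str.startswith("/store/unmerged/")]
print(unmerged.groupby("fwjr_site")["pfn"].count())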

I've inspected a few cases in detail. The one Brian mentions above appears quite similar to the case we were originally analyzing: the job at Nebraska appeared to open a file from Brunel that should have been at Nebraska:

11-Nov-2019 23:42:16 UTC  Initiating request to open file /mnt/hadoop/user/uscms01/pnfs/unl.edu/data4/cms/store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root
11-Nov-2019 23:42:16 UTC  Fallback request to file root://cmsxrootd.fnal.gov//store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root
%MSG-w XrdAdaptor:  file_open 11-Nov-2019 23:42:26 UTC pre-events
Data is served from brunel.ac.uk instead of original site US-Regional-Nebraska
%MSG
11-Nov-2019 23:42:39 UTC  Successfully opened file root://cmsxrootd.fnal.gov//store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root

and in the FrameworkJobReport we see the GUID mismatch:

<InputFile>
<State  Value="closed"/>
<LFN>/store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root</LFN>
<PFN>root://cmsxrootd.fnal.gov//store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>1088D615-B304-EA11-8B83-C45444922B77</GUID>
<Branches>
  <Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<InputType>primaryFiles</InputType>
<InputSourceClass>PoolSource</InputSourceClass>
<EventsRead>1524</EventsRead>
<Runs>
<Run ID="1">
   <LumiSection ID="194539"/>
</Run>

</Runs>
</InputFile>

The file with that GUID is

davs://dc2-grid-64.brunel.ac.uk/dpm/brunel.ac.uk/home/cms/store/unmerged/RunIISummer15wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCUETP8M1_13TeV-mcatnloFXFX-pythia8/LHE/MCRUN2_71_V1-v1/60038/1088D615-B304-EA11-8B83-C45444922B77.root

produced by a job at Brunel. Logs for the Nebraska job can be found in 06aaba88-074f-4f18-bd60-3984faff4f38-30-0-logArchive.tar.gz inside the LogArchive at

root://eoscms.cern.ch//store/logs/prod/2019/11/WMAgent/pdmvserv_task_HIG-RunIISummer19UL18wmLHEGEN-00231__v1_T_191030_054827_8689/pdmvserv_task_HIG-RunIISummer19UL18wmLHEGEN-00231__v1_T_191030_054827_8689-HIG-RunIISummer19UL18wmLHEGEN-00231_0RAWSIMoutputMergeLogCollect-rice-a461-11-logs.tar

A more interesting case is a job at T2_UK_SGrid_RALPP that attempted to read a file but instead ended up opening a file that it had opened moments earlier while running:

26-Nov-2019 13:17:57 UTC  Initiating request to open file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root
[2019-11-26 13:20:58.760667 +0000][Error  ][XRootD            ] [[2001:630:58:1c20::82f6:2fb6]:50925] Unable to get the response to request kXR_read (handle: 0x00000000, offset: 0, size: 300)
[2019-11-26 13:21:10.060894 +0000][Error  ][File              ] [0xcee0a980@root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root] Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 0, size: 300) returned with [ERROR] Operation expired
%MSG-w XrdAdaptorInternal:  (NoModuleName) 26-Nov-2019 13:21:10 UTC pre-events
XrdRequestManager::handle(name='root://mover.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root) failure when reading from [2001:630:58:1c20::82f6:2fb6]:50925 (unknown site); failed with error '[ERROR] Operation expired' (errno=0, code=206).
%MSG
%MSG-w XrdAdaptorInternal:  (NoModuleName) 26-Nov-2019 13:21:10 UTC pre-events
Request failure when reading from [2001:630:58:1c20::82f6:2fb6]:50925 (unknown site)
%MSG
%MSG-w XrdAdaptor:  (NoModuleName) 26-Nov-2019 13:21:10 UTC pre-events
Data is now served from  instead of previous [2001
%MSG
[2019-11-26 13:21:43.201866 +0000][Error  ][AsyncSock         ] [[2001:630:58:1c20::82f6:2fb6]:50925 #0.0] Socket error while handshaking: [ERROR] Socket timeout
[2019-11-26 13:22:28.396188 +0000][Error  ][AsyncSock         ] [[2001:630:58:1c20::82f6:2fb6]:50925 #0.0] Socket error while handshaking: [ERROR] Socket timeout
[2019-11-26 13:22:28.396431 +0000][Error  ][PostMaster        ] [[2001:630:58:1c20::82f6:2fb6]:50925 #0] Unable to recover: [ERROR] Socket timeout.
[2019-11-26 13:22:28.396502 +0000][Error  ][XRootD            ] [[2001:630:58:1c20::82f6:2fb6]:50925] Impossible to send message kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=807b3cf2-4af3-4d3d-aebf-8937ae044c84&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e, mode: 0660, flags: kXR_open_read ). Trying to recover.
[2019-11-26 13:22:28.396651 +0000][Error  ][XRootD            ] [[2001:630:58:1c20::82f6:2fb6]:50925] Impossible to send message kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=fcba3e95-e929-425b-a5fe-d5f99b8fd810&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e&tried=[2001, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ). Trying to recover.
%MSG-w XrdAdaptorInternal:  (NoModuleName) 26-Nov-2019 13:24:20 UTC pre-events
Caught a CMSSW exception when running connection recovery.
%MSG
26-Nov-2019 13:24:20 UTC  Fallback request to file root://cms-xrootd.gridpp.ac.uk:1094//store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root
[2019-11-26 13:24:21.065115 +0000][Error  ][XRootD            ] [mover.pp.rl.ac.uk:1094] Unable to get the response to request kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=fcba3e95-e929-425b-a5fe-d5f99b8fd810&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e&tried=[2001,[2001:630:58:1c20::82f6:2fb6], mode: 0660, flags: kXR_open_read kXR_async kXR_retstat )
[2019-11-26 13:24:21.065739 +0000][Error  ][XRootD            ] [mover.pp.rl.ac.uk:1094] Unable to get the response to request kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=807b3cf2-4af3-4d3d-aebf-8937ae044c84&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e&tried=[2001:630:58:1c20::82f6:2fb6], mode: 0660, flags: kXR_open_read )
%MSG-w XrdAdaptor:  file_open 26-Nov-2019 13:24:33 UTC PostProcessEvent
Data is served from T2_UK_SGrid_RALPP instead of original site T2_UK_London_IC
%MSG
26-Nov-2019 13:24:33 UTC  Successfully opened file root://cms-xrootd.gridpp.ac.uk:1094//store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root
%MSG-w MemoryCheck:  source 26-Nov-2019 13:24:36 UTC PostBeginLumi
MemoryCheck: module source:source VSIZE 3742.29 0.5 RSS 2086.21 -314.145
%MSG
Begin processing the 579th record. Run 1, Event 2069622346, LumiSection 731501 at 26-Nov-2019 13:24:36.252 UTC
... snip ...
26-Nov-2019 13:28:39 UTC  Initiating request to open file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/B2F7DE91-6C0F-EA11-94D4-002590DE3AC0.root
26-Nov-2019 13:28:40 UTC  Successfully opened file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/B2F7DE91-6C0F-EA11-94D4-002590DE3AC0.root
%MSG-w DuplicateEvent:  PostBeginLumi 26-Nov-2019 13:28:40 UTC PostBeginLumi
Duplicate Events found in entire set of input files.
Both events were from run 1 and luminosity block 731501 with event number 2069622346.
The duplicate was from file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/B2F7DE91-6C0F-EA11-94D4-002590DE3AC0.root.
The duplicate will be skipped.

%MSG

Logs for this job can be found in 076c36c0-9798-4ea5-a711-7836a5c5e6df-6-0-logArchive.tar.gz inside the LogArchive at

root://eoscms.cern.ch//store/logs/prod/2019/12/WMAgent/pdmvserv_task_SMP-RunIISummer15GS-00275__v1_T_191009_191932_9218/pdmvserv_task_SMP-RunIISummer15GS-00275__v1_T_191009_191932_9218-LogCollectForSMP-RunIISummer16DR80-00068_0-heplnc131-12-logs.tar

nsmith- commented 4 years ago

There are very few cases involving data in this timeframe, but one is rather interesting and looks like possibly a different root cause. This RAW data file has a different UUID than its LFN:

$ edmFileUtil -u root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root
root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root
root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root (1 runs, 2 lumis, 5487 events, 2937042356 bytes, 327FE446-F9B7-E811-9C31-FA163E5394F5 uuid)

vlimant commented 4 years ago

In November, he's found 2,000 files that were incorrectly delivered; 100 resulted in successful job output.

Does this mean that some samples have events from a different physics process than expected? Looks to me like these need to be identified, to form a list that PPD has to resubmit.

bbockelm commented 4 years ago

does this mean that some samples have events from different physics process than expected

Yes, that's precisely my concern. @nsmith- is there a way to get the list of affected datasets (including parentage?)

nsmith- commented 4 years ago

Yeah, with enough Spark queries I should be able to make a list; working on it.

nsmith- commented 4 years ago

Ok, I expanded the scope to consider all production job reports from 2018 and 2019. Here is a plot of the frequency of such GUID mismatches per week:

[plot: frequency of GUID mismatches per week]

The unknown component contains both the cases where: the GUID corresponding to the file that was actually opened could not be found in the same records (i.e. a non-production file such as user data, or a file created before 2018); and where the file was found but was a file in the catalog (i.e. /store/unmerged/ was not found in the LFN), so that there is no definitive way to know which site actually hosted and served the file at that point, since it could have been replicated by PhEDEx.

I compiled a list of all successful jobs that read at least one event from an incorrectly opened file and produced output, following unmerged output to the first occurrence of an output file in the catalog. They can be found in input-guid-mismatch-cataloged-output.txt. These files, and their children (if any) should be invalidated and recreated if need be.
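Following unmerged output to the first cataloged file amounts to a walk over the output-parentage graph; roughly like this sketch, where the children mapping (LFN to output LFNs) is hypothetical and would be built from the job reports:

def first_cataloged_descendants(start_lfn, children):
    """Walk from an unmerged file down to its first cataloged descendants."""
    found, stack = set(), [start_lfn]
    while stack:
        lfn = stack.pop()
        if lfn.startswith("/store/unmerged/"):
            stack.extend(children.get(lfn, []))  # keep descending
        else:
            found.add(lfn)  # reached a cataloged file: stop here
    return found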

I looked a bit to see if there was any correlation between when the files were created and accessed, but I think little to no conclusion can be drawn as one mostly sees an exponential tail from the lifetime of unmerged files, on top of a background of access to permanent, cataloged files. There may be some value in also checking CRAB jobreports for the same issue, so I might try that next.

P.S. some highlights include this MC file which is actually data: https://cmsweb.cern.ch/das/request?instance=prod/global&input=run+file%3D%2Fstore%2Fmc%2FRunIISummer19UL17wmLHEGEN%2FST_t-channel_top_4f_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8%2FLHE%2F106X_mc2017_realistic_v6-v2%2F310000%2F9C9CEF31-6407-0E43-B541-ADB37CE714D3.root and this data file which is actually MC: https://cmsweb.cern.ch/das/request?instance=prod/global&input=run+file%3D%2Fstore%2Fdata%2FRun2018B%2FParkingBPH2%2FMINIAOD%2F05May2019-v2%2F30001%2F8DD2A46D-6ABE-6348-8F3B-52C0AD1A55EE.root

bbockelm commented 4 years ago

@nsmith- Are we ready to invalidate data/datasets? Can we build up a list and send it to comp-ops for review?

nsmith- commented 4 years ago

Pradeep and I are working on it, just specific files. Once invalidated, we report to PdMV and they can decide if anything needs reproducing.

Dr15Jones commented 4 years ago

I compiled a list of all successful jobs that read at least one event from an incorrectly opened file and produced output, following unmerged output to the first occurrence of an output file in the catalog. They can be found in input-guid-mismatch-cataloged-output.txt. These files, and their children (if any) should be invalidated and recreated if need be.

I took a look at one of the files on the list /store/data/Run2018B/ParkingBPH2/MINIAOD/05May2019-v2/30001/8DD2A46D-6ABE-6348-8F3B-52C0AD1A55EE.root

Doing

edmProvDump /store/data/Run2018B/ParkingBPH2/MINIAOD/05May2019-v2/30001/8DD2A46D-6ABE-6348-8F3B-52C0AD1A55EE.root

didn't show anything amiss with the histories of the files used when creating this file from a merge. What inconsistency did you find for this file?

makortel commented 4 years ago

One inconsistency is that the path/dataset claims the file to be data, but edmProvDump shows that the file is clearly from simulation:

Processing History: 
  SIM '' '"CMSSW_9_3_6"' [1]  (785c63f5fd8d7bdccf2c9db5c4a14fea)
    HLT '' '"CMSSW_9_4_0_patch1"' [1]  (9d8f590933956796cd3e18bfb7642a20)
      RECO '' '"CMSSW_9_4_0_patch1"' [1]  (351414bbda2cdc38d49da1680cef2a3f)

nsmith- commented 4 years ago

[plot] The rate has gone down significantly this month, but is still nonzero.

nsmith- commented 4 years ago

The CMSSW exit codes most likely to occur together with the GUID mismatch (# errors with GUID mismatch / # errors total) are:

                   rate
stepErrorCode          
70452.0        0.178694
8026.0         0.138889
8009.0         0.093502
87.0           0.040577
8023.0         0.035229
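That rate is, per exit code, the fraction of failures that also carried a GUID mismatch; in pandas terms roughly the following (illustrative rows, with a hypothetical errors frame carrying a boolean guid_mismatch column):

import pandas as pd

# one row per failed step: its exit code and whether the job also had
# an input GUID mismatch (values here are made up for illustration)
errors = pd.DataFrame({
    "stepErrorCode": [8023.0, 8023.0, 70452.0],
    "guid_mismatch": [True, False, True],
})

# mean of a boolean column = (# errors with mismatch) / (# errors total)
rate = (errors.groupby("stepErrorCode")["guid_mismatch"]
              .mean().sort_values(ascending=False))
print(rate)
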
makortel commented 4 years ago

No 8034 (the specific exit code for the mismatch) in there yet?

nsmith- commented 4 years ago

So it seems. I believe it won't be in production until https://github.com/dmwm/WMCore/issues/9468 is implemented

vlimant commented 4 years ago

One new such workflow (among others):

https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer15wmLHEGS-02719

vlimant commented 4 years ago

With the potential of screwing up so many signal samples in analysis, this should have a little higher priority (higher than high priority), if you ask me.