Adding some extra information...
According to unified logs, these were the 2 files invalidated (and their respective lumi sections):
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root
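For anyone who wants to repeat this check, here is a minimal sketch of pulling those lumi lists from DBSReader. It assumes a grid proxy in $X509_USER_PROXY, the standard CA directory on lxplus-like hosts, and the usual DBS3 JSON field names; treat it as illustrative rather than the exact commands used here.
import json, os, ssl, urllib.parse, urllib.request

# Grid proxy (cert+key in one PEM) and the usual grid CA directory -- both assumptions.
proxy = os.environ["X509_USER_PROXY"]
ctx = ssl.create_default_context(capath="/etc/grid-security/certificates")
ctx.load_cert_chain(proxy, proxy)

base = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name="
lfns = [
    "/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root",
    "/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root",
]
for lfn in lfns:
    url = base + urllib.parse.quote(lfn, safe="/")
    with urllib.request.urlopen(url, context=ctx) as resp:
        # each record is expected to carry "run_num" and "lumi_section_num"
        lumis = sorted({l for rec in json.load(resp) for l in rec["lumi_section_num"]})
    print(lfn.rsplit("/", 1)[-1], lumis)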
wmstats claims that this workflow was processed by the following 3 agents (again, no ACDCs or other workflows writing to the same GEN dataset): agent url: vocms0252.cern.ch, vocms0254.cern.ch, vocms0253.cern.ch
Checking the dbsbuffer tables on those 3 agents, querying for any files matching that LFN pattern that produced any of those 7 lumi sections:
select df.lfn, dfrm.* from dbsbuffer_file df
INNER JOIN dbsbuffer_file_runlumi_map dfrm ON df.id = dfrm.filename
WHERE df.lfn like '/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/%'
AND dfrm.lumi in (3226, 3231, 351, 249, 1381, 3199, 917) ORDER BY df.lfn, dfrm.lumi;
the result is:
vocms0252:
LFN | FILEID | RUN | LUMI | NUM_EVENTS
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root 500282 1 249 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root 500282 1 351 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root 500282 1 1381 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root 500282 1 3199 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root 500282 1 3226 300
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root 500282 1 3231 300
vocms0253:
LFN | FILEID | RUN | LUMI | NUM_EVENTS
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/239E7C08-053C-3544-AD9E-DF30AB97E5C4.root 514778 1 917 96
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/567BF783-3D2D-CD46-A2B5-83821C2C5B89.root 517550 1 249 96
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/5DC32B3F-F244-D540-B640-0508B20908FB.root 514183 1 1381 97
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/97972246-0E36-5E43-9F52-4368FFD47DE5.root 513983 1 351 109
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/AAB0B18B-BAE0-4243-8B2F-1834280C925D.root 514764 1 3226 97
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/AAB0B18B-BAE0-4243-8B2F-1834280C925D.root 514764 1 3231 98
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/30000/EA04CEEB-C101-3E41-967D-2A35A81E6D68.root 514262 1 3199 101
vocms0254:
LFN | FILEID | RUN | LUMI | NUM_EVENTS
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root 428389 1 917 118
In short, vocms0253 produced all 7 lumis, vocms0254 produced 1 lumi, and vocms0252 produced 6 lumis.
The next step is to investigate what the input WQE lumi masks were and which jobs produced those lumis.
Ok, now checking what the input file (MCFakeFile) was and its lumi mask, with this query:
select wfd.lfn, wfrm.* from wmbs_file_details wfd
INNER JOIN wmbs_file_runlumi_map wfrm ON wfd.id = wfrm.fileid
where wfd.lfn like 'MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301%'
AND wfrm.lumi IN (3226, 3231, 351, 249, 1381, 3199, 917) ORDER by wfrm.lumi;
and surprise! There are no results on vocms0252 or vocms0254. Only vocms0253 was supposed to have processed those lumis (all from the same MCFakeFile, thus the same WQE):
LFN | FILEID | RUN | LUMI | NUM_EVENTS
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 249
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 351
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 917
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 1381
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 3199
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 3226
MCFakeFile-pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301-HIG-RunIISummer19UL17wmLHEGEN-00493_0-3e0c2af7e0fa0fd3001e29603c311ba8 4790013 1 3231
Okay! Now that I managed to resume investigating it... If I hand-pick one of those duplicate lumi sections, e.g. either 249 or 917, and check which jobs had those lumi sections assigned, I cannot find any results except on vocms0253, which is the node that really acquired the WQEs including those lumi sections.
So running a query like
select wjm.job, wjm.firstrun, wjm.lastrun, wjm.firstlumi, wjm.lastlumi from wmbs_job_mask wjm
INNER JOIN wmbs_job wj ON wjm.job = wj.id
WHERE wjm.firstlumi>900 and wjm.lastlumi<1000
and wj.cache_dir like '%/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/HIG-RunIISummer19UL17wmLHEGEN-00493_0/JobCollection_%';
yields these results on vocms0253 (nothing on the other two agents):
JOB | FIRSTRUN | LASTRUN | FIRSTLUMI | LASTLUMI
1980252 1 1 913 928
1980253 1 1 929 944
1980254 1 1 945 960
1980255 1 1 961 976
1980256 1 1 977 992
which sort of makes me believe we've got a lumi section (from the FJR) that we didn't expect.
Checking the least painful way to find logs for those files now...
Okay, looking at one of the files that got invalidated (with lumis (3226, 3231, 351, 249, 1381, 3199)):
/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20000/50C82023-B12A-794A-9638-116C5FD179C6.root
it was part of this block:
/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/RunIISummer19UL17wmLHEGEN-106X_mc2017_realistic_v6-v1/GEN#2b1c32bd-87cb-46d6-ba3e-66b8aa90a465
which had a single file, the one invalidated above, and was produced at T2_PK_NCP.
With this information, and checking when the file was injected into PhEDEx (against PhEDExInjector logs), I could reduce the time window to around Nov 3rd to 5th. Now, using the following query on es-cms.cern.ch:
ScheddName:vocms0252.cern.ch AND CMS_JobType:Merge AND Site:T2_PK_NCP AND WMAgent_RequestName:pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301
I found a few jobs, and finally the one that created the file with 6 duplicate lumis!!!!
This was the Merge job that produced the duplicate lumis in vocms0252:
/data/srv/wmagent/current/install/wmagent/JobArchiver/logDir/p/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/JobCluster_2124/Job_2124060.tar.bz2
and I checked EVERY single input unmerged file against the wmbs_file_runlumi_map table, and none of the lumi sections correspond to the list of lumi sections in the output file (from the WMAgent Report file).
The logArchive for this merge job was put at:
/store/unmerged/logs/prod/2019/11/4/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/HIG-RunIISummer19UL17wmLHEGEN-00493_0/HIG-RunIISummer19UL17wmLHEGEN-00493_0MergeRAWSIMoutput/0000/4/2807d08a-612d-4fb0-952f-f642fba819f0-0-4-logArchive.tar.gz
so we should look into that and see what the CMSSW framework job report says. Another option would be to actually open this merge file and look at its content (but it got invalidated already, so it's not an option anymore).
A dump of the job report can be found here: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/reportDump_2165285.txt
Trying to put my hands on that logArchive, to see what the FJR says. If it reports the same lumis as the Report.pkl file, then our last option would be to open the (invalidated) root file itself, which AFAIK is no longer available, in which case nothing else can be done (nor the root cause understood).
FYI @vlimant
the file is still out there; xrdcp got it to /afs/cern.ch/user/v/vlimant/public/ops/50C82023-B12A-794A-9638-116C5FD179C6.root
The interesting part is with /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root which indeed contains non-consecutive lumi sections, with an atypical number of events (300) for a configuration that has a filter efficiency.
What you need is the log of the job that produced that file
There are no logs for the job that produced that file under /eos/cms/store/unmerged/logs/prod/2019/11/4/pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301/HIG-RunIISummer19UL17wmLHEGEN-00493_0/00//*
And here is the WMCore log: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/Job_2046469.tar.bz2
for the job that produced the unmerged file /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root
Here is a dump of the Report.pkl parsed by the agent: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/reportDump_2046469.txt
and from the wmagentJob.log, this was the job mask:
'mask': {'LastRun': 1, 'LastLumi': 747584, 'FirstRun': 1, 'inclusivemask': True, 'runAndLumis': {}, 'LastEvent': 458268992, 'FirstEvent': 458259185, 'jobID': 2046469, 'FirstLumi': 747569}
which matches what gets reported for both the LHE and the GEN files.
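As a sanity check on that mask, it expands to exactly 16 lumi sections (747569-747584), and none of the duplicated lumis could possibly come out of it; a trivial sketch:
# Expand the inclusive [FirstLumi, LastLumi] window from the job mask quoted above.
mask = {"FirstRun": 1, "LastRun": 1, "FirstLumi": 747569, "LastLumi": 747584}
expected = set(range(mask["FirstLumi"], mask["LastLumi"] + 1))

print(len(expected))                                  # 16
print(expected & {249, 351, 1381, 3199, 3226, 3231})  # empty set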
@vlimant how did you check it has non-consecutive lumi sections?
At this point, I think something funky happened in the framework... I'm still trying to put my hands on that log.
Apparently the CMSSW log tarball for the job that created the unmerged file will never be found. The LogCollect job failed to access the logArchive files, but the Cleanup job didn't fail to delete them!
I'm done looking for logs. My previous comment makes it clear that the agent passed the right lumi mask to the PSet. IF it hadn't, then we should have seen duplicates in the LHE dataset as well (both LHE and GEN were produced by the same job).
Ok, I managed to fetch all the logs we might need and they are all placed here: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/vocms0254/
In short, the production job was set up to produce 16 lumis from 538273 to 538288, and that's exactly what I see in the output WMCore report, in the CMSSW FJR and in the database (for the unmerged file).
Then this file went into a merge job (together with a few other files that got skipped); the merge job has no lumi mask, just a list of unmerged files as input. The output WMCore report reports lumi 917, the CMSSW FJR also reports 917, and so does the database. The question is, where does that lumi come from?
I also managed to download that unmerged file and here are its lumi sections
amaltaro@lxplus731:~/CMSSW_10_6_4/src $ edmLumisInFiles.py 81CB9803-B71C-464B-AD3F-5B53605EF41E.root
{"1": [[538273, 538288]]}
I guess we need to hand it over to the Framework team.
"non consecutive lumisections" from http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/reportDump_2165285.txt
'input': {'source': [{'catalog': '',
'events': 1800,
'guid': '1C1EB1B2-DE76-E711-8E08-0025905A60B6',
'input_source_class': 'PoolSource',
'input_type': 'primaryFiles',
'lfn': '/store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root',
'module_label': 'source',
'pfn': 'root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/20012/D094699F-23CD-4B44-9FCB-D07A4E355259.root',
'runs': {'1': {249: None,
351: None,
1381: None,
3199: None,
3226: None,
3231: None}}}]},
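For completeness, a tiny sketch of flattening that 'runs' structure from the parsed report into plain (run, lumi) pairs:
# 'runs' maps run number (as a string) to {lumi: events}; copied from the dump above.
runs = {'1': {249: None, 351: None, 1381: None, 3199: None, 3226: None, 3231: None}}
pairs = sorted((int(run), lumi) for run, lumis in runs.items() for lumi in lumis)
print(pairs)  # [(1, 249), (1, 351), (1, 1381), (1, 3199), (1, 3226), (1, 3231)]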
I think it's time to tag you here, Chris @Dr15Jones. The short story was posted here: https://github.com/dmwm/WMCore/issues/9432#issuecomment-557911245
Could you please have a look at those logs (or tag someone else who can) to see whether it's some nasty bug in CMSSW, or whether I'm missing some detail in WMCore?
Honestly, I don't see that lumi section coming from anything managed by WMCore. Please let me know if you want a specific issue created in the cmssw repo too.
FYI @bbockelm
So looking at the log from 6420a2d1-6f71-4aea-bbb7-955e67a1db05-0-2-logArchive.tar.gz we see that the first 6 files passed to the job all fail to open and are skipped, e.g.
06-Nov-2019 17:55:12 UTC Initiating request to open file root://se01.grid.nchc.org.tw//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root
[2019-11-06 17:55:12.281465 +0000][Error ][XRootD ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_query (code: kXR_Qconfig, arg length: 9): [FATAL] Connection error
[2019-11-06 17:55:12.281561 +0000][Error ][XRootD ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Connection error
06-Nov-2019 17:55:12 UTC Fallback request to file root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root
[2019-11-06 17:56:27.285088 +0000][Warning][XRootD ] [xrootd-cms.infn.it:1094] Redirect limit has been reached for message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=+1413xrootd.ba.infn.it,se01.grid.nchc.org.tw,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it&triedrc=enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ), the last known error is: [ERROR] Error response: No such file or directory
%MSG-w XrdAdaptorInternal: file_open 06-Nov-2019 17:56:27 UTC pre-events
Failed to open file at URL root://xrootd-cms.infn.it:1094//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=+1413xrootd.ba.infn.it,se01.grid.nchc.org.tw.
%MSG
[2019-11-06 17:57:40.294654 +0000][Warning][XRootD ] [xrootd-cms.infn.it:1094] Redirect limit has been reached for message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=+1413xrootd.ba.infn.it,se01.grid.nchc.org.tw,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it,xrootd.ba.infn.it&triedrc=enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent,enoent, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ), the last known error is: [ERROR] Error response: No such file or directory
%MSG-w XrdAdaptorInternal: file_open 06-Nov-2019 17:57:40 UTC pre-events
Failed to open file at URL root://xrootd-cms.infn.it:1094//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root?tried=.
%MSG
%MSG-w : AfterFile 06-Nov-2019 17:57:40 UTC pre-events
Input file: root://se01.grid.nchc.org.tw//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40029/9575E7CE-B4DD-6545-814A-5EEA9C6D7DA1.root was not found or could not be opened, and will be skipped.
Then the 7th one fails to open at se01.grid.nchc.org.tw, gets redirected to root://xrootd-cms.infn.it but then reports
%MSG-w XrdAdaptor: file_open 06-Nov-2019 17:57:52 UTC pre-events
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
%MSG
and then starts processing Run 1, Event 1605760, LumiSection 917
The full part of the log related to that is
%MSG
06-Nov-2019 17:57:40 UTC Initiating request to open file root://se01.grid.nchc.org.tw//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
[2019-11-06 17:57:40.301526 +0000][Error ][XRootD ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_query (code: kXR_Qconfig, arg length: 9): [FATAL] Connection error
[2019-11-06 17:57:40.301604 +0000][Error ][XRootD ] [se01.grid.nchc.org.tw:1094] Unable to send the message kXR_open (file: /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Connection error
06-Nov-2019 17:57:40 UTC Fallback request to file root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
%MSG-w XrdAdaptor: file_open 06-Nov-2019 17:57:52 UTC pre-events
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
%MSG
06-Nov-2019 17:57:55 UTC Successfully opened file root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
Begin processing the 1st record. Run 1, Event 1605760, LumiSection 917 on stream 0 at 06-Nov-2019 18:04:18.938 UTC
Begin processing the 2nd record. Run 1, Event 1605768, LumiSection 917 on stream 0 at 06-Nov-2019 18:04:19.542 UTC
If I now do
edmFileUtil -e root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root
I find that the file exists, but contains neither LuminosityBlock 917 nor event number 1605760.
The only conclusion I can draw is that, when the redirection happened, the system sent data from the wrong file. Is it possible that around this time a file with a similar name was sitting at brunel.ac.uk and the redirection picked up that file?
@amaltaro Is it possible to get access to the merged file from the job? That job report says the GUID of the file is 3B506A4B-B36A-374A-B87A-CB2FA045AAE4 with an LFN of
<LFN>/store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/Merged.root</LFN>
Looking at the internal provenance of the file might give us a clue as to the actual origin of the file which was read during the merge job.
Thanks for looking into this, Chris. Yes, I managed to retrieve that file and it's available here: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_9432/vocms0254/
or if you want to go straight to the source, you should also be able to retrieve it via
xrdcp root://cmsxrootd.fnal.gov//store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root .
Using edmProvDump I found that the file which was read during the merge was generated using CMSSW_10_2_16_patch2. It used the externalLHEProducer configured as:
Module: externalLHEProducer SIM
PSet id:2496cdda1332a2dab7b18d789f779ff2
products: {
LHEEventProduct_externalLHEProducer__SIM.
LHERunInfoProduct_externalLHEProducer__SIM.
}
parameters: {
@module_edm_type: string tracked = 'EDProducer'
@module_label: string tracked = 'externalLHEProducer'
@module_type: string tracked = 'ExternalLHEProducer'
args: vstring tracked = {'/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc630/13TeV/madgraph/V5_2.6.0/GluGluToHToTauTau/v1/GluGluToHToTauTauPlusTwoJets_M125_13TeV_amcatnloFXFX_pythia8_slc6_amd64_gcc630_CMSSW_9_3_8_tarball.tar.xz'}
numberOfParameters: uint32 tracked = 1
outputFile: string tracked = 'cmsgrid_final.lhe'
scriptName: FileInPath tracked = V001 GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh 2 /src/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh
}
It also has generator set to Pythia8HadronizerFilter, configured as:
Module: generator SIM
PSet id:4aeb52791412d998a0497f75dd7dd209
products: {
GenEventInfoProduct_generator__SIM.
GenFilterInfo_generator__SIM.
GenLumiInfoHeader_generator__SIM.
GenLumiInfoProduct_generator__SIM.
GenRunInfoProduct_generator__SIM.
edmHepMCProduct_generator_unsmeared_SIM.
}
parameters: {
@module_edm_type: string tracked = 'EDFilter'
@module_label: string tracked = 'generator'
@module_type: string tracked = 'Pythia8HadronizerFilter'
comEnergy: double tracked = 13000
HepMCFilter: PSet tracked = ({
filterName: string tracked = 'EmbeddingHepMCFilter'
filterParameters: PSet tracked = ({
BosonPDGID: int32 tracked = 25
ElElCut: string tracked = 'El1.Pt > 22 && El2.Pt > 10 && El1.Eta < 2.6 && El2.Eta < 2.6'
ElHadCut: string tracked = 'El.Pt > 22 && Had.Pt > 16 && El.Eta < 2.6 && Had.Eta < 2.7'
ElMuCut: string tracked = 'Mu.Pt > 9 && El.Pt > 11 && El.Eta < 2.6 && Mu.Eta < 2.5'
Final_States: vstring tracked = {'ElHad','ElMu','HadHad','MuHad'}
HadHadCut: string tracked = 'Had1.Pt > 33 && Had2.Pt > 33 && Had1.Eta < 2.5 && Had2.Eta < 2.5'
MuHadCut: string tracked = 'Mu.Pt > 19 && Had.Pt > 16 && Mu.Eta < 2.5 && Had.Eta < 2.7'
MuMuCut: string tracked = 'Mu1.Pt > 17 && Mu2.Pt > 8 && Mu1.Eta < 2.5 && Mu2.Eta < 2.5'
})
})
PythiaParameters: PSet tracked = ({
parameterSets: vstring tracked = {'pythia8CommonSettings','pythia8CP5Settings','pythia8aMCatNLOSettings','pythia8PSweightsSettings','processParameters'}
processParameters: vstring tracked = {'JetMatching:setMad = off','JetMatching:scheme = 1','JetMatching:merge = on','JetMatching:jetAlgorithm = 2','JetMatching:etaJetMax = 999.','JetMatching:coneRadius = 1.','JetMatching:slowJetPower = 1','JetMatching:qCut = 30.','JetMatching:doFxFx = on','JetMatching:qCutME = 10.','JetMatching:nQmatch = 5','JetMatching:nJetMax = 2','SLHA:useDecayTable = off','25:m0 = 125.0','25:onMode = off','25:onIfMatch = 15 -15','Higgs:useBSM = on','HiggsH1:parity = 1'}
pythia8CP5Settings: vstring tracked = {'Tune:pp 14','Tune:ee 7','MultipartonInteractions:ecmPow=0.03344','PDF:pSet=20','MultipartonInteractions:bProfile=2','MultipartonInteractions:pT0Ref=1.41','MultipartonInteractions:coreRadius=0.7634','MultipartonInteractions:coreFraction=0.63','ColourReconnection:range=5.176','SigmaTotal:zeroAXB=off','SpaceShower:alphaSorder=2','SpaceShower:alphaSvalue=0.118','SigmaProcess:alphaSvalue=0.118','SigmaProcess:alphaSorder=2','MultipartonInteractions:alphaSvalue=0.118','MultipartonInteractions:alphaSorder=2','TimeShower:alphaSorder=2','TimeShower:alphaSvalue=0.118'}
pythia8CommonSettings: vstring tracked = {'Tune:preferLHAPDF = 2','Main:timesAllowErrors = 10000','Check:epTolErr = 0.01','Beams:setProductionScalesFromLHEF = off','SLHA:keepSM = on','SLHA:minMassSM = 1000.','ParticleDecays:limitTau0 = on','ParticleDecays:tau0Max = 10','ParticleDecays:allowPhotonRadiation = on'}
pythia8PSweightsSettings: vstring tracked = {'UncertaintyBands:doVariations = on','UncertaintyBands:List = {isrRedHi isr:muRfac=0.707,fsrRedHi fsr:muRfac=0.707,isrRedLo isr:muRfac=1.414,fsrRedLo fsr:muRfac=1.414,isrDefHi isr:muRfac=0.5,fsrDefHi fsr:muRfac=0.5,isrDefLo isr:muRfac=2.0,fsrDefLo fsr:muRfac=2.0,isrConHi isr:muRfac=0.25,fsrConHi fsr:muRfac=0.25,isrConLo isr:muRfac=4.0,fsrConLo fsr:muRfac=4.0,fsr_G2GG_muR_dn fsr:G2GG:muRfac=0.5,fsr_G2GG_muR_up fsr:G2GG:muRfac=2.0,fsr_G2QQ_muR_dn fsr:G2QQ:muRfac=0.5,fsr_G2QQ_muR_up fsr:G2QQ:muRfac=2.0,fsr_Q2QG_muR_dn fsr:Q2QG:muRfac=0.5,fsr_Q2QG_muR_up fsr:Q2QG:muRfac=2.0,fsr_X2XG_muR_dn fsr:X2XG:muRfac=0.5,fsr_X2XG_muR_up fsr:X2XG:muRfac=2.0,fsr_G2GG_cNS_dn fsr:G2GG:cNS=-2.0,fsr_G2GG_cNS_up fsr:G2GG:cNS=2.0,fsr_G2QQ_cNS_dn fsr:G2QQ:cNS=-2.0,fsr_G2QQ_cNS_up fsr:G2QQ:cNS=2.0,fsr_Q2QG_cNS_dn fsr:Q2QG:cNS=-2.0,fsr_Q2QG_cNS_up fsr:Q2QG:cNS=2.0,fsr_X2XG_cNS_dn fsr:X2XG:cNS=-2.0,fsr_X2XG_cNS_up fsr:X2XG:cNS=2.0,isr_G2GG_muR_dn isr:G2GG:muRfac=0.5,isr_G2GG_muR_up isr:G2GG:muRfac=2.0,isr_G2QQ_muR_dn isr:G2QQ:muRfac=0.5,isr_G2QQ_muR_up isr:G2QQ:muRfac=2.0,isr_Q2QG_muR_dn isr:Q2QG:muRfac=0.5,isr_Q2QG_muR_up isr:Q2QG:muRfac=2.0,isr_X2XG_muR_dn isr:X2XG:muRfac=0.5,isr_X2XG_muR_up isr:X2XG:muRfac=2.0,isr_G2GG_cNS_dn isr:G2GG:cNS=-2.0,isr_G2GG_cNS_up isr:G2GG:cNS=2.0,isr_G2QQ_cNS_dn isr:G2QQ:cNS=-2.0,isr_G2QQ_cNS_up isr:G2QQ:cNS=2.0,isr_Q2QG_cNS_dn isr:Q2QG:cNS=-2.0,isr_Q2QG_cNS_up isr:Q2QG:cNS=2.0,isr_X2XG_cNS_dn isr:X2XG:cNS=-2.0,isr_X2XG_cNS_up isr:X2XG:cNS=2.0}','UncertaintyBands:nFlavQ = 4','UncertaintyBands:MPIshowers = on','UncertaintyBands:overSampleFSR = 10.0','UncertaintyBands:overSampleISR = 10.0','UncertaintyBands:FSRpTmin2Fac = 20','UncertaintyBands:ISRpTmin2Fac = 1'}
pythia8aMCatNLOSettings: vstring tracked = {'SpaceShower:pTmaxMatch = 1','SpaceShower:pTmaxFudge = 1','SpaceShower:MEcorrections = off','TimeShower:pTmaxMatch = 1','TimeShower:pTmaxFudge = 1','TimeShower:MEcorrections = off','TimeShower:globalRecoil = on','TimeShower:limitPTmaxGlobal = on','TimeShower:nMaxGlobalRecoil = 1','TimeShower:globalRecoilMode = 2','TimeShower:nMaxGlobalBranch = 1','TimeShower:weightGluonToQuark = 1'}
})
}
The file that was supposed to be read was created in CMSSW_10_6_4 and has an externalLHEProducer configured as:
Module: externalLHEProducer GEN
PSet id:3de64faceacf866730598c905d987c98
products: {
LHEEventProduct_externalLHEProducer__GEN.
LHERunInfoProduct_externalLHEProducer__GEN.
}
parameters: {
@module_edm_type: string tracked = 'EDProducer'
@module_label: string tracked = 'externalLHEProducer'
@module_type: string tracked = 'ExternalLHEProducer'
args: vstring tracked = {'/cvmfs/cms.cern.ch/phys_generator/gridpacks/UL/13TeV/madgraph/V5_2.6.1/QCD_HT_Binned/QCD_HT100to200/v1/QCD_HT100to200_MG261_NNPDF31_slc7_amd64_gcc700_CMSSW_10_6_0_tarball.tar.xz'}
numberOfParameters: uint32 tracked = 1
outputFile: string tracked = 'cmsgrid_final.lhe'
scriptName: FileInPath tracked = V001 GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh 2 /src/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh
}
dependencies: {
}
and a generator defined as
Module: generator GEN
PSet id:d1f5dc9fcf2ebcb5c9017bef5836a70a
products: {
GenEventInfoProduct_generator__GEN.
GenLumiInfoHeader_generator__GEN.
GenLumiInfoProduct_generator__GEN.
GenRunInfoProduct_generator__GEN.
edmHepMCProduct_generator_unsmeared_GEN.
}
parameters: {
@module_edm_type: string tracked = 'EDFilter'
@module_label: string tracked = 'generator'
@module_type: string tracked = 'Pythia8HadronizerFilter'
comEnergy: double tracked = 13000
PythiaParameters: PSet tracked = ({
parameterSets: vstring tracked = {'pythia8CommonSettings','pythia8CP5Settings','pythia8PSweightsSettings','processParameters'}
processParameters: vstring tracked = {'JetMatching:setMad = off','JetMatching:scheme = 1','JetMatching:merge = on','JetMatching:jetAlgorithm = 2','JetMatching:etaJetMax = 5.','JetMatching:coneRadius = 1.','JetMatching:slowJetPower = 1','JetMatching:qCut = 14.','JetMatching:nQmatch = 5','JetMatching:nJetMax = 4','JetMatching:doShowerKt = off'}
pythia8CP5Settings: vstring tracked = {'Tune:pp 14','Tune:ee 7','MultipartonInteractions:ecmPow=0.03344','MultipartonInteractions:bProfile=2','MultipartonInteractions:pT0Ref=1.41','MultipartonInteractions:coreRadius=0.7634','MultipartonInteractions:coreFraction=0.63','ColourReconnection:range=5.176','SigmaTotal:zeroAXB=off','SpaceShower:alphaSorder=2','SpaceShower:alphaSvalue=0.118','SigmaProcess:alphaSvalue=0.118','SigmaProcess:alphaSorder=2','MultipartonInteractions:alphaSvalue=0.118','MultipartonInteractions:alphaSorder=2','TimeShower:alphaSorder=2','TimeShower:alphaSvalue=0.118','SigmaTotal:mode = 0','SigmaTotal:sigmaEl = 21.89','SigmaTotal:sigmaTot = 100.309','PDF:pSet=LHAPDF6:NNPDF31_nnlo_as_0118'}
pythia8CommonSettings: vstring tracked = {'Tune:preferLHAPDF = 2','Main:timesAllowErrors = 10000','Check:epTolErr = 0.01','Beams:setProductionScalesFromLHEF = off','SLHA:keepSM = on','SLHA:minMassSM = 1000.','ParticleDecays:limitTau0 = on','ParticleDecays:tau0Max = 10','ParticleDecays:allowPhotonRadiation = on'}
pythia8PSweightsSettings: vstring tracked = {'UncertaintyBands:doVariations = on','UncertaintyBands:List = {isrRedHi isr:muRfac=0.707,fsrRedHi fsr:muRfac=0.707,isrRedLo isr:muRfac=1.414,fsrRedLo fsr:muRfac=1.414,isrDefHi isr:muRfac=0.5,fsrDefHi fsr:muRfac=0.5,isrDefLo isr:muRfac=2.0,fsrDefLo fsr:muRfac=2.0,isrConHi isr:muRfac=0.25,fsrConHi fsr:muRfac=0.25,isrConLo isr:muRfac=4.0,fsrConLo fsr:muRfac=4.0,fsr_G2GG_muR_dn fsr:G2GG:muRfac=0.5,fsr_G2GG_muR_up fsr:G2GG:muRfac=2.0,fsr_G2QQ_muR_dn fsr:G2QQ:muRfac=0.5,fsr_G2QQ_muR_up fsr:G2QQ:muRfac=2.0,fsr_Q2QG_muR_dn fsr:Q2QG:muRfac=0.5,fsr_Q2QG_muR_up fsr:Q2QG:muRfac=2.0,fsr_X2XG_muR_dn fsr:X2XG:muRfac=0.5,fsr_X2XG_muR_up fsr:X2XG:muRfac=2.0,fsr_G2GG_cNS_dn fsr:G2GG:cNS=-2.0,fsr_G2GG_cNS_up fsr:G2GG:cNS=2.0,fsr_G2QQ_cNS_dn fsr:G2QQ:cNS=-2.0,fsr_G2QQ_cNS_up fsr:G2QQ:cNS=2.0,fsr_Q2QG_cNS_dn fsr:Q2QG:cNS=-2.0,fsr_Q2QG_cNS_up fsr:Q2QG:cNS=2.0,fsr_X2XG_cNS_dn fsr:X2XG:cNS=-2.0,fsr_X2XG_cNS_up fsr:X2XG:cNS=2.0,isr_G2GG_muR_dn isr:G2GG:muRfac=0.5,isr_G2GG_muR_up isr:G2GG:muRfac=2.0,isr_G2QQ_muR_dn isr:G2QQ:muRfac=0.5,isr_G2QQ_muR_up isr:G2QQ:muRfac=2.0,isr_Q2QG_muR_dn isr:Q2QG:muRfac=0.5,isr_Q2QG_muR_up isr:Q2QG:muRfac=2.0,isr_X2XG_muR_dn isr:X2XG:muRfac=0.5,isr_X2XG_muR_up isr:X2XG:muRfac=2.0,isr_G2GG_cNS_dn isr:G2GG:cNS=-2.0,isr_G2GG_cNS_up isr:G2GG:cNS=2.0,isr_G2QQ_cNS_dn isr:G2QQ:cNS=-2.0,isr_G2QQ_cNS_up isr:G2QQ:cNS=2.0,isr_Q2QG_cNS_dn isr:Q2QG:cNS=-2.0,isr_Q2QG_cNS_up isr:Q2QG:cNS=2.0,isr_X2XG_cNS_dn isr:X2XG:cNS=-2.0,isr_X2XG_cNS_up isr:X2XG:cNS=2.0}','UncertaintyBands:nFlavQ = 4','UncertaintyBands:MPIshowers = on','UncertaintyBands:overSampleFSR = 10.0','UncertaintyBands:overSampleISR = 10.0','UncertaintyBands:FSRpTmin2Fac = 20','UncertaintyBands:ISRpTmin2Fac = 1'}
})
}
dependencies: {
LHEEventProduct_externalLHEProducer__GEN.
}
Clearly, the merge did NOT read the correct file. However, the file was chosen by AAA, not by the framework.
Well that's horrifying. AAA basically takes the site at its word that it is serving up the correct file. Can we involve the Brunel sysadmins?
From what Chris posted above (from the logs too):
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
how come it says the original site is T2_IT_Pisa when the job is running at T2_TW_NCHC (and thus the unmerged files are supposed to be available there as well)?
It just came to my mind - it's very very unlikely but not impossible - that there could have been another job that ran at Brunel, staged that same file name out but failed at a later stage. If the cleanup job didn't manage to clean that file up, then well, it can be served through xrootd...
It looks like it didn't happen. The job responsible for producing that file with lumi 917 (in vocms0253) succeeded in its first attempt. So there is no way a job could have left a sick file behind!
@amaltaro wrote
The job responsible for producing that file with lumi 917 (in vocms0253) succeeded in its first attempt.
Did you mean lumi 917 for this workflow? As far as I can tell, the file read containing lumi 917 was NOT from this workflow since its provenance is completely different (not even the same CMSSW release). My suggestion would be to try to find which file was actually read based on the configuration of the ExternalLHEProducer and Pythia8.
Yes, there were 2 files with the lumi section 917. One from vocms0253 (the correct/valid one), the other from vocms0254 (which has already been invalidated).
Yes, I made a ggus ticket involving the Brunel site admins: https://ggus.eu/index.php?mode=ticket_info&ticket_id=144246
let's see if we can track down where that file comes from and why it got served under that path.
At least we can conclude there is nothing wrong with WMCore/WMAgent here.
The conclusions so far are a little scary. If one needs more examples, just look under https://cms-unified.web.cern.ch/cms-unified/assistance.html#assistance-filemismatch; all of those have had recent file invalidations because of duplicates.
@Dr15Jones are you sure it was CMSSW_10_2_16_patch2 and not CMSSW_10_2_16_patch1? I found https://cms-pdmv.cern.ch/mcm/requests?page=0&dataset_name=GluGluToHToTauTau*&cmssw_release=CMSSW_10_2_16_patch1 but nothing for patch2. The path /cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc630/13TeV/madgraph/V5_2.6.0/GluGluToHToTauTau/v1/GluGluToHToTauTauPlusTwoJets_M125_13TeV_amcatnloFXFX_pythia8_slc6_amd64_gcc630_CMSSW_9_3_8_tarball.tar.xz does not seem to follow the agreed naming convention. It almost looks like the file that was served to the merge job was NOT from central production...
@vlimant Doing
edmProvDump -t root://cmsxrootd.fnal.gov//store/mc/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40001/3B506A4B-B36A-374A-B87A-CB2FA045AAE4.root
gives
-------------------------
Processing History:
SIM '' '"CMSSW_10_2_16_patch2"' [1] (17f0e17ab95b7fbef8bfd448939f3e9d)
---------Producers with data in file---------
I'd gladly hand the rest of this debugging (xrootd/site related, file chasing) to someone else... in any case, here are ggus tickets created for Pisa and for NCHC, hoping we can get some useful information: https://ggus.eu/?mode=ticket_info&ticket_id=144296&come_from=submit https://ggus.eu/?mode=ticket_info&ticket_id=144297&come_from=submit
I still find it hard to debug when I can't even make sense of this log message:
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
which to me seems pretty misleading! What does original site mean here? How does XRootD have this information? Is brunel.ac.uk the storage that the data comes from? Or is it another component of the XRootD eco-system (a redirector or something like that)?
From the FrameworkJobReport.xml of 6420a2d1-6f71-4aea-bbb7-955e67a1db05-0-2-logArchive.tar.gz you can see the GUID of the opened file, and already it doesn't match the LFN (which I think it should, based on looking at other FrameworkJobReport examples):
<InputFile>
<State Value="closed"/>
<LFN>/store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root</LFN>
<PFN>root://xrootd-cms.infn.it//store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027</GUID>
So I went and searched for B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027, and found it in the WMArchive records on hadoop! There was a production job running at T2_UK_London_Brunel that produced an output file with PFN davs://dc2-grid-64.brunel.ac.uk/dpm/brunel.ac.uk/home/cms/store/unmerged/RunIIFall18wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCP5_13TeV-mcatnloFXFX-pythia8/GEN-SIM/102X_upgrade2018_realistic_v11-v1/230000/B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.root. It has one lumi section, 917, and from what it looks like, all the metadata matches up with the provenance that Chris found. Stage-out finished at 2019-11-05 01:43:10 (not sure about the timezone). For comparison, the job in Taiwan started reading the file at 06-Nov-2019 17:57:55 UTC.
Later, there was a reconstruction job that also ran at Brunel and successfully read in this file using PFN root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/unmerged/RunIIFall18wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCP5_13TeV-mcatnloFXFX-pythia8/GEN-SIM/102X_upgrade2018_realistic_v11-v1/230000/B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.root.
I put both records here: http://ncsmith.web.cern.ch/ncsmith/who_made_B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.json
I think the next step is to understand with Brunel and xrootd-cms.infn.it (Pisa?) whether there are logs of a request for B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027 or 81CB9803-B71C-464B-AD3F-5B53605EF41E, to see where the filename-file mismatch happened.
@amaltaro - to decode this message:
Data is served from brunel.ac.uk instead of original site T2_IT_Pisa
This means that CMSSW tried to open the file at xrootd-cms.infn.it (the European redirector), which identified itself as named T2_IT_Pisa. Then it was redirected to a server at brunel.ac.uk (which must not have a site name set; the DNS domain name is used in such a case).
I don't think there's much use in asking NCHC what's wrong. Either DPM at Brunel or the cmsd at the Pisa redirector failed massively here.
@vlimant - there are some interesting other files in the Unified link you sent. In particular, from there, can you pull the logs corresponding to HIG-RunIISummer19UL18wmLHEGEN-00233_0MergeRAWSIMoutput:8023?
That job caught my eye because its error says the product named int_addPileupInfo_bunchSpacing_HLT. is in file /store/unmerged/RunIISummer19UL18wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/240000/468512E6-88D8-5C44-8A7E-344184CC62F3.root.
Why are various HLT, RECO, and SIM products in a GEN file?!?
There are no further logs handy for the 8023 error code indeed; one has to look in WMArchive.
@nsmith is making progress on a script to automatically detect these. In November, he's found 2,000 files that were incorrectly delivered; 100 resulted in successful job output.
One successfully merged file is /store/mc/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30000/0A4DEA82-E369-2E4A-902B-E762A4B3BACA.root -- which still exists in Xrootd if anyone wants to take a look.
(I should note that the incorrect, unmerged file in the case above came from Brunel as well)
This is all performed on the November FrameworkJobReport logs in hadoop, /cms/wmarchive/avro/fwjr/2019/11/*/*.avro. Since the WMArchive on hadoop also records the GUID read from the input file upon opening, I wrote something to compare that field to the value extracted from the PFN/LFN. At first there were many mismatches, but a large class of them turned out to be a bug in the way CMSSW handles DQM files, see https://github.com/cms-sw/cmssw/issues/28534. After removing those, we are left with about 3000 files, where the PFN always starts with the root protocol (no local reads, etc.). The counts per PFN prefix are:
root://cms-xrootd.gridpp.ac.uk/ 83
root://cms-xrootd.gridpp.ac.uk:1094/ 25
root://cmsxrootd.fnal.gov/ 384
root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms 54
root://heplnv139.pp.rl.ac.uk:1094/ 131
root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms 1
root://node12.datagrid.cea.fr/ 26
root://polgrid4.in2p3.fr/ 37
root://se01.indiacms.res.in/ 1
root://xrootd-cms.infn.it/ 1712
root://xrootd-cms.infn.it:1194/ 307
root://xrootd-es.pic.es:1096/ 151
root://xrootd.t2.ucsd.edu:2040/ 12
Name: pfn, dtype: int64
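For illustration, the GUID-versus-LFN check itself boils down to comparing the recorded GUID with the UUID-style basename of the LFN. A minimal sketch of that comparison, shown here against a FrameworkJobReport.xml on disk rather than against the WMArchive avro records:
import os
import xml.etree.ElementTree as ET

def guid_mismatches(fjr_path):
    """Yield (lfn, expected_guid, reported_guid) for PoolSource inputs whose GUID
    does not match the UUID-style basename of the LFN."""
    tree = ET.parse(fjr_path)
    for infile in tree.iter("InputFile"):
        lfn = (infile.findtext("LFN") or "").strip()
        guid = (infile.findtext("GUID") or "").strip()
        expected = os.path.splitext(os.path.basename(lfn))[0]
        # Note: not every LFN basename is a GUID (e.g. the DQM special cases above),
        # so some filtering of false positives is still needed on top of this.
        if lfn and guid and guid != expected:
            yield lfn, expected, guid

for lfn, expected, got in guid_mismatches("FrameworkJobReport.xml"):
    print(f"{lfn}\n  expected GUID {expected} but the opened file reported {got}")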
There is some probability that the same collection of job reports also includes the record of producing the input file which was actually read by the process. This can be found by searching for the output GUID. So I did an inner join on that, still only considering November, and found about 600 entries. Now we know which site produced the file. If we filter out any files that are not under /store/unmerged, we can be sure that the site is also the only possible site the file could have been read from. This leaves:
fwjr_site
T2_FR_GRIF_IRFU 2
T2_FR_GRIF_LLR 4
T2_PL_Swierk 9
T2_UK_London_Brunel 670
T2_UK_SGrid_RALPP 1
Name: pfn, dtype: int64
I've inspected a few cases in detail. The one Brian mentions above appears quite similar to the case we were originally analyzing: the job at Nebraska appeared to open a file from Brunel that should have been at Nebraska:
11-Nov-2019 23:42:16 UTC Initiating request to open file /mnt/hadoop/user/uscms01/pnfs/unl.edu/data4/cms/store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root
11-Nov-2019 23:42:16 UTC Fallback request to file root://cmsxrootd.fnal.gov//store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root
%MSG-w XrdAdaptor: file_open 11-Nov-2019 23:42:26 UTC pre-events
Data is served from brunel.ac.uk instead of original site US-Regional-Nebraska
%MSG
11-Nov-2019 23:42:39 UTC Successfully opened file root://cmsxrootd.fnal.gov//store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root
and in the FrameworkJobReport we see the GUID mismatch:
<InputFile>
<State Value="closed"/>
<LFN>/store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root</LFN>
<PFN>root://cmsxrootd.fnal.gov//store/unmerged/RunIISummer19UL18wmLHEGEN/JJH0Mf05ph0ToBB_M-125_TuneCP5_13TeV-JHUGenV7011-pythia8/GEN/106X_upgrade2018_realistic_v4-v1/30004/16EE2D4B-1AFE-C24A-8E95-CF7FEEB367E8.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>1088D615-B304-EA11-8B83-C45444922B77</GUID>
<Branches>
<Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<InputType>primaryFiles</InputType>
<InputSourceClass>PoolSource</InputSourceClass>
<EventsRead>1524</EventsRead>
<Runs>
<Run ID="1">
<LumiSection ID="194539"/>
</Run>
</Runs>
</InputFile>
The file with that GUID is
davs://dc2-grid-64.brunel.ac.uk/dpm/brunel.ac.uk/home/cms/store/unmerged/RunIISummer15wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCUETP8M1_13TeV-mcatnloFXFX-pythia8/LHE/MCRUN2_71_V1-v1/60038/1088D615-B304-EA11-8B83-C45444922B77.root
produced by a job at Brunel. Logs for the Nebraska job can be found in 06aaba88-074f-4f18-bd60-3984faff4f38-30-0-logArchive.tar.gz
inside the LogArchive at
root://eoscms.cern.ch//store/logs/prod/2019/11/WMAgent/pdmvserv_task_HIG-RunIISummer19UL18wmLHEGEN-00231__v1_T_191030_054827_8689/pdmvserv_task_HIG-RunIISummer19UL18wmLHEGEN-00231__v1_T_191030_054827_8689-HIG-RunIISummer19UL18wmLHEGEN-00231_0RAWSIMoutputMergeLogCollect-rice-a461-11-logs.tar
A more interesting case is a job at T2_UK_SGrid_RALPP that attempted to read a file but instead ended up opening a file that it had opened moments earlier while running:
26-Nov-2019 13:17:57 UTC Initiating request to open file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root
[2019-11-26 13:20:58.760667 +0000][Error ][XRootD ] [[2001:630:58:1c20::82f6:2fb6]:50925] Unable to get the response to request kXR_read (handle: 0x00000000, offset: 0, size: 300)
[2019-11-26 13:21:10.060894 +0000][Error ][File ] [0xcee0a980@root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root] Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 0, size: 300) returned with [ERROR] Operation expired
%MSG-w XrdAdaptorInternal: (NoModuleName) 26-Nov-2019 13:21:10 UTC pre-events
XrdRequestManager::handle(name='root://mover.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root) failure when reading from [2001:630:58:1c20::82f6:2fb6]:50925 (unknown site); failed with error '[ERROR] Operation expired' (errno=0, code=206).
%MSG
%MSG-w XrdAdaptorInternal: (NoModuleName) 26-Nov-2019 13:21:10 UTC pre-events
Request failure when reading from [2001:630:58:1c20::82f6:2fb6]:50925 (unknown site)
%MSG
%MSG-w XrdAdaptor: (NoModuleName) 26-Nov-2019 13:21:10 UTC pre-events
Data is now served from instead of previous [2001
%MSG
[2019-11-26 13:21:43.201866 +0000][Error ][AsyncSock ] [[2001:630:58:1c20::82f6:2fb6]:50925 #0.0] Socket error while handshaking: [ERROR] Socket timeout
[2019-11-26 13:22:28.396188 +0000][Error ][AsyncSock ] [[2001:630:58:1c20::82f6:2fb6]:50925 #0.0] Socket error while handshaking: [ERROR] Socket timeout
[2019-11-26 13:22:28.396431 +0000][Error ][PostMaster ] [[2001:630:58:1c20::82f6:2fb6]:50925 #0] Unable to recover: [ERROR] Socket timeout.
[2019-11-26 13:22:28.396502 +0000][Error ][XRootD ] [[2001:630:58:1c20::82f6:2fb6]:50925] Impossible to send message kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=807b3cf2-4af3-4d3d-aebf-8937ae044c84&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e, mode: 0660, flags: kXR_open_read ). Trying to recover.
[2019-11-26 13:22:28.396651 +0000][Error ][XRootD ] [[2001:630:58:1c20::82f6:2fb6]:50925] Impossible to send message kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=fcba3e95-e929-425b-a5fe-d5f99b8fd810&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e&tried=[2001, mode: 0660, flags: kXR_open_read kXR_async kXR_retstat ). Trying to recover.
%MSG-w XrdAdaptorInternal: (NoModuleName) 26-Nov-2019 13:24:20 UTC pre-events
Caught a CMSSW exception when running connection recovery.
%MSG
26-Nov-2019 13:24:20 UTC Fallback request to file root://cms-xrootd.gridpp.ac.uk:1094//store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root
[2019-11-26 13:24:21.065115 +0000][Error ][XRootD ] [mover.pp.rl.ac.uk:1094] Unable to get the response to request kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=fcba3e95-e929-425b-a5fe-d5f99b8fd810&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e&tried=[2001,[2001:630:58:1c20::82f6:2fb6], mode: 0660, flags: kXR_open_read kXR_async kXR_retstat )
[2019-11-26 13:24:21.065739 +0000][Error ][XRootD ] [mover.pp.rl.ac.uk:1094] Unable to get the response to request kXR_open (file: pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root?org.dcache.uuid=807b3cf2-4af3-4d3d-aebf-8937ae044c84&org.dcache.xrootd.client=pltcms07.235@2001:630:58:1c20:0:0:82f6:2d6e&tried=[2001:630:58:1c20::82f6:2fb6], mode: 0660, flags: kXR_open_read )
%MSG-w XrdAdaptor: file_open 26-Nov-2019 13:24:33 UTC PostProcessEvent
Data is served from T2_UK_SGrid_RALPP instead of original site T2_UK_London_IC
%MSG
26-Nov-2019 13:24:33 UTC Successfully opened file root://cms-xrootd.gridpp.ac.uk:1094//store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/E4B64891-6C0F-EA11-B250-A4BF012881D0.root
%MSG-w MemoryCheck: source 26-Nov-2019 13:24:36 UTC PostBeginLumi
MemoryCheck: module source:source VSIZE 3742.29 0.5 RSS 2086.21 -314.145
%MSG
Begin processing the 579th record. Run 1, Event 2069622346, LumiSection 731501 at 26-Nov-2019 13:24:36.252 UTC
... snip ...
26-Nov-2019 13:28:39 UTC Initiating request to open file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/B2F7DE91-6C0F-EA11-94D4-002590DE3AC0.root
26-Nov-2019 13:28:40 UTC Successfully opened file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/B2F7DE91-6C0F-EA11-94D4-002590DE3AC0.root
%MSG-w DuplicateEvent: PostBeginLumi 26-Nov-2019 13:28:40 UTC PostBeginLumi
Duplicate Events found in entire set of input files.
Both events were from run 1 and luminosity block 731501 with event number 2069622346.
The duplicate was from file root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/cms/store/unmerged/RunIISummer15GS/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-photos/GEN-SIM/MCRUN2_71_V1-v1/70010/B2F7DE91-6C0F-EA11-94D4-002590DE3AC0.root.
The duplicate will be skipped.
%MSG
Logs for this job can be found in 076c36c0-9798-4ea5-a711-7836a5c5e6df-6-0-logArchive.tar.gz
inside the LogArchive at
root://eoscms.cern.ch//store/logs/prod/2019/12/WMAgent/pdmvserv_task_SMP-RunIISummer15GS-00275__v1_T_191009_191932_9218/pdmvserv_task_SMP-RunIISummer15GS-00275__v1_T_191009_191932_9218-LogCollectForSMP-RunIISummer16DR80-00068_0-heplnc131-12-logs.tar
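For anyone who wants to pull that specific logArchive out, a sketch of the retrieval, assuming xrdcp is available and a valid grid proxy is in place (paths copied from above):
import subprocess, tarfile

collect = ("root://eoscms.cern.ch//store/logs/prod/2019/12/WMAgent/"
           "pdmvserv_task_SMP-RunIISummer15GS-00275__v1_T_191009_191932_9218/"
           "pdmvserv_task_SMP-RunIISummer15GS-00275__v1_T_191009_191932_9218-"
           "LogCollectForSMP-RunIISummer16DR80-00068_0-heplnc131-12-logs.tar")
subprocess.run(["xrdcp", collect, "logs.tar"], check=True)

wanted = "076c36c0-9798-4ea5-a711-7836a5c5e6df-6-0-logArchive.tar.gz"
with tarfile.open("logs.tar") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(wanted))
    tar.extract(member)
    print("extracted", member.name)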
There are very few cases involving data in this timeframe, but one is rather interesting and looks like possibly a different root cause. This RAW data file has a different UUID than its LFN:
$ edmFileUtil -u root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root
root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root
root://cmsxrootd.fnal.gov//store/data/Run2018D/SingleMuon/RAW/v1/000/322/179/00001/DE727AEB-44B0-E811-AE5A-FA163E294873.root (1 runs, 2 lumis, 5487 events, 2937042356 bytes, 327FE446-F9B7-E811-9C31-FA163E5394F5 uuid)
In November, he's found 2,000 files that were incorrectly delivered; 100 resulted in successful job output.
Does this mean that some samples have events from a different physics process than expected? It looks to me like these need to be identified, to form a list that PPD has to resubmit.
Does this mean that some samples have events from a different physics process than expected?
Yes, that's precisely my concern. @nsmith- is there a way to get the list of affected datasets (including parentage?)
yeah with enough spark queries I should be able to make a list, working on it
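For reference, such a Spark query could look roughly like the sketch below. It is not the actual query; it assumes the spark-avro package is on the classpath and that the FWJR records expose the LFNs in a top-level LFNArray field together with a task field, which should be verified with printSchema() first.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("guid-mismatch-scan").getOrCreate()

# WMArchive FWJR records for November 2019 (same path as quoted earlier in this thread).
df = spark.read.format("avro").load("/cms/wmarchive/avro/fwjr/2019/11/*/*.avro")
df.printSchema()  # check the real layout before trusting the field names below

guid = "B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027"
hits = (df.select("task", F.explode("LFNArray").alias("lfn"))  # "task"/"LFNArray" are assumed names
          .filter(F.col("lfn").contains(guid)))
hits.show(truncate=False)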
Ok, I expanded the scope to consider all production job reports from 2018 and 2019. Here is a plot of the frequency of such GUID mismatches per week:
The unknown component contains both events where the GUID corresponding to the file that was actually opened could not be found in the same records (i.e. a non-production file such as user data, or a file created before 2018), and events where the file was found but was a file in the catalog (i.e. /store/unmerged/ was not in the LFN), so that there is no definitive way to know which site actually hosted and served the file at that point, since it could have been replicated by PhEDEx.
I compiled a list of all successful jobs that read at least one event from an incorrectly opened file and produced output, following unmerged output to the first occurrence of an output file in the catalog. They can be found in input-guid-mismatch-cataloged-output.txt. These files, and their children (if any) should be invalidated and recreated if need be.
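The "follow unmerged output to the first cataloged file" step can be pictured as a simple traversal over (input LFN, output LFN) pairs extracted from the job reports; a sketch of the idea (not the actual script):
from collections import defaultdict, deque

def first_cataloged_outputs(edges, bad_unmerged):
    """edges: iterable of (input_lfn, output_lfn) pairs from the job reports;
    bad_unmerged: unmerged LFNs that were read incorrectly at least once."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)

    cataloged, seen = set(), set()
    queue = deque(bad_unmerged)
    while queue:
        lfn = queue.popleft()
        if lfn in seen:
            continue
        seen.add(lfn)
        for child in children[lfn]:
            if child.startswith("/store/unmerged/"):
                queue.append(child)   # still transient, keep following its descendants
            else:
                cataloged.add(child)  # first cataloged descendant: candidate for invalidation
    return cataloged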
I looked a bit to see if there was any correlation between when the files were created and accessed, but I think little to no conclusion can be drawn as one mostly sees an exponential tail from the lifetime of unmerged files, on top of a background of access to permanent, cataloged files. There may be some value in also checking CRAB jobreports for the same issue, so I might try that next.
P.S. some highlights include this MC file which is actually data: https://cmsweb.cern.ch/das/request?instance=prod/global&input=run+file%3D%2Fstore%2Fmc%2FRunIISummer19UL17wmLHEGEN%2FST_t-channel_top_4f_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8%2FLHE%2F106X_mc2017_realistic_v6-v2%2F310000%2F9C9CEF31-6407-0E43-B541-ADB37CE714D3.root and this data file which is actually MC: https://cmsweb.cern.ch/das/request?instance=prod/global&input=run+file%3D%2Fstore%2Fdata%2FRun2018B%2FParkingBPH2%2FMINIAOD%2F05May2019-v2%2F30001%2F8DD2A46D-6ABE-6348-8F3B-52C0AD1A55EE.root
@nsmith- Are we ready to invalidate data/datasets? Can we build up a list and send it to comp-ops for review?
Pradeep and I are working on it, just specific files. Once invalidated, we report to PdMV and they can decide if anything needs reproducing.
I compiled a list of all successful jobs that read at least one event from an incorrectly opened file and produced output, following unmerged output to the first occurrence of an output file in the catalog. They can be found in input-guid-mismatch-cataloged-output.txt. These files, and their children (if any) should be invalidated and recreated if need be.
I took a look at one of the files on the list /store/data/Run2018B/ParkingBPH2/MINIAOD/05May2019-v2/30001/8DD2A46D-6ABE-6348-8F3B-52C0AD1A55EE.root
Doing
edmProvDump /store/data/Run2018B/ParkingBPH2/MINIAOD/05May2019-v2/30001/8DD2A46D-6ABE-6348-8F3B-52C0AD1A55EE.root
didn't show anything amiss with the histories of the files used when creating this file from a merge. What inconsistency did you find for this file?
One inconsistency is that the path/dataset claims the file to be data, but the edmProvDump output shows that the file is clearly from simulation:
Processing History:
SIM '' '"CMSSW_9_3_6"' [1] (785c63f5fd8d7bdccf2c9db5c4a14fea)
HLT '' '"CMSSW_9_4_0_patch1"' [1] (9d8f590933956796cd3e18bfb7642a20)
RECO '' '"CMSSW_9_4_0_patch1"' [1] (351414bbda2cdc38d49da1680cef2a3f)
The rate has gone down significantly this month, but is still nonzero.
The CMSSW exit codes that are most likely to occur together with the GUID mismatch (# errors with GUID mismatch / # errors total) are:
stepErrorCode | rate
70452 | 0.178694
8026 | 0.138889
8009 | 0.093502
87 | 0.040577
8023 | 0.035229
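That ratio is just the per-exit-code mean of a boolean "had a GUID mismatch" flag over the failed job reports; a toy pandas sketch with made-up rows, to show the shape of the computation:
import pandas as pd

# One row per failed step: its CMSSW exit code and whether the same report had a GUID mismatch.
errors = pd.DataFrame({
    "stepErrorCode":     [8009, 8009, 70452, 70452, 70452, 8023],
    "has_guid_mismatch": [True, False, True,  True,  False, False],
})

rate = (errors.groupby("stepErrorCode")["has_guid_mismatch"]
              .mean()
              .sort_values(ascending=False)
              .rename("rate"))
print(rate)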
No 8034 (the specific exit code for the mismatch) in there yet?
So it seems. I believe it won't be in production until https://github.com/dmwm/WMCore/issues/9468 is implemented
One new such workflow (among others). With the potential of screwing up so many signal samples in analysis, this should have a somewhat higher priority (higher than high priority) if you ask me.
Impact of the bug: WMAgent
Describe the bug: @vlimant reported via Slack that the workflow pdmvserv_task_HIG-RunIISummer19UL17wmLHEGEN-00493__v1_T_191030_134608_301 has had duplicate lumi sections, even though there are no other workflows (ACDC and alike) writing to the same output datasets. We need to investigate it and find out whether: a) the agent really assigned the same lumi section to multiple jobs, b) or it was another problem (maybe something wrong with a merge), and c) find the jobs that produced those lumis, check the logs and the FJR.
How to reproduce it: no clue!
Expected behavior: No duplicate lumi sections should ever be created!
Additional context and error message: Report from Unified: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-RunIISummer19UL17wmLHEGEN-00493&module=checkor&limit=50&size=1000
Dima's page: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-RunIISummer19UL17wmLHEGEN-00493