The only thing I found in the agent (vocms0116) was logs like these [1]. @amaltaro, do you think these errors could cause duplicates?
[1]
install/wmagent/JobAccountant/ComponentLog:2016-11-03 20:34:08,337:140528953976576:ERROR:AccountantWorker:Bad FwkJobReport Path: /data/srv/wmagent/v1.0.21.patch2/install/wmagent/JobCreator/JobCache/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0161102_162051_8417/Production/JobCollection_11079_0/job_119483/Report.0.pkl
install/wmagent/ErrorHandler/ComponentLog:2016-11-03 20:36:36,076:140194997511936:ERROR:ErrorHandlerPoller:Failed to find FWJR for job 119483 in location /data/srv/wmagent/v1.0.21.patch2/install/wmagent/JobCreator/JobCache/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0161102_162051_8417/Production/JobCollection_11079_0/job_119483/Report.0.pkl
Please keep this workflow active if possible. I'll try to have a look at the job configurations to see if they were really meant to produce the same lumi sections (bear in mind this week is not very productive in terms of real work).
Just in case: these 2 different files really do contain the same lumi section 7591 (as cryptically reported above): https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root
7592 is also duplicated
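For anyone re-checking, here is a hedged sketch of querying the same DBS filelumis API programmatically; the proxy path is hypothetical and the response field names are the usual DBS3 ones:

```python
# Hedged sketch: list the run/lumi content of a merged file via the DBS
# filelumis API used in the URL above. The proxy path is hypothetical and the
# response field names (run_num, lumi_section_num) are the standard DBS3 ones.
import requests

proxy = "/tmp/x509up_u12345"  # hypothetical grid proxy location
url = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis"
lfn = ("/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut"
       "-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/"
       "F08187C8-35A2-E611-ADBF-A0000420FE80.root")

resp = requests.get(url, params={"logical_file_name": lfn},
                    cert=(proxy, proxy), verify=False,
                    headers={"Accept": "application/json"})
for entry in resp.json():
    print(entry["run_num"], entry["lumi_section_num"])
```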
We won't be able to keep the workflow in that status forever, as we have to deliver it. Any chance you can grab the information you need from the DB to understand this before we deliver it back to analysis?
Hi Jean-Roch, I copied the problem job reports here so you can process them: /afs/cern.ch/user/c/cmst1/test/duplicatelumi
The problem seems to be a mismatch between the FWJR and the WMBS records (on vocms0116).
Looking at WMBS, the following is the unmerged file containing lumis (7591, 7592):
id=378511 /store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
This merged file (id=565303, created by job 280836) contains the above unmerged file as a parent:
/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root
But the first merged file (id=395554, created by job 266543) does not (per the wmbs_file_parent table):
/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root
Yet both merged files contain the duplicate lumis, even though the second job's input did not include the file that contains them:
select * from wmbs_file_runlumi_map where fileid=395554 and lumi in (7591, 7592);
select * from wmbs_file_runlumi_map where fileid=565303 and lumi in (7591, 7592);
Looking at the FWJR for job 266543, the following input file seems to have lumis (7591, 7592):
id=284841 /store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
But WMBS has a different record (lumis 7499, 7500):
select run, lumi from wmbs_file_runlumi_map wm inner join wmbs_file_details wd on wm.fileid=wd.id where wd.lfn='/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root';
1 7499
1 7500
Looking at the source code, this is where the input file information is added:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/XMLParser.py#L120
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/Report.py#L493
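For orientation, here is a minimal sketch of pulling the per-input-file run/lumi information out of an FJR XML like the one quoted further down. This is not the actual WMCore parser; the report path is a placeholder.

```python
# Minimal sketch: extract run/lumi info for each <InputFile> in a CMSSW
# FrameworkJobReport XML (structure as quoted later in this thread).
# Not the actual WMCore XMLParser code; "Report.xml" is a placeholder.
import xml.etree.ElementTree as ET

def input_file_lumis(fjr_path):
    """Return {LFN: {run_id: [lumi, ...]}} for every <InputFile> block."""
    result = {}
    for infile in ET.parse(fjr_path).getroot().iter("InputFile"):
        lfn = infile.findtext("LFN")
        runs = {}
        for run in infile.iter("Run"):
            runs[int(run.get("ID"))] = [int(ls.get("ID"))
                                        for ls in run.iter("LumiSection")]
        result[lfn] = runs
    return result

print(input_file_lumis("Report.xml"))
```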
I am not completely sure how CMSSW gets the run/lumi information, but it looks like there is a difference between the run/lumi information in the CMSSW framework report and in WMBS. There might be some edge-case masking problem, since the lumi numbers are only off by 2. (CMSSW side?)
Here is the partial FWJR which contains the same lumi section, for job 266543-0:
"runs": {
"1": [
7591,
7592
]
},
"input_source_class": "PoolSource",
"input_type": "primaryFiles",
"lfn": "/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root",
"pfn": "/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root",
"catalog": "",
"module_label": "source",
"guid": "A685DDD8-5CA1-E611-9020-A0000420FE80",
"events": 323
For job 280836-0
{
"runs": {
"1": [
7591,
7592
]
},
"input_source_class": "PoolSource",
"input_type": "primaryFiles",
"lfn": "/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root",
"pfn": "root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root",
"catalog": "",
"module_label": "source",
"guid": "BEA7849C-FDA1-E611-B937-0CC47AD98B90",
"events": 323
},
@prozober please do. @ticoann that could be worth passing on to the EDM people over HN. @Dr15Jones
Do you have logs and configuration files for the jobs which produced output with the same lumis?
Hi Chris, here is the config file:
https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/0e64d45e2cc39207afb73e7fd4bb62f4/configFile
Right now I only have the condor logs. @amaltaro, Alan, does the logArchive contain the CMSSW log? Anyway, the logArchive contains the cmssw log; here are the PFNs, just for tracking purposes (they might already be deleted):
srm://srm.rcac.purdue.edu:8443/srm/v2/server?SFN=/mnt/hadoop//store/unmerged/logs/prod/2016/11/4/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/Production/ProductionMergeLHEoutput/0000/0/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz
srm://cmssrm.hep.wisc.edu:8443/srm/v2/server?SFN=/hdfs/store/unmerged/logs/prod/2016/11/4/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/Production/ProductionMergeLHEoutput/0000/0/4aadd12e-a2af-11e6-82a7-02163e017c3c-2-0-logArchive.tar.gz
If they are deleted, we can dig into the logCollect tarballs containing the above logArchives.
Let me know if you want me to send the log files via email. Thanks,
Yes, and it should also contain the original FJR (and the PSet)
On Nov 10, 2016, at 9:42 PM, ticoann notifications@github.com wrote:
Let me know if you want me to send the log files via email.
If you can make the log files, FJR and the PSet available (can just be in AFS or on the web) that would be very helpful.
Chris
Christopher Jones Fermi National Accelerator Laboratory cdj@fnal.gov
Of course the files are already deleted. :-(
I think these are the log files containing the above logArchive files. I am trying to copy the logs, but I am not sure how big they are, or whether I have enough space. Alan, do you usually copy them to your home dir? (I just need to use the xrdcp command, right? Do you know the command to check the size of a file?)
Thanks,
root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2016/11/WMAgent/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417-ProductionLHEoutputMergeLogCollect-g31n41-3-logs.tar
root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2016/11/WMAgent/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417-ProductionLHEoutputMergeLogCollect-cms-h003-11-logs.tar
Seangchan, these are the Merge logcollects and hopefully there is only one tarball inside each of them. I copied their logArchives over here:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/4aadd12e-a2af-11e6-82a7-02163e017c3c-2-0-logArchive.tar.gz
where you can find information for the 2 LFNs you previously mentioned:
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
According to the Merge FJR, indeed two files had lumi sections 7591 and 7592.
I asked Valentin/Yuyi to search for the logArchive where those 2 files were created.
And these are the logs that produced the unmerged files:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-795-1-logArchive.tar.gz
It can be seen in these FJRs: indeed one job produced a file with lumis 7499 and 7500, and the other job produced a file with lumis 7591 and 7592.
Seangchan, WMCore does not provide run/lumi mask for merge jobs, does it? We just provide the unmerged files to be merged and the framework takes care of it AND creates the FJR, that is then used by the agent for publishing this data and so on. Am I right?
IF so, then what needs to be figured out is why CMSSW wrote the wrong lumi sections into the FJR.
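For context, a minimal illustrative sketch of what such a maskless merge configuration amounts to (not the actual WMCore-generated PSet; the output file name is arbitrary):

```python
# Illustrative sketch only (not the WMCore-generated PSet): a merge job is
# essentially a PoolSource over the unmerged files plus a PoolOutputModule,
# with no lumisToProcess mask, so every LuminosityBlock read from the inputs
# is carried into the merged output as-is.
import FWCore.ParameterSet.Config as cms

process = cms.Process("Merge")
process.source = cms.Source("PoolSource",
    fileNames=cms.untracked.vstring(
        "/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root",
        "/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root",
    )
    # no lumisToProcess mask here: all input lumis end up in the output
)
process.Merged = cms.OutputModule("PoolOutputModule",
    fileName=cms.untracked.string("Merged.root")
)
process.outputPath = cms.EndPath(process.Merged)
```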
Thanks Alan for the logArchive files.
Seangchan, WMCore does not provide run/lumi mask for merge jobs, does it? We just provide the unmerged files to be merged and the framework takes care of it AND creates the FJR, that is then used by the agent for publishing this data and so on. Am I right?
Yes, you are right, we don't provide a run/lumi mask. @Dr15Jones, how does CMSSW get the run/lumi information from the unmerged file? (Does the file contain that information?)
Here are the log files Alan posted.
merge job
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/4aadd12e-a2af-11e6-82a7-02163e017c3c-2-0-logArchive.tar.gz
production job
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-795-1-logArchive.tar.gz
Hi everyone,
On Nov 11, 2016, at 5:27 PM, ticoann notifications@github.com wrote:
Thanks Alan for the logArchive files.
Seangchan, WMCore does not provide run/lumi mask for merge jobs, does it? We just provide the unmerged files to be merged and the framework takes care of it AND creates the FJR, that is then used by the agent for publishing this data and so on. Am I right?
Without a mask the framework would merge all Lumis into the output.
Yes, you are right, we don't provide a run/lumi mask. @Dr15Jones, how does CMSSW get the run/lumi information from the unmerged file? (Does the file contain that information?)
The LuminosityBlocks are known internally to the file. There really is no way the framework can misread that information.
Looking at the logs from the jobs, the framework does indeed say the files in question have the same luminosity blocks. The following are excerpts from the merge job cmsRun logs:
--------------------------merge a-------------
04-Nov-2016 11:58:16 CDT Successfully opened file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
%MSG-w MemoryCheck: source 04-Nov-2016 11:58:17 CDT PostBeginLumi
MemoryCheck: module source:source VSIZE 550.98 2.00391 RSS 158.156 2.98438
%MSG
Begin processing the 22934th record. Run 1, Event 1225786, LumiSection 7591 at 04-Nov-2016 11:58:17.120 CDT
...
Begin processing the 23256th record. Run 1, Event 1226108, LumiSection 7592 at 04-Nov-2016 11:58:17.302 CDT
04-Nov-2016 11:58:17 CDT Closed file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
04-Nov-2016 11:58:17 CDT Initiating request to open file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F6C514D5-5CA1-E611-857C-003048F5B69A.root
04-Nov-2016 11:58:17 CDT Successfully opened file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F6C514D5-5CA1-E611-857C-003048F5B69A.root
%MSG-w MemoryCheck: source 04-Nov-2016 11:58:17 CDT PostBeginLumi
MemoryCheck: module source:source VSIZE 552.984 2.00391 RSS 159.082 0.464844
%MSG
Begin processing the 23257th record. Run 1, Event 1227078, LumiSection 7599 at 04-Nov-2016 11:58:17.543 CDT
...
----------------------------merge b-----------
03-Nov-2016 22:25:19 EDT Successfully opened file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
Begin processing the 111759th record. Run 1, Event 1225786, LumiSection 7591 at 03-Nov-2016 22:25:19.258 EDT
Begin processing the 112081st record. Run 1, Event 1226108, LumiSection 7592 at 03-Nov-2016 22:25:19.352 EDT
03-Nov-2016 22:25:19 EDT Closed file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
03-Nov-2016 22:25:19 EDT Initiating request to open file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/4203A3B8-5CA1-E611-A8FE-24BE05CE4D91.root
03-Nov-2016 22:25:19 EDT Successfully opened file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/4203A3B8-5CA1-E611-A8FE-24BE05CE4D91.root
%MSG-w MemoryCheck: source 03-Nov-2016 22:25:19 EDT PostBeginLumi
MemoryCheck: module source:source VSIZE 547.305 0 RSS 298.508 0.40625
%MSG
Begin processing the 112082nd record. Run 1, Event 1211251, LumiSection 7501 at 03-Nov-2016 22:25:19.431 EDT
Is it possible to get access to the files in question? How do you know that the files here match the logs for the production?
Chris
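Since the LuminosityBlocks live inside the file itself, a direct cross-check (had the files survived) could be done with FWLite. A hedged sketch, assuming a CMSSW Python environment and using the local PFN quoted in the merge FJR:

```python
# Hedged FWLite sketch: print the run/lumi pairs actually recorded inside an
# EDM file, independent of any job report. The PFN is the one quoted in the
# merge job's FJR later in this thread.
from DataFormats.FWLite import Lumis

lumis = Lumis("/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root")
for lumi in lumis:
    aux = lumi.luminosityBlockAuxiliary()
    print("run %d lumi %d" % (aux.run(), aux.luminosityBlock()))
```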
Thanks Chris for looking into the problem,
Is it possible to get access to the files in question?
I think it is all deleted again. I tried to get them from the worker node but they weren't there, and we don't keep the unmerged files anywhere else. (@amaltaro, Alan, that is correct, right?)
How do you know that the files here match the logs for the production?
We keep track on the agent side as well.
To summarize how the accounting process goes in the agent: what we found is basically an unmatched record for the following file.
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
The FWJR for the production job that wrote the above output file contains run 1, lumis 7499 and 7500. But the FWJR for the merge job that read the above file indicates it has run 1, lumis 7591 and 7592.
Although it is the same file, I am guessing the second piece of information (the FWJR from the merge job) is incorrect. But I am not sure how CMSSW gets that information, or how the difference could happen.
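To make the mismatch concrete, a small sketch of the comparison; the dicts are hand-filled from the two reports quoted in this thread (in practice they could come from a parser like the input_file_lumis sketch earlier):

```python
# Hedged sketch: diff what the production FJR and the merge FJR claim about
# the same LFN. Values below are hand-copied from the two reports quoted in
# this thread, not read from the actual files.
lfn = ("/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut"
       "-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/"
       "A685DDD8-5CA1-E611-9020-A0000420FE80.root")
prod_fjr = {lfn: {1: [7499, 7500]}}   # production job's output record
merge_fjr = {lfn: {1: [7591, 7592]}}  # merge job's input record

for name in set(prod_fjr) & set(merge_fjr):
    if prod_fjr[name] != merge_fjr[name]:
        print("MISMATCH for %s" % name)
        print("  production FJR: %s" % prod_fjr[name])
        print("  merge FJR:      %s" % merge_fjr[name])
```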
If you look at the FWJR from the production job, it says lumi sections 7499 and 7500:
<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root</LFN>
<PFN>LHEoutput.root</PFN>
<Catalog></Catalog>
<ModuleLabel>LHEoutput</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
<Branches>
<Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<OutputModuleClass>PoolOutputModule</OutputModuleClass>
<TotalEvents>323</TotalEvents>
<DataType>MC</DataType>
<BranchHash>baad18a51a4cbab54f85db48954b2717</BranchHash>
<Runs>
<Run ID="1">
<LumiSection ID="7499"/>
<LumiSection ID="7500"/>
</Run>
But from the merge job FWJR, for the same file the lumi sections are 7591 and 7592 (from http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz):
<InputFile>
<State Value="closed"/>
<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root</LFN>
<PFN>/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
<Branches>
<Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<InputType>primaryFiles</InputType>
<InputSourceClass>PoolSource</InputSourceClass>
<EventsRead>323</EventsRead>
<Runs>
<Run ID="1">
<LumiSection ID="7591"/>
<LumiSection ID="7592"/>
</Run>
On Nov 11, 2016, at 9:54 PM, ticoann notifications@github.com wrote:
The FWJR for the production job for above output file contains run 1, lumi 7499, 7500. But FWJR for the merge job including above file indicates it has run 1, lumi 7501, 7502.
I believe you mistyped and meant 7591 and 7592 (since that is what is shown in the cmsRun log file).
Although it is the same file.
That is actually an assumption. We do not have any proof it is actually the same file, since we can't look at it.
I am guessing the second piece of information (the FWJR from the merge job) is incorrect.
No, that information is correct. The framework reads the LuminosityBlock information from the file. From the actual cmsRun log file output we can see that the framework read the LuminosityBlocks as 7591 and 7592, which does not match what the log file for production said the job which supposedly created that file used as LuminosityBlock numbers.
The most likely scenario to me is either 1) the production log file you pointed me to was not the one that created that file, OR 2) the file isn't the file the system thinks it is.
Either is pretty scary.
I do not see any sign this is a framework problem as the log files produced by cmsRun match exactly with the FrameworkJobReport for those same logs.
Chris
No, we don't keep the unmerged files. Once they are merged, there is a cleanup job that comes in and throws them into the trash bin.
I do not see any sign this is a framework problem as the log files produced by cmsRun match exactly with the FrameworkJobReport for those same logs.
I agree, for the production jobs. However, each of those two merge jobs read these files and in their fjr they report different lumi sections.
I believe you mistyped and meant 7591 and 7592 (since that is what is shown in the cmsRun log file).
Sorry my mistake. updated. :smile:
The most likely scenario to me is either 1) the production log file you pointed me to was not the one that created that file OR 2) the file isn't the file the system thinks it is
I am still confused how this could explain the difference between the FWJR for the unmerged file and the merge job result for exactly the same file. Maybe 2) does.
This is weird: since the framework reads the lumi information directly from the file, both should be the same. (Unless there is a UUID collision and a file with the same name is created afterwards, which is unlikely since we do have the file with those lumis.)
On Nov 11, 2016, at 10:26 PM, Alan Malta Rodrigues notifications@github.com wrote:
No, we don't keep the unmerged files. Once they are merged, there is a cleanup job that comes in and throws them into the trash bin.
I do not see any sign this is a framework problem as the log files produced by cmsRun match exactly with the FrameworkJobReport for those same logs.
I agree, for the production jobs. However, each of those two merge jobs read these files and in their fjr they report different lumi sections.
In all cases the log file output of cmsRun matches the framework job report made by that same cmsRun job. So that backs the claim that the LuminosityBlock recorded in the file being read is what is in the log file.
Therefore the most likely conclusion is the file written by the production job is not actually the same file being read by the merge job.
Do you have the final merge files? In principle we could look at the values for the events in those 'same' LuminosityBlocks and see if they are actually the same events.
Chris
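A hedged FWLite sketch of that event-level check on one of the merged files (the global xrootd redirector is an assumption; the LFN is one of the duplicated merged files from DBS above):

```python
# Hedged sketch: dump run/lumi/event IDs in the duplicated lumis of one
# merged file, so the two files' entries can be compared side by side.
from DataFormats.FWLite import Events

pfn = ("root://cms-xrd-global.cern.ch//store/mc/RunIISummer15wmLHEGS/"
       "ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/"
       "MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root")
for event in Events(pfn):
    aux = event.eventAuxiliary()
    if aux.luminosityBlock() in (7591, 7592):
        print("run %d lumi %d event %d"
              % (aux.run(), aux.luminosityBlock(), aux.event()))
```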
On Nov 11, 2016, at 10:27 PM, ticoann notifications@github.com wrote:
This is weird since framework reads Lumi information directly from file, both should be the same. (Unless there is uuid collision and same name of the file is create after which is unlikely since we have the file with those lumis).
Looking at the configuration file given to cmsRun, the framework is told to use a very generic name for the output file:
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
something in the workflow management system then renames the file. Therefore we actually have no idea if the file read by the merge job is actually the one produced by the production job.
Chris
The production file is not the same file as used in the merge file. The GUIDs do not match.
From the FJR of the production job we have
Hi Chris, I think you are looking at the wrong file. The GUID matches the production job. (A685DDD8-5CA1-E611-9020-A0000420FE80)
<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root</LFN>
<PFN>LHEoutput.root</PFN>
<Catalog></Catalog>
<ModuleLabel>LHEoutput</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
<Branches>
<Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<OutputModuleClass>PoolOutputModule</OutputModuleClass>
<TotalEvents>323</TotalEvents>
<DataType>MC</DataType>
<BranchHash>baad18a51a4cbab54f85db48954b2717</BranchHash>
<Runs>
<Run ID="1">
<LumiSection ID="7499"/>
<LumiSection ID="7500"/>
</Run>
Looking at the configuration file given to cmsRun, the framework is told to use a very generic name for the output file:
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
something in the workflow management system then renames the file. Therefore we actually have no idea if the file read by the merge job is actually the one produced by the production job.
Does that mean the file is actually created with the generic name and renamed later? The actual file name is changed from LHEoutput to the GUID in the FWJR:
<ModuleLabel>LHEoutput</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
Wouldn't that cause a race condition if two jobs put their files in the same directory? Is it the agent code which renames the file?
On Nov 11, 2016, at 11:02 PM, ticoann notifications@github.com wrote:
Hi Chris, I think you are looking at the wrong file. The GUID matches the production job. (A685DDD8-5CA1-E611-9020-A0000420FE80)
The confusion was there are two files written by the production job, both with very similar GUIDs
and
After talking to Chris, Eric,
I am guessing this is what happens (but I don't have concrete evidence). For production job 266543:
1) the output is written to the generic path
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
2) the file is renamed to its GUID-based name.
In this case there could be a race condition: another job can write to the same generic path
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
before step 2) happens. Basically, the file the first job wrote to that path is overwritten by the second job before it gets renamed, so the first unmerged output file is lost and the second unmerged output file is duplicated into different files. In our case,
/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root
and
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
To fix this, the following patch might work: https://github.com/dmwm/WMCore/commit/84cfa939ee56733e8e6fa2b026ae69889b8b740e
But I still can't find where step 2) happens, which would affect the patch. @amaltaro, do you know where that code resides?
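A toy sketch of the hypothesized interleaving, with local files standing in for the shared storage paths (this is not WMCore code):

```python
# Toy illustration of the suspected race (hypothetical local files, not
# WMCore code): two jobs stage out via the same generic name and only then
# copy it to a GUID-based name. If job B writes between A's write and A's
# copy, A's output is lost and B's content shows up under both GUID names.
import shutil

GENERIC = "LHEoutput.root"  # shared, non-unique staging name

def write_output(content):
    # step 1: the job writes its output under the generic name
    with open(GENERIC, "w") as f:
        f.write(content)

def stage_out(guid):
    # step 2: the output is copied to its GUID-based name
    shutil.copyfile(GENERIC, guid + ".root")

# Interleaving that reproduces the duplicate-lumi symptom:
write_output("lumis 7499,7500")   # job A writes
write_output("lumis 7591,7592")   # job B overwrites the same generic file
stage_out("A685DDD8")             # job A stages out -> it gets B's lumis
stage_out("BEA7849C")             # job B stages out -> same lumis again
print(open("A685DDD8.root").read())
print(open("BEA7849C.root").read())
```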
Just to clarify, the framework does not copy the unmerged file. The framework just writes the file in the local directory of the worker node. The copy is done by the workflow system.
The file
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
never exists. It's a bogus entry in some intermediate reports that is then supplemented by the stageout code, which creates the real file on the mass storage. Are you saying we actually wrote an LHEoutput.root file to mass storage???
Chris told me that (/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root) is what the CMSSW framework writes, not the unmerged file which exists (/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root).
@Dr15Jones, Chris, did I misunderstand what you told me?
The framework writes ./LHEoutput.root.
But where? Which parameter is it using for the current directory?
I'm pretty sure ./LHEoutput.root is written in the pilot scratch area. When it's moved to the storage, it's moved with a UID.
The GUID makes it into the report here
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/FileInfo.py#L72
This is also where we override the report to account for direct stageout to merged.
One comment: this assumes that the report already contains
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
so either @Dr15Jones is wrong about the ./LHEoutput.root or we have yet another piece of wrapper code.
And yes, we write LHEoutput.root into the current working directory, but that doesn't necessarily mean that the FJR also reports ./LHEoutput.root.
Hmm, then I am not sure how this happens. The pilot scratch area is unique to a job, right? Then where is the following parameter used? https://github.com/dmwm/WMCore/blob/master/src/python/PSetTweaks/WMTweak.py#L509
I believe the FJR reports the 'logical' file name as passed to it from the configuration.
Yep, the LFN reported by the FJR has NOTHING to do with where the job puts the file on the local filesystem. It's a LFN that only has meaning for the stageout as a remote directory.
@hufnagel, so the PFN below is the current working (job running) directory,
<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root</LFN>
<PFN>LHEoutput.root</PFN>
and stageout moves the file directly to the PFN below (meaning the file name is changed not after the file is copied but before, correct?)
<PFN>/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root</PFN>
No. Current working directory is unknown, no idea what it is. We don't need to know.
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/
has no meaning whatsoever as far as the worker node file system is concerned. It's a string, nothing more.
That string + GUID IS the stageout location though.
Renamed is the wrong word here. These are all just string manipulations; the first time this corresponds to an actual path and file on a disk somewhere is for the stageout target.
I.e. the LFN from the CMSSW FJR does NOT correspond to an actual file somewhere, it's completely bogus. It's just a way for us to forward the UnmergedLFNBase from the workflow configuration to the FJR.
Not sure why this was implemented like this. Before my time. We could also have skipped the LFN completely and in
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/FileInfo.py#L72
constructed the LFN from scratch by doing UnmergedLFNBase+GUID.
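In other words, a hedged sketch of the string manipulation being described (not the actual FileInfo.py code):

```python
# Hedged sketch (not the actual FileInfo.py code): the final unmerged LFN is
# just a string join of the UnmergedLFNBase and the GUID; no file with the
# generic LHEoutput.root name ever exists on shared storage.
import os

unmerged_lfn_base = ("/store/unmerged/RunIISummer15wmLHEGS/"
                     "ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/"
                     "LHE/MCRUN2_71_V1-v1/130000")
guid = "A685DDD8-5CA1-E611-9020-A0000420FE80"

lfn = os.path.join(unmerged_lfn_base, guid + ".root")
print(lfn)  # ends in .../130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
```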
Two different pilots never run in the same work space (directory). That is a correct statement, right? So now I don't know how this happened, then. 😞
Correct.
I haven't had time to follow all this, but the LHEOutput.root thing smacks of cmsRun forking off a script to produce the LHE file or something like this. There might actually be code in cmssw that deals with this. IIRC these files are produced in a strange way and @vlimant worked on this.
Ok, so let me summarize:
1) We have the two processing job logArchives, and according to the logs and FJRs we wrote out different lumis.
2) We have the log and FJR of the merge job, and according to the log and FJR it
a) read in the two files from the jobs in 1), and
b) read in different lumis than what was written in 1).
Is that correct and all confirmed?
Hm, so it could be a bug in that code which causes the lumis for one of the jobs in 1) to be reported wrong? Is that what you are saying @ericvaandering?
The LHEOutput.root file is written by the standard PoolOutputModule. The file format you are thinking about ends with .lhe.
I believe there were two separate merge jobs. Both jobs report having merged the same 2 lumis.
I believe there were two separate merge jobs. Both jobs report having merged the same 2 lumis.
Ok, but one of these merge jobs had an input that was an output of one of the jobs in 1), and there is a match in file name but a mismatch in lumis?
If so, sometime between the job in 1) writing its output and the merge job in 2) reading its input, the content of the file changed. Which is a bit scary... Apart from the name, does other metadata match? Like the number of events written by the job in 1) and the number of events read by the merge job in 2)?
For this workflow, which is the only one writing this output:
pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417
event completion real 1999031 expected 2000000
event completion real 1997739 expected 2000000
pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417 has duplicates
{
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM": false,
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/LHE": true
}
{
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM": {},
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/LHE": {
    "0:7592": [
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root",
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root"
    ],
    "0:7591": [
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root",
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root"
    ]
  }
}
It's the second instance in not too long, so there must be something wrong. @prozober