dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Duplicate lumisection in single workflow #7371

Closed vlimant closed 5 years ago

vlimant commented 8 years ago

For this workflow which is the only one writing in its output

pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417

event completion real 1999031 expected 2000000
event completion real 1997739 expected 2000000

pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417 has duplicates:

{
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM": false,
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/LHE": true
}

{
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/GEN-SIM": {},
  "/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/RunIISummer15wmLHEGS-MCRUN2_71_V1-v1/LHE": {
    "0:7592": [
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root",
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root"
    ],
    "0:7591": [
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root",
      "/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root"
    ]
  }
}

It's the second such instance in not too long, so there must be something wrong. @prozober
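The duplicate check reported above amounts to inverting the file-to-lumi mapping per output dataset and flagging every lumi section claimed by more than one file. A minimal sketch of that idea (function and file names are illustrative, not the actual monitoring code):

```python
from collections import defaultdict

def find_duplicate_lumis(file_lumis):
    """Given {LFN: [(run, lumi), ...]}, return {"run:lumi": [LFNs]} for
    every lumi section claimed by more than one file in the dataset."""
    claims = defaultdict(list)
    for lfn, lumis in file_lumis.items():
        for run, lumi in lumis:
            claims["%d:%d" % (run, lumi)].append(lfn)
    # keep only lumi sections present in two or more files
    return {key: lfns for key, lfns in claims.items() if len(lfns) > 1}

# hypothetical inputs mirroring the report above: both files claim lumi 7591
dups = find_duplicate_lumis({
    "fileA.root": [(0, 7591), (0, 7592)],
    "fileB.root": [(0, 7591), (0, 7593)],
})
```

With the two real LFNs above, both claiming 0:7591 and 0:7592, this yields exactly the structure shown in the report.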

paorozo commented 8 years ago

The only thing I found in the agent (vocms0116) were logs like these [1]. @amaltaro do you think these errors could cause duplicates?

[1]
install/wmagent/JobAccountant/ComponentLog:2016-11-03 20:34:08,337:140528953976576:ERROR:AccountantWorker:Bad FwkJobReport Path: /data/srv/wmagent/v1.0.21.patch2/install/wmagent/JobCreator/JobCache/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0161102_162051_8417/Production/JobCollection_11079_0/job_119483/Report.0.pkl
install/wmagent/ErrorHandler/ComponentLog:2016-11-03 20:36:36,076:140194997511936:ERROR:ErrorHandlerPoller:Failed to find FWJR for job 119483 in location /data/srv/wmagent/v1.0.21.patch2/install/wmagent/JobCreator/JobCache/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0161102_162051_8417/Production/JobCollection_11079_0/job_119483/Report.0.pkl

amaltaro commented 8 years ago

Please keep this workflow active if possible. I'll try to have a look at the job configurations to see if they were really meant to produce the same lumi sections (bear in mind this week is not very productive in terms of real work).

Just in case: these two different files really did produce the same lumi section 7591 (as cryptically reported above): https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?logical_file_name=/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root
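Checking two such DBS filelumis results for overlap is easy to script once the records are in hand. A sketch assuming the DBS3 `filelumis` record layout (a list of dicts, each carrying a `lumi_section_num` list); `lumi_overlap` is an illustrative helper, not part of any DBS client:

```python
def lumi_overlap(recs_a, recs_b):
    """Given two 'filelumis' result lists (dicts with a 'lumi_section_num'
    list each), return the set of lumi sections present in both files."""
    def lumis(recs):
        out = set()
        for rec in recs:
            out.update(rec["lumi_section_num"])
        return out
    return lumis(recs_a) & lumis(recs_b)

# hypothetical records shaped like the two DBS responses linked above
overlap = lumi_overlap(
    [{"run_num": 1, "lumi_section_num": [7591, 7592]}],
    [{"run_num": 1, "lumi_section_num": [7591, 7599]}],
)
```

A non-empty overlap between two files of the same dataset is exactly the duplication being discussed here.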

vlimant commented 8 years ago

7592 is also duplicated

vlimant commented 8 years ago

We won't be able to keep the workflow in that status forever, as we have to deliver it. Any chance you can grab the information you need from the db to understand this before we deliver this back to analysis ?

ticoann commented 8 years ago

Hi Jean-Roch, you can proceed. I copied the problem job reports here: /afs/cern.ch/user/c/cmst1/test/duplicatelumi

The problem seems to be a difference between the fwjr and the wmbs records. (vocms0116)

Looking at wmbs, the following is the unmerged file containing lumis (7591, 7592):

id=378511 /store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root

The merged file below contains the above unmerged file as a parent: id=565303 (merged one), created by job 280836

/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root

But the first one does not (in the wmbs_file_parent table): id=395554, created by job 266543

/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

But both merged files contain the duplicate lumis, even though the first one does not have (as a parent) the input file which contains those lumis.

select * from wmbs_file_runlumi_map where fileid=395554 and lumi in (7591, 7592);
select * from wmbs_file_runlumi_map where fileid=565303 and lumi in (7591, 7592);

Looking at the fwjr for job 266543, the following input file seems to have lumis (7591, 7592):

id=284841 /store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root

But wmbs has a different record (lumis 7499, 7500):

select run, lumi from wmbs_file_runlumi_map wm inner join wmbs_file_details wd on wm.fileid=wd.id where wd.lfn='/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root';

1 7499
1 7500

Looking at the source code, this is where the input file information is added:

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/XMLParser.py#L120
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/Report.py#L493

I am not completely sure how CMSSW gets the run/lumi information, but it looks like there is a difference between the run/lumi information in the CMSSW framework report and the run/lumi information in wmbs. There might be some edge-case masking problem, since the lumi numbers are only off by 2. (CMSSW side?)
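The FWJR-versus-wmbs mismatch described here can be checked mechanically once both sides are extracted. A sketch with hypothetical helper names: `fwjr_runs` has the shape of the `"runs"` dict in the FWJR excerpts quoted in this thread, and `wmbs_rows` stands in for rows from wmbs_file_runlumi_map:

```python
def diff_run_lumis(fwjr_runs, wmbs_rows):
    """Compare the 'runs' dict from a FWJR file record, e.g.
    {"1": [7591, 7592]}, against (run, lumi) rows from the wmbs db.
    Returns (only_in_fwjr, only_in_wmbs); both empty means they agree."""
    fwjr = {(int(run), lumi) for run, lumis in fwjr_runs.items() for lumi in lumis}
    wmbs = set(wmbs_rows)
    return fwjr - wmbs, wmbs - fwjr

# the actual discrepancy from this issue: merge-job FWJR vs wmbs record
only_fwjr, only_wmbs = diff_run_lumis({"1": [7591, 7592]},
                                      [(1, 7499), (1, 7500)])
```

For a healthy file both returned sets are empty; here neither is, which is the inconsistency under investigation.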

Here is the partial fwjr which contains the same lumi section

For job 266543-0:

{
    "runs": {
        "1": [
            7591,
            7592
        ]
    },
    "input_source_class": "PoolSource",
    "input_type": "primaryFiles",
    "lfn": "/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root",
    "pfn": "/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root",
    "catalog": "",
    "module_label": "source",
    "guid": "A685DDD8-5CA1-E611-9020-A0000420FE80",
    "events": 323
}

For job 280836-0:

{
    "runs": {
        "1": [
            7591,
            7592
        ]
    },
    "input_source_class": "PoolSource",
    "input_type": "primaryFiles",
    "lfn": "/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root",
    "pfn": "root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root",
    "catalog": "",
    "module_label": "source",
    "guid": "BEA7849C-FDA1-E611-B937-0CC47AD98B90",
    "events": 323
}
vlimant commented 8 years ago

@prozober please do. @ticoann that could be worth passing on to the EDM people over HN. @Dr15Jones

Dr15Jones commented 8 years ago

Do you have logs and configuration files for the jobs which produced output with the same lumis?

ticoann commented 8 years ago

Hi Chris, Here is the config file.

https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/0e64d45e2cc39207afb73e7fd4bb62f4/configFile

Right now I only have the condor logs. @amaltaro, Alan, does the logArchive contain the CMSSW log? Anyway, assuming the logArchive contains the cmssw log, here are the PFNs (just for tracking purposes; they might already be deleted):

srm://srm.rcac.purdue.edu:8443/srm/v2/server?SFN=/mnt/hadoop//store/unmerged/logs/prod/2016/11/4/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/Production/ProductionMergeLHEoutput/0000/0/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz

srm://cmssrm.hep.wisc.edu:8443/srm/v2/server?SFN=/hdfs/store/unmerged/logs/prod/2016/11/4/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/Production/ProductionMergeLHEoutput/0000/0/4aadd12e-a2af-11e6-82a7-02163e017c3c-2-0-logArchive.tar.gz

If they are deleted, we can dig into the LogCollect tarballs containing the above logArchives.

Let me know if you want me to send the log files via email. Thanks,

amaltaro commented 8 years ago

Yes, and it should also contain the original FJR (and the PSet)

Dr15Jones commented 8 years ago

On Nov 10, 2016, at 9:42 PM, ticoann notifications@github.com wrote:

Let me know if you want me to send the log files via email.

If you can make the log files, FJR and the PSet available (can just be in AFS or on the web) that would be very helpful.

Chris

Christopher Jones Fermi National Accelerator Laboratory cdj@fnal.gov

ticoann commented 8 years ago

Of course the files are already deleted. :-(.

I think these are the log files containing the above logArchive files. I am trying to copy the logs, but I am not sure how big they are or whether I have enough space. Alan, do you usually copy them to your home dir? (I just need to use the xrdcp command, right? Do you know the command to check the size of a file?)

Thanks,

root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2016/11/WMAgent/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417-ProductionLHEoutputMergeLogCollect-g31n41-3-logs.tar

root://castorcms.cern.ch//castor/cern.ch/cms/store/logs/prod/2016/11/WMAgent/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417/pdmvserv_TOP-RunIISummer15wmLHEGS-00054_00206_v0__161102_162051_8417-ProductionLHEoutputMergeLogCollect-cms-h003-11-logs.tar
amaltaro commented 8 years ago

Seangchan, these are the Merge LogCollects and hopefully there is only one tarball inside each of them. I copied their logArchives over here:

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/4aadd12e-a2af-11e6-82a7-02163e017c3c-2-0-logArchive.tar.gz

where you can find information for the 2 LFNs you previously mentioned:

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root

According to the Merge FJR, indeed two files had lumi sections 7591 and 7592.

I asked Valentin/Yuyi to search for the logArchive where those 2 files were created.

amaltaro commented 8 years ago

And these are the logs of the jobs that produced the unmerged files:

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-795-1-logArchive.tar.gz

As can be seen in these FJRs, one job produced a file with lumis 7499 and 7500, and the other job produced a file with lumis 7591 and 7592.

Seangchan, WMCore does not provide a run/lumi mask for merge jobs, does it? We just provide the unmerged files to be merged and the framework takes care of it AND creates the FJR, which is then used by the agent for publishing this data and so on. Am I right?

IF so, then what needs to be figured out is why CMSSW wrote the wrong lumi sections into the FJR.

ticoann commented 8 years ago

Thanks Alan for the logAchive files.

Seangchan, WMCore does not provide run/lumi mask for merge jobs, does it? We just provide the unmerged files to be merged and the framework takes care of it AND creates the FJR, that is then used by the agent for publishing this data and so on. Am I right?

Yes, you are right, we don't provide a run/lumi mask. @Dr15Jones, how does CMSSW get the run/lumi information from an unmerged file? (Does the file contain that information?)

Here are the log files Alan posted.

merge jobs:

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/4aadd12e-a2af-11e6-82a7-02163e017c3c-2-0-logArchive.tar.gz

production job

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-795-1-logArchive.tar.gz

Dr15Jones commented 8 years ago

Hi everyone,

On Nov 11, 2016, at 5:27 PM, ticoann notifications@github.com wrote:

Thanks Alan for the logAchive files.

Seangchan, WMCore does not provide run/lumi mask for merge jobs, does it? We just provide the unmerged files to be merged and the framework takes care of it AND creates the FJR, that is then used by the agent for publishing this data and so on. Am I right?

Without a mask the framework would merge all Lumis into the output.

Yes you are right. We don't provide run/lumi mask. @Dr15Jones, how CMSSW gets the run lumi information from unmerged file. (Is the file contains that information?)

The LuminosityBlocks are known internally to the file. There really is no way the framework can misread that information.

Looking at the logs from the jobs, the framework does indeed say the files in question have the same luminosity blocks. The following are excerpts from the merge jobs' cmsRun logs:

-------------------------- merge a --------------------------

04-Nov-2016 11:58:16 CDT  Successfully opened file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
%MSG-w MemoryCheck:  source 04-Nov-2016 11:58:17 CDT PostBeginLumi
MemoryCheck: module source:source VSIZE 550.98 2.00391 RSS 158.156 2.98438
%MSG
Begin processing the 22934th record. Run 1, Event 1225786, LumiSection 7591 at 04-Nov-2016 11:58:17.120 CDT
...
Begin processing the 23256th record. Run 1, Event 1226108, LumiSection 7592 at 04-Nov-2016 11:58:17.302 CDT
04-Nov-2016 11:58:17 CDT  Closed file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root
04-Nov-2016 11:58:17 CDT  Initiating request to open file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F6C514D5-5CA1-E611-857C-003048F5B69A.root
04-Nov-2016 11:58:17 CDT  Successfully opened file root://cmsxrootd.hep.wisc.edu//store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F6C514D5-5CA1-E611-857C-003048F5B69A.root
%MSG-w MemoryCheck:  source 04-Nov-2016 11:58:17 CDT PostBeginLumi
MemoryCheck: module source:source VSIZE 552.984 2.00391 RSS 159.082 0.464844
%MSG
Begin processing the 23257th record. Run 1, Event 1227078, LumiSection 7599 at 04-Nov-2016 11:58:17.543 CDT
...

-------------------------- merge b --------------------------

03-Nov-2016 22:25:19 EDT  Successfully opened file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
Begin processing the 111759th record. Run 1, Event 1225786, LumiSection 7591 at 03-Nov-2016 22:25:19.258 EDT
Begin processing the 112081st record. Run 1, Event 1226108, LumiSection 7592 at 03-Nov-2016 22:25:19.352 EDT
03-Nov-2016 22:25:19 EDT  Closed file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
03-Nov-2016 22:25:19 EDT  Initiating request to open file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/4203A3B8-5CA1-E611-A8FE-24BE05CE4D91.root
03-Nov-2016 22:25:19 EDT  Successfully opened file /mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/4203A3B8-5CA1-E611-A8FE-24BE05CE4D91.root
%MSG-w MemoryCheck:  source 03-Nov-2016 22:25:19 EDT PostBeginLumi
MemoryCheck: module source:source VSIZE 547.305 0 RSS 298.508 0.40625
%MSG
Begin processing the 112082nd record. Run 1, Event 1211251, LumiSection 7501 at 03-Nov-2016 22:25:19.431 EDT

Is it possible to get access to the files in question? How do you know that the files here match the logs for the production?

Chris
ticoann commented 8 years ago

Thanks Chris for looking into the problem,

Is it possible to get access to the files in question?

I think they are all deleted again. I tried to get it from the worker node but it wasn't there. And we don't keep the unmerged files anywhere else. (@amaltaro, Alan, that is correct, right?)

How do you know that the files here match the logs for the production?

We keep track on the agent side as well.

To summarize how the accounting process goes in the agent:

  1. A Production (or Processing) job is created with a specific job splitting algorithm (a mask is passed to each job). When the job finishes, the output file info comes in the FWJR (the WMAgent FWJR, which contains the run/lumi information for the output file taken from the CMSSW FWJR). We record this in the wmbs db.
  2. When a merge job is created and submitted, the same thing happens.

What we found is basically an unmatched record for the following file:

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root

The FWJR for the production job for the above output file contains run 1, lumis 7499 and 7500. But the FWJR for the merge job that includes the above file indicates it has run 1, lumis 7591 and 7592.

Although it is the same file, I am guessing the second piece of information (the FWJR from the merge job) is incorrect. But I am not sure how CMSSW gets that information or how the difference could happen.

ticoann commented 8 years ago

If you look at the fwjr from the production job, it says lumi sections 7499 and 7500:

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz

<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root</LFN>
<PFN>LHEoutput.root</PFN>
<Catalog></Catalog>
<ModuleLabel>LHEoutput</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
<Branches>
  <Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<OutputModuleClass>PoolOutputModule</OutputModuleClass>
<TotalEvents>323</TotalEvents>

<DataType>MC</DataType>

<BranchHash>baad18a51a4cbab54f85db48954b2717</BranchHash>

<Runs>
<Run ID="1">
   <LumiSection ID="7499"/>
   <LumiSection ID="7500"/>
</Run>

but from the merge job fwjr, for the same file, the lumi sections are 7591, 7592: http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/a7ca0da2-a235-11e6-82a7-02163e017c3c-0-0-logArchive.tar.gz

<InputFile>
<State  Value="closed"/>
<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root</LFN>
<PFN>/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
<Branches>
  <Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<InputType>primaryFiles</InputType>
<InputSourceClass>PoolSource</InputSourceClass>
<EventsRead>323</EventsRead>
<Runs>
<Run ID="1">
   <LumiSection ID="7591"/>
   <LumiSection ID="7592"/>
</Run>
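Pulling the run/lumi lists out of the two FWJR XML excerpts above can be automated with the standard library. A sketch using `xml.etree.ElementTree`; `input_file_lumis` is an illustrative helper, not part of WMCore's FwkJobReport parser:

```python
import xml.etree.ElementTree as ET

def input_file_lumis(fwjr_xml):
    """Extract {LFN: [(run, lumi), ...]} from the <InputFile> blocks of a
    CMSSW FrameworkJobReport XML string."""
    root = ET.fromstring(fwjr_xml)
    result = {}
    for infile in root.iter("InputFile"):
        lfn = infile.findtext("LFN").strip()
        pairs = []
        for run in infile.iter("Run"):
            run_id = int(run.get("ID"))
            for ls in run.iter("LumiSection"):
                pairs.append((run_id, int(ls.get("ID"))))
        result[lfn] = pairs
    return result

# a tiny well-formed sample shaped like the excerpt above (paths shortened)
sample = (
    '<FrameworkJobReport><InputFile>'
    '<LFN>/store/unmerged/test/A685DDD8.root</LFN>'
    '<Runs><Run ID="1"><LumiSection ID="7591"/><LumiSection ID="7592"/></Run></Runs>'
    '</InputFile></FrameworkJobReport>'
)
lumis = input_file_lumis(sample)
```

Running this over the production and merge FJRs for the same LFN makes the 7499/7500 versus 7591/7592 discrepancy immediately visible.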
Dr15Jones commented 8 years ago

On Nov 11, 2016, at 9:54 PM, ticoann notifications@github.com wrote:

The FWJR for the production job for above output file contains run 1, lumi 7499, 7500. But FWJR for the merge job including above file indicates it has run 1, lumi 7501, 7502.

I believe you mistyped and meant 7591 and 7592 (since that is what is shown in the cmsRun log file).

Although it is the same file.

That is actually an assumption. We do not have any proof it is actually the same file since we can't look at it.

I am guessing second information (FWJR from merge job) is incorrect.

No, that information is correct. The framework reads the LuminosityBlock information from the file. From the actual cmsRun log file output we can see that the framework read the LuminosityBlocks as 7591 and 7592, which does not match what the log file for production said the job which supposedly created that file used as LuminosityBlock numbers.

The most likely scenario to me is either 1) the production log file you pointed me to was not the one that created that file, OR 2) the file isn't the file the system thinks it is.

Either is pretty scary.

I do not see any sign this is a framework problem as the log files produced by cmsRun match exactly with the FrameworkJobReport for those same logs.

    Chris
amaltaro commented 8 years ago

No, we don't keep the unmerged files. Once they are merged, a cleanup job comes in and throws them into the trash bin.

I do not see any sign this is a framework problem as the log files produced by cmsRun match exactly with the FrameworkJobReport for those same logs.

I agree, for the production jobs. However, each of those two merge jobs read these files, and in their FJRs they report lumi sections different from the production FJRs.

ticoann commented 8 years ago

I believe you mistyped and meant 7591 and 7592 (since that is what is shown in the cmsRun log file).

Sorry, my mistake. Updated. :smile:

The most likely scenario to me is either 1) the production log file you pointed me to was not the one that created that file OR 2) the file isn't the file the system thinks it is

I am still confused how this could explain the difference between the fwjr for the unmerged file and the merge-job result for exactly the same file. (Maybe 2) does.)

This is weird: since the framework reads the lumi information directly from the file, both should be the same. (Unless there is a UUID collision and a file with the same name is created afterwards, which is unlikely since we do have the file with those lumis.)

Dr15Jones commented 8 years ago

On Nov 11, 2016, at 10:26 PM, Alan Malta Rodrigues notifications@github.com wrote:

No, we don't keep the unmerged files. Once they are merged, there is a cleanup job that comes in and thrown then into the trash bin.

I do not see any sign this is a framework problem as the log files produced by cmsRun match exactly with the FrameworkJobReport for those same logs.

I agree, for the production jobs. However, each of those two merge jobs read these files and in their fjr they report different lumi sections.

In all cases the log file output of cmsRun matches the framework job report made by that same cmsRun job. That backs the claim that the LuminosityBlock recorded in the file being read is what is in the log file.

Therefore the most likely conclusion is the file written by the production job is not actually the same file being read by the merge job.

Do you have the final merged files? In principle we could look at the values for the events in those 'same' LuminosityBlocks and see if they are actually the same events.

Chris
Dr15Jones commented 8 years ago

On Nov 11, 2016, at 10:27 PM, ticoann notifications@github.com wrote:

This is weird since framework reads Lumi information directly from file, both should be the same. (Unless there is uuid collision and same name of the file is create after which is unlikely since we have the file with those lumis).

Looking at the configuration file given to cmsRun, the framework is told to use a very generic name for the output file /store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root

Something in the workflow management system then renames the file. Therefore we actually have no idea if the file read by the merge job is actually the one produced by the production job.

Chris
Dr15Jones commented 8 years ago

The production file is not the same file as the one used in the merge job. The GUIDs do not match.

From the FJR of the production job we have

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/GEN-SIM/MCRUN2_71_V1-v1/130000/RAWSIMoutput.root
RAWSIMoutput.root
RAWSIMoutput
006427D9-5CA1-E611-9020-A0000420FE80

but from the FJR of the merge job we have

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root
source
A685DDD8-5CA1-E611-9020-A0000420FE80
LHEEventProduct_externalLHEProducer__SIM.
primaryFiles
PoolSource
323

The very first parts of the GUIDs do not match:

006427D9-5CA1-E611-9020-A0000420FE80
A685DDD8-5CA1-E611-9020-A0000420FE80

Chris
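Since the stageout convention embeds the GUID in the unmerged LFN's basename, the quickest consistency check is to compare that basename against the GUID recorded in the FJR. An illustrative helper (not WMCore code), using the real merge-job LFN from this thread:

```python
import os

def guid_from_lfn(lfn):
    """Return the basename of an LFN without its extension; under the
    stageout convention discussed here that should equal the GUID the
    FJR recorded for the file."""
    return os.path.splitext(os.path.basename(lfn))[0]

lfn = ("/store/unmerged/RunIISummer15wmLHEGS/"
       "ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/"
       "MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root")
guid = guid_from_lfn(lfn)
```

Comparing `guid` against the `<GUID>` element of each FJR separates the two similarly-named production outputs (006427D9-... versus A685DDD8-...) at a glance.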
ticoann commented 8 years ago

Hi Chris, I think you are looking at the wrong file. The GUID matches with the production job. (A685DDD8-5CA1-E611-9020-A0000420FE80)

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz

<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root</LFN>
<PFN>LHEoutput.root</PFN>
<Catalog></Catalog>
<ModuleLabel>LHEoutput</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>
<Branches>
  <Branch>LHEEventProduct_externalLHEProducer__SIM.</Branch>
</Branches>
<OutputModuleClass>PoolOutputModule</OutputModuleClass>
<TotalEvents>323</TotalEvents>

<DataType>MC</DataType>

<BranchHash>baad18a51a4cbab54f85db48954b2717</BranchHash>

<Runs>
<Run ID="1">
   <LumiSection ID="7499"/>
   <LumiSection ID="7500"/>
</Run>
ticoann commented 8 years ago

Looking at the configuration file given to cmsRun, the framework is told to use a very generic name for the output file /store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root

something in the workflow management system then renames the file. Therefore we actually have no idea if the file read by the merge job is actually the one produced by the production job.

Does that mean the file is actually created with a generic name and renamed later? The actual file name goes from LHEoutput to the GUID in the fwjr.

<ModuleLabel>LHEoutput</ModuleLabel>
<GUID>A685DDD8-5CA1-E611-9020-A0000420FE80</GUID>

Isn't that causing a race condition if two jobs put their files in the same directory? Is it the agent code which renames the file?

Dr15Jones commented 8 years ago

On Nov 11, 2016, at 11:02 PM, ticoann notifications@github.com wrote:

Hi Chris, I think you are looking at the wrong file. GUUI matches with Production job. (A685DDD8-5CA1-E611-9020-A0000420FE80)

http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_7371/1e19574a-a152-11e6-a6e4-02163e017c3c-749-0-logArchive.tar.gz

The confusion was that there are two files written by the production job, both with very similar GUIDs:

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root
LHEoutput.root
LHEoutput
A685DDD8-5CA1-E611-9020-A0000420FE80

and

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/GEN-SIM/MCRUN2_71_V1-v1/130000/RAWSIMoutput.root
RAWSIMoutput.root
RAWSIMoutput
006427D9-5CA1-E611-9020-A0000420FE80
ticoann commented 8 years ago

After talking to Chris and Eric:

I am guessing this is what happens (but I don't have concrete evidence). For production job 266543:

  1. The CMSSW framework copies the unmerged file to the following name (taken from the PSet config, which is controlled by the PSetTweaks code); the CMSSW framework job report is already created at this point:

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root

  2. WMCore code (?) renames the output by replacing LHEoutput with the GUID from the fwjr. (I still couldn't find the code doing this.)

In this case there could be a race condition: another job can write to the same path before step 2 happens:

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root

Basically, the file the first job wrote to that path is overwritten by the second job before it gets renamed, so the first unmerged output file is lost and the second unmerged output file is duplicated into two different files. In our case:

/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root

and

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/BEA7849C-FDA1-E611-B937-0CC47AD98B90.root

To fix this, the following patch might work: https://github.com/dmwm/WMCore/commit/84cfa939ee56733e8e6fa2b026ae69889b8b740e

But I still can't find where step 2 happens, which would affect the patch. @amaltaro, do you know where that code resides?
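The suspected race, two jobs staging out to the same generic name before a rename, is the classic shared-name hazard. A minimal sketch of the collision-free pattern (unique temporary name first, then an atomic rename to the GUID-based final name); this is illustrative only, under the assumptions of the hypothesis above, and is neither the actual WMCore stageout code nor the linked patch:

```python
import os
import shutil
import tempfile
import uuid

def safe_stageout(local_file, dest_dir, guid=None):
    """Copy a job output into shared storage without a window in which a
    second job can clobber a shared generic name: copy to a unique temp
    name inside dest_dir first, then atomically rename to <GUID>.root."""
    guid = guid or str(uuid.uuid4()).upper()
    final_path = os.path.join(dest_dir, guid + ".root")
    # unique temp name in the *destination* dir so the rename is a
    # same-filesystem move (atomic on POSIX)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    os.close(fd)
    shutil.copyfile(local_file, tmp_path)
    os.rename(tmp_path, final_path)
    return final_path

# usage sketch: stage a local LHEoutput.root under a job-specific GUID
work, dest = tempfile.mkdtemp(), tempfile.mkdtemp()
src = os.path.join(work, "LHEoutput.root")
with open(src, "wb") as f:
    f.write(b"fake ROOT payload")
staged = safe_stageout(src, dest, guid="A685DDD8-5CA1-E611-9020-A0000420FE80")
```

Because no job ever writes to a shared generic path, the "overwritten before rename" scenario cannot occur in this pattern.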

Dr15Jones commented 8 years ago

Just to clarify, the framework does not copy the unmerged file. The framework just writes the file in the local directory of the worker node. The copy is done by the workflow system.

hufnagel commented 8 years ago

The file

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root

never exists. It's a bogus entry in some intermediate reports that is then supplemented by the stageout code, which creates the real file on mass storage. Are you saying we actually wrote an LHEoutput.root file to mass storage???

ticoann commented 8 years ago

Chris told me that (/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root) is what the cmssw framework writes, not the unmerged file which exists (/store/mc/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/F08187C8-35A2-E611-ADBF-A0000420FE80.root).

@Dr15Jones, Chris, did I misunderstand what you told me?

Dr15Jones commented 8 years ago

The framework writes ./LHEoutput.root.

ticoann commented 8 years ago

But where? Which parameter is it using for the current directory?

amaltaro commented 8 years ago

I'm pretty sure ./LHEoutput.root is written in the pilot scratch area. When it's moved to the storage, it's moved with a GUID.

hufnagel commented 8 years ago

The GUID makes it into the report here

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/FileInfo.py#L72

This is also where we override the report to account for direct stageout to merged.

One comment: this assumes that the report already contains

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root

so either @Dr15Jones is wrong about the ./LHEoutput.root or we have yet another piece of wrapper code.
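As a minimal sketch of what that override is described as doing (the function name and details below are illustrative, not the actual WMCore code in FileInfo.py), the placeholder basename is swapped for the file's GUID:

```python
import os

def override_lfn(report_lfn, guid):
    """Swap the placeholder basename (e.g. LHEoutput.root) in the
    FJR LFN for the file's GUID. Illustrative sketch only."""
    return "%s/%s.root" % (os.path.dirname(report_lfn), guid)

placeholder = ("/store/unmerged/RunIISummer15wmLHEGS/"
               "ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/"
               "LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root")

# Pure string manipulation: only the basename changes.
print(override_lfn(placeholder, "F08187C8-35A2-E611-ADBF-A0000420FE80"))
```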

hufnagel commented 8 years ago

And yes, we write LHEoutput.root into the current working directory, but that doesn't necessarily mean that the FJR also reports ./LHEoutput.root.

ticoann commented 8 years ago

Hmm, then I am not sure how this happens. The pilot scratch area is unique to a job, right? Then where is the following parameter used? https://github.com/dmwm/WMCore/blob/master/src/python/PSetTweaks/WMTweak.py#L509

Dr15Jones commented 8 years ago

I believe the FJR reports the 'logical' file name as passed to it from the configuration.

hufnagel commented 8 years ago

Yep, the LFN reported by the FJR has NOTHING to do with where the job puts the file on the local filesystem. It's an LFN that only has meaning for the stageout as a remote directory.

ticoann commented 8 years ago

@hufnagel, so the PFN below is relative to the current working directory (the job's running directory).

<LFN>/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/LHEoutput.root</LFN>
<PFN>LHEoutput.root</PFN>

and stageout moves the file directly to the PFN below (meaning the file is renamed before it is copied, not after, correct?)

<PFN>/mnt/hadoop/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/A685DDD8-5CA1-E611-9020-A0000420FE80.root</PFN>
hufnagel commented 8 years ago

No. Current working directory is unknown, no idea what it is. We don't need to know.

/store/unmerged/RunIISummer15wmLHEGS/ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/LHE/MCRUN2_71_V1-v1/130000/

has no meaning whatsoever as far as the worker node file system is concerned. It's a string, nothing more.

That string + GUID IS the stageout location though.

Renamed is the wrong word here. These are all just string manipulations, the first time this corresponds to an actual path and file on a disk somewhere is for the stageout target.
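To illustrate the string-manipulation point above (the `/mnt/hadoop` prefix rule below is a made-up example of a site's LFN-to-PFN mapping, not real configuration):

```python
# The FJR "LFN" is just a directory-prefix carrier; combining that prefix
# with the GUID and a site-specific LFN-to-PFN rule yields the stageout
# target. All of it is string manipulation until the copy actually happens.

lfn_prefix = ("/store/unmerged/RunIISummer15wmLHEGS/"
              "ST_FCNC-TLL_Tleptonic_kappa_zut-MadGraph5-pythia8/"
              "LHE/MCRUN2_71_V1-v1/130000")
guid = "A685DDD8-5CA1-E611-9020-A0000420FE80"

final_lfn = "%s/%s.root" % (lfn_prefix, guid)

# Assumed simple prefix-style mapping for the example site:
target_pfn = "/mnt/hadoop" + final_lfn
print(target_pfn)
```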

hufnagel commented 8 years ago

I.e. the LFN from the CMSSW FJR does NOT correspond to an actual file anywhere; it's completely bogus. It's just a way for us to forward the UnmergedLFNBase from the workflow configuration to the FJR.

hufnagel commented 8 years ago

Not sure why this was implemented like this. Before my time. We could also have skipped the LFN completely and in

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/FileInfo.py#L72

constructed the LFN from scratch by doing UnmergedLFNBase+GUID.

ticoann commented 8 years ago

Two different pilots never run in the same workspace (directory). That is a correct statement, right? So now I don't know how this happened. 😞

hufnagel commented 8 years ago

Correct.

ericvaandering commented 8 years ago

I haven't had time to follow all this, but the LHEOutput.root thing smacks of cmsRun forking off a script to produce the LHE file or something like this. There might actually be code in cmssw that deals with this. IIRC these files are produced in a strange way and @vlimant worked on this.

hufnagel commented 8 years ago

Ok, so let me summarize:

1) We have the two processing jobs' logArchives and, according to the logs and FJRs, we wrote out different lumis.

2) We have the log and FJR of the merge job and, according to the log and FJR, it:

a) read in the two files from the jobs in 1), and
b) read in different lumis than what was written in 1).

Is that correct and all confirmed?
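For reference, the kind of check that surfaces these duplicates (the data shapes below are assumptions for illustration, mirroring the run:lumi-to-files map in the issue description) is just an inversion of the file-to-lumi map:

```python
from collections import defaultdict

def find_duplicate_lumis(file_lumis):
    """file_lumis: dict mapping LFN -> set of 'run:lumi' strings.
    Returns the run:lumi keys that appear in more than one file."""
    seen = defaultdict(list)
    for lfn, lumis in file_lumis.items():
        for run_lumi in lumis:
            seen[run_lumi].append(lfn)
    return {rl: lfns for rl, lfns in seen.items() if len(lfns) > 1}

# Truncated illustrative LFNs, matching the pattern reported above:
file_lumis = {
    ".../F08187C8-35A2-E611-ADBF-A0000420FE80.root": {"0:7591", "0:7592"},
    ".../D09C66C9-AFA2-E611-9A52-0CC47AA98A0E.root": {"0:7591", "0:7592"},
}
print(find_duplicate_lumis(file_lumis))
```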

hufnagel commented 8 years ago

Hm, so it could be a bug in that code which causes the lumis for one of the jobs in 1) to be reported wrong? Is that what you are saying, @ericvaandering?

Dr15Jones commented 8 years ago

The LHEOutput.root file is written by the standard PoolOutputModule. The file format you are thinking about ends with .lhe.

Dr15Jones commented 8 years ago

I believe there were two separate merge jobs. Both jobs report having merged the same 2 lumis.

hufnagel commented 8 years ago

> I believe there were two separate merge jobs. Both jobs report having merged the same 2 lumis.

Ok, but one of these merge jobs had as input an output file from one of the jobs in 1), and there is a match in file name but a mismatch in lumis?

If so, sometime between the job in 1) writing its output and the merge job in 2) reading its input, the content of the file changed. Which is a bit scary... Apart from the name, does other metadata match? Like the number of events written by the job in 1) and the number of events read by the merge job in 2)?
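A minimal sketch of that metadata cross-check (the dict shapes and truncated LFNs below are assumptions for illustration, not a real WMCore interface):

```python
def metadata_mismatches(written, read):
    """written/read: dicts mapping LFN -> {'events': int, 'lumis': set}.
    Returns the LFNs whose metadata changed between the processing
    step 1) and the merge step 2)."""
    bad = {}
    for lfn, meta in written.items():
        if lfn in read and read[lfn] != meta:
            bad[lfn] = (meta, read[lfn])
    return bad

# Same file name, same event count, but different lumis reported:
written = {".../A.root": {"events": 500, "lumis": {"0:7591"}}}
read    = {".../A.root": {"events": 500, "lumis": {"0:7592"}}}
print(metadata_mismatches(written, read))
```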