dmwm / CRABServer

16 stars 38 forks source link

Incorrect outfilelumis parameter #5061

Closed juztas closed 8 years ago

juztas commented 9 years ago

Is this known? http://submit-4.t2.ucsd.edu/CSstoragePath/77/uscms5930/151103_163809:wulsin_crab_candidateTrackProducer_SingleMuon_2015D_v3/postjob.237.2.txt

Tue, 03 Nov 2015 21:33:55 PST(+0000):DEBUG:PostJob Uploading file metadata for miniAODWithCandidateTracks.root to https://cmsweb.cern.ch/crabserver/prod/filemetadata: [('outlfn', '/store/user/wulsin/SingleMuon/Run2015D-05Oct2015-v1/151103_163809/0000/miniAODWithCandidateTracks_237.root'), ('outdatasetname', u'/SingleMuon/wulsin-Run2015D-05Oct2015-v1-648b8b56b17f056199dda35df48180f8/USER'), ('globalTag', 'None'), ('checksumadler32', u'55716676'), ('pandajobid', 237), ('checksumcksum', u'969482382'), ('outlocation', 'T3_US_OSU'), ('outtmplfn', '/store/temp/user/wulsin.cfa6996acb60007a8a99da762cf957b129a9baf9/SingleMuon/Run2015D-05Oct2015-v1/151103_163809/0000/miniAODWithCandidateTracks_237.root'), ('outsize', 2214152620), ('appver', 'CMSSW_7_4_12_patch4'), ('outtype', 'EDM'), ('checksummd5', 'asda'), ('publishdataname', u'Run2015D-05Oct2015-v1-648b8b56b17f056199dda35df48180f8'), ('taskname', '151103_163809:wulsin_crab_candidateTrackProducer_SingleMuon_2015D_v3'), ('acquisitionera', 'null'), ('outtmplocation', u'T2_US_UCSD'), ('events', 35404), ('directstageout', 0), ('outfileruns', '257394'), ('outfileruns', '256941'), ('outfilelumis', '41,42,43,44,45,56,57,68,69,70'), ('outfilelumis', '251,252,253,254,255,257,258,259,260,261,262,268,269,270,290,293,294'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/MINIAOD/05Oct2015-v1/30000/4407562C-956F-E511-BFEA-0025905A612C.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/0C7BB443-DD61-E511-80A5-02163E014691.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/D6F50C47-EC61-E511-A3E5-02163E0121C6.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/58881C19-FB61-E511-85CE-02163E0146E2.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/D6F50C47-EC61-E511-A3E5-02163E0121C6.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/257/394/00000/70A3C1CB-1864-E511-B0F5-02163E0142D8.root'), ('inparentlfns', '/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/257/394/00000/588E61BC-1E64-E511-B9B6-02163E014681.root')]
Tue, 03 Nov 2015 21:33:56 PST(+0000):INFO:PostJob ====== Finished upload of output files metadata.
Tue, 03 Nov 2015 21:33:56 PST(+0000):INFO:PostJob ====== Starting upload of input files metadata.
Tue, 03 Nov 2015 21:33:56 PST(+0000):DEBUG:PostJob Uploading file metadata for input file /store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/0C7BB443-DD61-E511-80A5-02163E014691.root
Tue, 03 Nov 2015 21:33:57 PST(+0000):DEBUG:PostJob Uploading file metadata for input file /store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/D6F50C47-EC61-E511-A3E5-02163E0121C6.root
Tue, 03 Nov 2015 21:33:58 PST(+0000):DEBUG:PostJob Uploading file metadata for input file /store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/58881C19-FB61-E511-85CE-02163E0146E2.root
Tue, 03 Nov 2015 21:34:00 PST(+0000):DEBUG:PostJob Uploading file metadata for input file /store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/D6F50C47-EC61-E511-A3E5-02163E0121C6.root
Tue, 03 Nov 2015 21:34:01 PST(+0000):DEBUG:PostJob Uploading file metadata for input file /store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root
Tue, 03 Nov 2015 21:34:02 PST(+0000):ERROR:PostJob Error uploading input file metadata: {'Content-Length': '714', 'X-Error-Http': '400', 'X-Rest-Time': '3165.007 us', 'Server': 'CherryPy/3.2.2', 'Connection': 'close', 'X-Error-Detail': 'Invalid input parameter', 'CMS-Server-Time': 'D=10812 t=1446615242112916', 'X-Rest-Status': '302', 'Date': 'Wed, 04 Nov 2015 05:34:02 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Error-Info': "Incorrect 'outfilelumis' parameter", 'X-Error-Id': '87b3287578c40a2034595e60fb892d83'}
Tue, 03 Nov 2015 21:34:08 PST(+0000):ERROR:PostJob Fatal error uploading input files metadata: 
Traceback (most recent call last):
  File "TaskWorker/Actions/PostJob.py", line 1552, in execute_internal
    self.upload_input_files_metadata()
  File "TaskWorker/Actions/PostJob.py", line 1826, in upload_input_files_metadata
    self.server.put(self.rest_uri_no_api + '/filemetadata', data = urllib.urlencode(configreq))
  File "RESTInteractions.py", line 75, in put
    return self.makeRequest(uri = uri, data = data, verb = 'PUT')
  File "RESTInteractions.py", line 113, in makeRequest
    capath=caCertPath)#, verbose=True)# for debug
  File "WMCore/Services/pycurl_manager.py", line 172, in request
    raise exc
HTTPException
Tue, 03 Nov 2015 21:34:08 PST(+0000):INFO:PostJob ====== Finished upload of input files metadata.
AndresTanasijczuk commented 9 years ago

The file for which this error occurred is a secondary input file (parent) in this workflow. config.Data.inputDataset = '/SingleMuon/Run2015D-05Oct2015-v1/MINIAOD' config.Data.useParent = True The primary input file is a MINIAOD file, which has 7 AOD secondary files (parents). This is the JSON framework job report for job 237 created by WMCore: http://submit-4.t2.ucsd.edu/CSstoragePath/77/uscms5930/151103_163809:wulsin_crab_candidateTrackProducer_SingleMuon_2015D_v3/job_fjr.237.0.json In particular you can see that the file /store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root has a run without lumis in the fjr: "257394": [ ], This is what is causing the problem. And this originates from the FrameworkJobReport-237.xml file (which you can get with crab getlog --jobids=237) where the section for this secondary input file looks like this:

<InputFile>
<State  Value="closed"/>
<LFN>/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root</LFN>
<PFN>root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root</PFN>
<Catalog></Catalog>
<ModuleLabel>source</ModuleLabel>
<GUID>948D60AB-0662-E511-B3CD-02163E011F3B</GUID>
<Branches>
  <Branch>BeamSpotOnlines_scalersRawToDigi__RECO.</Branch>
  <Branch>CSCDetIdCSCSegmentsOwnedRangeMap_cscSegments__RECO.</Branch>
  <Branch>CastorRecHitsSorted_castorreco__RECO.</Branch>
...
...
</Branches>
<InputType>secondaryFiles</InputType>
<InputSourceClass>PoolSource</InputSourceClass>
<EventsRead>3168</EventsRead>
<Runs>
<Run ID="256941">
   <LumiSection ID="290"/>
   <LumiSection ID="293"/>
   <LumiSection ID="294"/>
</Run>

<Run ID="257394">
</Run>

</Runs>
</InputFile>

So you see that already the XML FJR has

<Run ID="257394">
</Run>

</Runs>

Looking at the job log cmsRun-stdout-237.log:

03-Nov-2015 17:36:27 PST  Initiating request to open file /hadoop/cms/phedex/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root
03-Nov-2015 17:36:27 PST  Fallback request to file root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root
%MSG-w XrdAdaptor:  source 03-Nov-2015 17:36:28 PST PostEndLumi
Data is served from fnal.gov instead of original site T1_US_FNAL
%MSG
03-Nov-2015 17:36:29 PST  Successfully opened file root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root
%MSG-w XrdAdaptor:  (NoModuleName) 03-Nov-2015 17:36:34 PST pre-events
Data is now served from fnal.gov, T2_US_Caltech instead of previous fnal.gov
%MSG
Begin processing the 17601st record. Run 256941, Event 417710884, LumiSection 290 at 03-Nov-2015 17:36:59.232 PST
Begin processing the 17701st record. Run 256941, Event 417780029, LumiSection 290 at 03-Nov-2015 17:37:33.949 PST
Begin processing the 17801st record. Run 256941, Event 417116678, LumiSection 290 at 03-Nov-2015 17:38:00.011 PST
Begin processing the 17901st record. Run 256941, Event 417500426, LumiSection 290 at 03-Nov-2015 17:38:27.265 PST
Begin processing the 18001st record. Run 256941, Event 416970751, LumiSection 290 at 03-Nov-2015 17:38:46.704 PST
Begin processing the 18101st record. Run 256941, Event 417884924, LumiSection 290 at 03-Nov-2015 17:39:13.604 PST
Begin processing the 18201st record. Run 256941, Event 417048840, LumiSection 290 at 03-Nov-2015 17:39:54.129 PST
%MSG-w XrdAdaptor:  CandidateTrackProducer:candidateDisappearingTracks  03-Nov-2015 17:39:55 PST Run: 256941 Event: 416603042
Data is now served from T2_US_Caltech instead of previous fnal.gov, T2_US_Caltech
%MSG
Begin processing the 18301st record. Run 256941, Event 417970115, LumiSection 290 at 03-Nov-2015 17:40:00.745 PST
Begin processing the 18401st record. Run 256941, Event 417330267, LumiSection 290 at 03-Nov-2015 17:40:13.980 PST
Begin processing the 18501st record. Run 256941, Event 416676694, LumiSection 290 at 03-Nov-2015 17:40:40.923 PST
Begin processing the 18601st record. Run 256941, Event 416909621, LumiSection 290 at 03-Nov-2015 17:40:47.411 PST
Begin processing the 18701st record. Run 256941, Event 417567424, LumiSection 290 at 03-Nov-2015 17:41:02.993 PST
Begin processing the 18801st record. Run 256941, Event 421702436, LumiSection 293 at 03-Nov-2015 17:41:46.008 PST
Begin processing the 18901st record. Run 256941, Event 422304621, LumiSection 293 at 03-Nov-2015 17:41:52.090 PST
Begin processing the 19001st record. Run 256941, Event 421025327, LumiSection 293 at 03-Nov-2015 17:42:05.361 PST
Begin processing the 19101st record. Run 256941, Event 421388095, LumiSection 293 at 03-Nov-2015 17:42:32.163 PST
Begin processing the 19201st record. Run 256941, Event 421427786, LumiSection 293 at 03-Nov-2015 17:42:47.328 PST
Begin processing the 19301st record. Run 256941, Event 422114399, LumiSection 293 at 03-Nov-2015 17:42:54.701 PST
Begin processing the 19401st record. Run 256941, Event 421190350, LumiSection 293 at 03-Nov-2015 17:43:19.360 PST
Begin processing the 19501st record. Run 256941, Event 421796409, LumiSection 293 at 03-Nov-2015 17:43:34.137 PST
Begin processing the 19601st record. Run 256941, Event 421547969, LumiSection 293 at 03-Nov-2015 17:43:55.166 PST
Begin processing the 19701st record. Run 256941, Event 421326418, LumiSection 293 at 03-Nov-2015 17:44:07.231 PST
Begin processing the 19801st record. Run 256941, Event 421378754, LumiSection 293 at 03-Nov-2015 17:44:20.905 PST
Begin processing the 19901st record. Run 256941, Event 422144630, LumiSection 293 at 03-Nov-2015 17:44:37.550 PST
Begin processing the 20001st record. Run 256941, Event 422002557, LumiSection 293 at 03-Nov-2015 17:44:50.674 PST
Begin processing the 20101st record. Run 256941, Event 422423234, LumiSection 294 at 03-Nov-2015 17:45:03.778 PST
Begin processing the 20201st record. Run 256941, Event 422709230, LumiSection 294 at 03-Nov-2015 17:45:28.808 PST
Begin processing the 20301st record. Run 256941, Event 423123310, LumiSection 294 at 03-Nov-2015 17:45:34.454 PST
Begin processing the 20401st record. Run 256941, Event 422754621, LumiSection 294 at 03-Nov-2015 17:45:48.043 PST
Begin processing the 20501st record. Run 256941, Event 422667128, LumiSection 294 at 03-Nov-2015 17:46:15.868 PST
Begin processing the 20601st record. Run 256941, Event 422893407, LumiSection 294 at 03-Nov-2015 17:46:23.822 PST
Begin processing the 20701st record. Run 256941, Event 422856267, LumiSection 294 at 03-Nov-2015 17:46:36.990 PST
03-Nov-2015 17:46:51 PST  Closed file root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root

There is no indication that an event from run 257394 was processed for this file. Run 257394 was first processed in the next opened secondary input file:

03-Nov-2015 17:46:51 PST  Initiating request to open file /hadoop/cms/phedex/store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/257/394/00000/70A3C1CB-1864-E511-B0F5-02163E0142D8.root
03-Nov-2015 17:46:51 PST  Fallback request to file root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/257/394/00000/70A3C1CB-1864-E511-B0F5-02163E0142D8.root
%MSG-w XrdAdaptor:  source 03-Nov-2015 17:46:52 PST PostEndRun
Data is served from fnal.gov instead of original site T1_US_FNAL
%MSG
03-Nov-2015 17:46:53 PST  Successfully opened file root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/257/394/00000/70A3C1CB-1864-E511-B0F5-02163E0142D8.root
%MSG-w XrdAdaptor:  (NoModuleName) 03-Nov-2015 17:46:57 PST pre-events
Data is now served from fnal.gov, T2_US_Caltech instead of previous fnal.gov
%MSG
Begin processing the 20801st record. Run 257394, Event 82316858, LumiSection 56 at 03-Nov-2015 17:47:51.216 PST
...

Also, I opened the file in ROOT (TFile* f = TFile::Open("root://cmsxrootd.fnal.gov//store/data/Run2015D/SingleMuon/AOD/PromptReco-v3/000/256/941/00000/948D60AB-0662-E511-B3CD-02163E011F3B.root")) and inspected it with a TBrowser, and I only see run number 256941. I don't see any indication of run 257394 being in this file.

belforte commented 9 years ago

the only action I can imagie is to catch the invalid FJR in the code and report a more clear error rather then causing an http exception.

wulsin commented 9 years ago

Hi @belforte,

Can you recommend how to avoid this error and get the crab jobs to finish successfully? If there is not a known solution, then I would request that this issue be given higher priority, because it is preventing me from processing the full 2015D dataset.

In the latest task I submitted [1], 1116 jobs failed out of 8636 total jobs. In the first 5 log files I checked [2] they all failed with "Fatal error uploading input files metadata".

Thanks, Wells

[1] https://cmsweb.cern.ch/crabcache/logfile?name=151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015.log&username=wulsin

[2] https://cmsweb.cern.ch/scheddmon/095/cms1332/151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015/postjob.1.2.txt https://cmsweb.cern.ch/scheddmon/095/cms1332/151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015/postjob.2.2.txt https://cmsweb.cern.ch/scheddmon/095/cms1332/151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015/postjob.3.2.txt https://cmsweb.cern.ch/scheddmon/095/cms1332/151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015/postjob.4.2.txt https://cmsweb.cern.ch/scheddmon/095/cms1332/151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015/postjob.5.2.txt

belforte commented 9 years ago

@wulsin please do not use this tracker to report problems. Use hn-cms-computing-tools@cern.ch this issue was about possibly improving error report in an odd situation which is not the one you refer to.

AndresTanasijczuk commented 9 years ago

Mhhh... The problem @wulsin is having is indeed the same as described in this issue. I checked job 1 retry 0: the job_fjr.1.0.json file (https://cmsweb.cern.ch/scheddmon/095/cms1332/151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015/job_fjr.1.0.json) looks like this:

{

    "log_size": ​13636,
    "task": null,
    "skippedFiles": [ ],
    "fallbackFiles": 

[

    "/store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/674/00000/BCC8B2BE-F05E-E511-9FBD-02163E0136C7.root",
    "/store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/630/00000/861D83E4-215F-E511-BD3F-02163E014736.root"

],
"phedex_node": "T2_RU_IHEP",
"temp_storage_site": "T2_RU_IHEP",
"exitMsg": "OK",
"file_metadata_upload": true,
"storage_site": "T2_US_Purdue",
"aso_start_timestamp": ​1447976998,
"postjob": 
{

    "exitMsg": "Fatal error uploading input files metadata: ",
    "exitCode": ​1

},
"exitAcronym": "OK",
"executed_site": "T2_RU_IHEP",
"aso_start_time": "2015-11-20 02:49:58.941542",
"steps": 
{

    "cmsRun": 

{

    "status": ​1,
    "errors": [ ],
    "logs": { },
    "parameters": { },
    "performance": 

{

    "multicore": { },
    "storage": 

{

    "writeTotalMB": ​6.3392600000000003,
    "readPercentageOps": ​1.0132295719844358,
    "readAveragekB": ​11.886805970149252,
    "readTotalMB": ​14.777249999999999,
    "readNumOps": ​1285.0,
    "readCachePercentageOps": ​0.0,
    "readMBSec": ​0.00018716090562713165,
    "readMaxMSec": ​1243.6600000000001,
    "readTotalSecs": ​0,
    "writeTotalSecs": ​58491.100000000006

},
"cpu": 
{

    "TotalJobCPU": "41.2647",
    "EventThroughput": "0.565571",
    "AvgEventTime": "1.44739",
    "MaxEventTime": "36.1475",
    "TotalJobTime": "128.839",
    "TotalLoopCPU": "31.0233",
    "MinEventTime": "0.186069"

},
"memory": 

    {
        "PeakValueRss": "687.59",
        "PeakValueVsize": "1259.07"
    }

},
"stop": null,
"site": { },
"analysis": { },
"start": null,
"cleanup": { },
"input": 
{

    "source": 

[

{

    "runs": 

    {
        "256630": [ ]
    },
    "input_source_class": "PoolSource",
    "input_type": "secondaryFiles",
    "lfn": "/store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/674/00000/BCC8B2BE-F05E-E511-9FBD-02163E0136C7.root",
    "pfn": "root://xrootd.ba.infn.it//store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/674/00000/BCC8B2BE-F05E-E511-9FBD-02163E0136C7.root",
    "catalog": "",
    "module_label": "source",
    "guid": "BCC8B2BE-F05E-E511-9FBD-02163E0136C7",
    "events": ​0

},
{

    "runs": 

{

    "256630": 

        [
            ​5,
            ​6
        ]
    },
    "input_source_class": "PoolSource",
    "input_type": "primaryFiles",
    "lfn": "/store/data/Run2015D/MET/MINIAOD/05Oct2015-v1/60000/64906591-866F-E511-8154-0025905A606A.root",
    "pfn": "dcap://dp0015.m45.ihep.su:22125/pnfs/m45.ihep.su/data/cms/store/data/Run2015D/MET/MINIAOD/05Oct2015-v1/60000/64906591-866F-E511-8154-0025905A606A.root",
    "catalog": "",
    "module_label": "source",
    "guid": "64906591-866F-E511-8154-0025905A606A",
    "events": ​67

},
{

    "runs": 

{

    "256630": 

                [
                    ​5,
                    ​6
                ]
            },
            "input_source_class": "PoolSource",
            "input_type": "secondaryFiles",
            "lfn": "/store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/630/00000/861D83E4-215F-E511-BD3F-02163E014736.root",
            "pfn": "root://xrootd.ba.infn.it//store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/630/00000/861D83E4-215F-E511-BD3F-02163E014736.root",
            "catalog": "",
            "module_label": "source",
            "guid": "861D83E4-215F-E511-BD3F-02163E014736",
            "events": ​67
        }
    ]

},
"output": 
{

    "analysis": [ ],
    "MINIAODSIMoutput": 

[

{

    "runs": 

{

    "256630": 

    [
        ​5,
        ​6
    ]

},
"lfn": "",
"ouput_module_class": "PoolOutputModule",
"branch_hash": "3ed29ba4cacaefb5c9e389e4fbaf9fc6",
"inputpfns": 
[

    "dcap://dp0015.m45.ihep.su:22125/pnfs/m45.ihep.su/data/cms/store/data/Run2015D/MET/MINIAOD/05Oct2015-v1/60000/64906591-866F-E511-8154-0025905A606A.root",
    "root://xrootd.ba.infn.it//store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/674/00000/BCC8B2BE-F05E-E511-9FBD-02163E0136C7.root",
    "root://xrootd.ba.infn.it//store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/630/00000/861D83E4-215F-E511-BD3F-02163E014736.root"

],
"temp_storage_site": "T2_RU_IHEP",
"pset_hash": "5a015319cafd02fb6d843caeb386b745",
"pfn": "miniAODWithCandidateTracks.root",
"catalog": "",
"module_label": "MINIAODSIMoutput",
"local_stageout": true,
"checksums": 
{

    "adler32": "9bf23c5e",
    "cksum": "1265999872"

},
"guid": "2E7B4DE9-178F-E511-A0E6-00259081ED06",
"input": 

                        [
                            "/store/data/Run2015D/MET/MINIAOD/05Oct2015-v1/60000/64906591-866F-E511-8154-0025905A606A.root",
                            "/store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/674/00000/BCC8B2BE-F05E-E511-9FBD-02163E0136C7.root",
                            "/store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/630/00000/861D83E4-215F-E511-BD3F-02163E014736.root"
                        ],
                        "storage_site": "T2_US_Purdue",
                        "events": ​67,
                        "size": ​6647111
                    }
                ]
            }
        }
    },
    "jobExitCode": ​0,
    "local_stageout": true,
    "exitCode": ​0

}

So the problematic file is this parent file: /store/data/Run2015D/MET/AOD/PromptReco-v3/000/256/674/00000/BCC8B2BE-F05E-E511-9FBD-02163E0136C7.root Link to file on DAS: https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=%2Fstore%2Fdata%2FRun2015D%2FMET%2FAOD%2FPromptReco-v3%2F000%2F256%2F674%2F00000%2FBCC8B2BE-F05E-E511-9FBD-02163E0136C7.root One can see that this file doesn't contain run 256630 (the run this job had to analyze), but only run 256674.

I may have a hint about what is causing this problem. When CRAB includes the parents of a given input file as secondary input files, it is not taking into account the lumis that a given job has to analyze in the input file. For example, suppose job 1 has to analyze a primary input file p1.root which contains two lumis [1, 2] and has two parent files s1.root and s2.root which contain lumis 1 and 2 respectively. Then CRAB will add both parent files to the list of input files even if job 1 has to analyze only lumi 1. https://github.com/dmwm/CRABServer/blob/3.3.1511.rc1/src/python/TaskWorker/Actions/DagmanCreator.py#L470-L477 Maybe the example I gave is too naive; maybe the problem appears only when lumis 1 and 2 would belong to different runs.

belforte commented 9 years ago

thanks @AndresTanasijczuk Indeed I got also convinced that this is some bug on CRAB side, not a corrupted FJR as we thought initially. It may help if postjob would print what it uses at outfilelumis, but IIUC the point is that at times there can be legitimate reason for one input file not to contribute any lumi to the output. I can't say more w/o a full understanding of what the code exactly tries to do. But at first sight we should be able to simply ignore that file.

belforte commented 9 years ago

as additional documentation, before logs are rotate, here is the relevant part of the crabserver log about the failed upload:

[20/Nov/2015:11:54:30]  RESTSQL:laEIDqxLMyHW execute: () {'outlfn': '/store/user/wulsin/MET/Run2015D-05Oct2015-v1/151119_102004/0000/miniAODWithCandidateTracks_1.root', 'taskname': '151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015'}
[20/Nov/2015:11:54:30]  RESTSQL:laEIDqxLMyHW prepare [UPDATE filemetadata SET fmd_tmp_location = :outtmplocation, fmd_size = :outsize, fmd_creation_time = SYS_EXTRACT_UTC(SYSTIMESTAMP), fmd_tmplfn = :outtmplfn                     WHERE tm_taskname = :taskname AND fmd_lfn = :outlfn]
[20/Nov/2015:11:54:30]  RESTSQL:laEIDqxLMyHW executemany: ([{'outsize': '6647111', 'outtmplfn': '/store/user/wulsin/MET/Run2015D-05Oct2015-v1/151119_102004/0000/miniAODWithCandidateTracks_1.root', 'taskname': '151119_102004:wulsin_crab_candidateTrackProducer_MET_2015D_05Oct2015', 'outtmplocation': 'T2_CH_CERN', 'outlfn': '/store/user/wulsin/MET/Run2015D-05Oct2015-v1/151119_102004/0000/miniAODWithCandidateTracks_1.root'}],) {}
[20/Nov/2015:11:54:30]  RESTSQL:laEIDqxLMyHW commit
[20/Nov/2015:11:54:30] vocms0165.cern.ch 128.142.176.228 "PUT /crabserver/prod/filemetadata HTTP/1.1" 200 OK [data: 3462 in 55 out 79901 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=wulsin/CN=729510/CN=Wells Wulsin" "" ] [ref: "" "CRABClient/0.0.0" ]
[20/Nov/2015:11:54:30]  RESTSQL:laEIDqxLMyHW release with rollback
[20/Nov/2015:11:54:30]  RESTSQL:laEIDqxLMyHW RELEASED cms_analysis_reqmgr_w@cmsr timeout=300 inuse=0 idle=9
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI ENTER cms_analysis_reqmgr_w@cmsr CRABInterface.RESTBaseAPI.RESTBaseAPI (PUT prod filemetadata) inuse=0 idle=9
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI (previously laEIDqxLMyHW) connected, client: 11.2.0.4.0, server: 11.2.0.4.0, stmtcache: 50
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI ping
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI check [select sysdate from dual]
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI connection established
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI prepare [SELECT panda_job_id from filemetadata WHERE tm_taskname = :taskname AND fmd_lfn = :outlfn]
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI execute: () {'outlfn': '/store/data/Run2015D/SingleElectron/AOD/PromptReco-v3/000/257/969/00000/82618465-B469-E511-BEA6-02163E0133A4.root_4640', 'taskname': '151119_102040:wulsin_crab_candidateTrackProducer_SingleElectron_2015D_05Oct2015'}
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI prepare [INSERT INTO filemetadata (                tm_taskname, panda_job_id, fmd_outdataset, fmd_acq_era, fmd_sw_ver, fmd_in_events, fmd_global_tag,               fmd_publish_name, fmd_location, fmd_tmp_location, fmd_runlumi, fmd_adler32, fmd_cksum, fmd_md5, fmd_lfn, fmd_size,               fmd_type, fmd_parent, fmd_creation_time, fmd_filestate, fmd_direct_stageout, fmd_tmplfn)                VALUES (:taskname, :pandajobid, :outdatasetname, :acquisitionera, :appver, :events, :globalTag,                       :publishdataname, :outlocation, :outtmplocation, :runlumi, :checksumadler32, :checksumcksum, :checksummd5, :outlfn, :outsize,                       :outtype, :inparentlfns, SYS_EXTRACT_UTC(SYSTIMESTAMP), :filestate, :directstageout, :outtmplfn)]
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI executemany: ([{'filestate': 'None', 'outlfn': '/store/data/Run2015D/SingleElectron/AOD/PromptReco-v3/000/257/969/00000/82618465-B469-E511-BEA6-02163E0133A4.root_4640', 'acquisitionera': 'null', 'outlocation': 'T2_US_Purdue', 'globalTag': 'None', 'outsize': '0', 'checksumcksum': '0', 'runlumi': '{}', 'events': '0', 'outtmplfn': '/store/data/Run2015D/SingleElectron/AOD/PromptReco-v3/000/257/969/00000/82618465-B469-E511-BEA6-02163E0133A4.root_4640', 'pandajobid': '4640', 'appver': 'CMSSW_7_4_12_patch4', 'checksumadler32': '0', 'checksummd5': '0', 'inparentlfns': '[]', 'publishdataname': 'Run2015D-05Oct2015-v1-00000000000000000000000000000000', 'outtype': 'POOLIN', 'outdatasetname': '/FakeDataset/fakefile-FakePublish-5b6a581e4ddd41b130711a045d5fecb9/USER', 'outtmplocation': 'T2_UK_London_IC', 'taskname': '151119_102040:wulsin_crab_candidateTrackProducer_SingleElectron_2015D_05Oct2015', 'directstageout': 'F'}],) {}
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI commit
[20/Nov/2015:11:54:31] vocms0165.cern.ch 169.228.38.38 "PUT /crabserver/prod/filemetadata HTTP/1.1" 200 OK [data: 3004 in 55 out 65274 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=wulsin/CN=729510/CN=Wells Wulsin" "" ] [ref: "" "CRABClient/0.0.0" ]
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI release with rollback
[20/Nov/2015:11:54:31]  RESTSQL:fobSGCFvgHOI RELEASED cms_analysis_reqmgr_w@cmsr timeout=300 inuse=0 idle=9
[20/Nov/2015:11:54:31]  SERVER REST ERROR WMCore.REST.Error.InvalidParameter fbf2f4e54b6cadeb012ea2b3bd53b4c1 (Invalid input parameter)
[20/Nov/2015:11:54:31]    Traceback (most recent call last):
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/WMCore/REST/Server.py", line 697, in default
[20/Nov/2015:11:54:31]        return self._call(RESTArgs(list(args), kwargs))
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/WMCore/REST/Server.py", line 768, in _call
[20/Nov/2015:11:54:31]        v(apiobj, request.method, api, param, safe)
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/CRABInterface/RESTFileMetadata.py", line 30, in validate
[20/Nov/2015:11:54:31]        validate_strlist("outfilelumis", param, safe, RX_LUMILIST)
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/WMCore/REST/Validation.py", line 217, in validate_strlist
[20/Nov/2015:11:54:31]        _validate_all(argname, param, safe, _check_str, rx, custom_err)
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/WMCore/REST/Validation.py", line 89, in _validate_all
[20/Nov/2015:11:54:31]        _arglist(argname, param.kwargs))
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/WMCore/REST/Validation.py", line 88, in <lambda>
[20/Nov/2015:11:54:31]        safe.kwargs[argname] = map(lambda v: checker(argname, v, *args),
[20/Nov/2015:11:54:31]      File "/data/srv/beHG1511c/sw.pre/slc6_amd64_gcc481/cms/crabserver/3.3.1511.rc1/lib/python2.6/site-packages/WMCore/REST/Validation.py", line 38, in _check_str
[20/Nov/2015:11:54:31]        raise InvalidParameter(return_message("Incorrect '%s' parameter" % argname, custom_err))
[20/Nov/2015:11:54:31]    InvalidParameter
[20/Nov/2015:11:54:31] vocms0165.cern.ch 128.142.176.228 "PUT /crabserver/prod/filemetadata HTTP/1.1" 400 Bad Request [data: 2989 in 714 out 3117 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=wulsin/CN=729510/CN=Wells Wulsin" "" ] [ref: "" "CRABClient/0.0.0" ]
[20/Nov/2015:11:54:31]  RESTSQL:iblUSFEdAmts ENTER cms_analysis_reqmgr_w@cmsr CRABInterface.RESTBaseAPI.RESTBaseAPI (PUT prod filemetadata) inuse=0 idle=9
[20/Nov/2015:11:54:31]  RESTSQL:iblUSFEdAmts (previously fobSGCFvgHOI) connected, client: 11.2.0.4.0, server: 11.2.0.4.0, stmtcache: 50
AndresTanasijczuk commented 9 years ago

Yes, we can print the filemetadata content for the input files. We do it only for the log and output files, but not for the input files (maybe we thought this would be very verbose and unnecessary; there can be many input files in a given job). As per the code, I think we have to drop from this list https://github.com/dmwm/CRABServer/blob/3.3.1511.rc1/src/python/TaskWorker/Actions/DagmanCreator.py#L474 the parent files whose run/lumi content intersected with https://github.com/dmwm/CRABServer/blob/3.3.1511.rc1/src/python/TaskWorker/Actions/DagmanCreator.py#L480 is null.

belforte commented 9 years ago

it may be enough to print metadata content only when upload fails, as a general rule. I can;t look at code until next week.

AndresTanasijczuk commented 9 years ago

I tried to reproduce this problem running a very simple task with this CRAB configuration:

from CRABClient.UserUtilities import config
config = config()
config.General.requestName = 'test_parents_2'
config.General.transferOutputs = True
config.General.transferLogs = True
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_tutorial_analysis.py'
config.Data.inputDataset = '/SingleMu/Run2012C-24Aug2012-v1/AOD'
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 1
config.Data.useParent = True
config.Data.lumiMask = 'mystupidlumimask.txt'
config.Data.outLFNDirBase = '/store/user/atanasi/trash'
config.Data.publication = True
config.Data.outputDatasetTag = config.General.requestName
config.Site.storageSite = 'T2_US_Nebraska'

where mystupidlumimask.txt is

{"198271": [[749, 750]]}

So the task has two jobs, both analyzing only one lumi in this input file:

/store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root

The run/lumi content of this file is:

Lumi: [43, 44]
Run number: 198041 
Lumi: [[749, 752], [761, 762], [772, 773]]
Run number: 198271
Lumi: [[428, 428], [1148, 1148]]
Run number: 198487

Job 1 analyzed Run number: 198271 Lumi: 749 Job 2 analyzed Run number: 198271 Lumi: 750 This file has 5 parent files:

Parent: /store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root
Parent: /store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root
Parent: /store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root
Parent: /store/data/Run2012C/SingleMu/RAW/v1/000/198/487/CC7D0CA0-71C8-E111-AD49-5404A6388698.root
Parent: /store/data/Run2012C/SingleMu/RAW/v1/000/198/487/F04935CF-97C8-E111-ACAE-001D09F26509.root

The lumis {"198271": [749, 750]} are in the third parent file.

I ran one task in production (http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=crab&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=151126_131425%3Aatanasi_crab_test_parents_2), and although it didn't fail, one can see that the FJR JSON files include the first three parent files and the "runs" information is not correct: For job 1: Relevant part of job_out.1.0.txt:

======== CMSSW OUTPUT STARTING ========
== CMSSW:  Beginning CMSSW wrapper script
== CMSSW:   slc6_amd64_gcc491 scramv1 CMSSW
== CMSSW:  Performing SCRAM setup...
== CMSSW:  Completed SCRAM setup
== CMSSW:  Retrieving SCRAM project...
== CMSSW:  Untarring  /storage/local/data1/condor/execute/dir_13841/glide_k4ZYdO/execute/dir_18352/sandbox.tar.gz
== CMSSW:  Completed SCRAM project
== CMSSW:  Executing CMSSW
== CMSSW:  cmsRun  -j FrameworkJobReport.xml PSet.py
== CMSSW:  26-Nov-2015 07:23:08 CST  Initiating request to open file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root
== CMSSW:  151126 07:23:09 19079 Xrd: CheckErrorStatus: Server [cmssrmdisk.fnal.gov:1095] declared: Request 3018 not supported(error code: 3013)
== CMSSW:  26-Nov-2015 07:23:20 CST  Successfully opened file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root
== CMSSW:  26-Nov-2015 07:23:28 CST  Initiating request to open file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root
== CMSSW:  26-Nov-2015 07:27:05 CST  Successfully opened file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root
== CMSSW:  151126 07:28:38 19106 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
== CMSSW:  26-Nov-2015 07:29:00 CST  Closed file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root
== CMSSW:  26-Nov-2015 07:29:00 CST  Initiating request to open file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root
== CMSSW:  26-Nov-2015 07:32:02 CST  Successfully opened file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root
== CMSSW:  26-Nov-2015 07:35:14 CST  Closed file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root
== CMSSW:  26-Nov-2015 07:35:14 CST  Initiating request to open file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root
== CMSSW:  26-Nov-2015 07:38:17 CST  Successfully opened file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root
== CMSSW:  151126 07:40:39 20513 Xrd: XrdClientMessage::ReadRaw: Failed to read header (8 bytes).
== CMSSW:  Begin processing the 1st record. Run 198271, Event 854151174, LumiSection 749 at 26-Nov-2015 07:41:18.912 CST
== CMSSW:  Begin processing the 2nd record. Run 198271, Event 854120334, LumiSection 749 at 26-Nov-2015 07:41:19.049 CST
== CMSSW:  Begin processing the 3rd record. Run 198271, Event 854142206, LumiSection 749 at 26-Nov-2015 07:41:19.060 CST
== CMSSW:  Begin processing the 4th record. Run 198271, Event 854133518, LumiSection 749 at 26-Nov-2015 07:41:19.065 CST
== CMSSW:  Begin processing the 5th record. Run 198271, Event 854154926, LumiSection 749 at 26-Nov-2015 07:41:19.072 CST
== CMSSW:  Begin processing the 6th record. Run 198271, Event 854186182, LumiSection 749 at 26-Nov-2015 07:41:19.078 CST
== CMSSW:  Begin processing the 7th record. Run 198271, Event 854173070, LumiSection 749 at 26-Nov-2015 07:41:19.083 CST
== CMSSW:  Begin processing the 8th record. Run 198271, Event 854210822, LumiSection 749 at 26-Nov-2015 07:41:19.089 CST
== CMSSW:  Begin processing the 9th record. Run 198271, Event 854188582, LumiSection 749 at 26-Nov-2015 07:41:19.094 CST
== CMSSW:  Begin processing the 10th record. Run 198271, Event 854209638, LumiSection 749 at 26-Nov-2015 07:41:19.100 CST
...
== CMSSW:  Begin processing the 919th record. Run 198271, Event 855088403, LumiSection 749 at 26-Nov-2015 07:41:48.109 CST
== CMSSW:  26-Nov-2015 07:41:48 CST  Closed file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root
== CMSSW:  
== CMSSW:  TrigReport ---------- Event  Summary ------------
== CMSSW:  TrigReport Events total = 919 passed = 919 failed = 0
== CMSSW:  
...
== CMSSW:  26-Nov-2015 07:41:50 CST  Closed file root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root
== CMSSW:  TimeReport> Time report complete in 770.573 seconds
...
== CMSSW:  process id is 19079 status is 0
======== CMSSW OUTPUT FINSHING ========

You can see that cmsRun opened/closed the parent files in order until it found the run/lumi that it had to analyze. Relevant part of job_fjr.1.0.json:

...
{"cmsRun": {
"analysis": {},
"cleanup": {},
"errors": [],
"input": {"source": [
{"catalog": "", "events": 0, "guid": "D48DD19E-0DC4-E111-A80C-BCAEC518FF8D", "input_source_class": "PoolSource", "input_type": "secondaryFiles", "lfn": "/store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root", "module_label": "source", "pfn": "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root", "runs": {"198271": [749]}},   <----- This is not correct; this secondary file does not contain this run/lumi.
{"catalog": "", "events": 0, "guid": "A60FA59A-60C6-E111-83EA-5404A63886BD", "input_source_class": "PoolSource", "input_type": "secondaryFiles", "lfn": "/store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root", "module_label": "source", "pfn": "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root", "runs": {}},   <----- Empty "runs" dictionary; this is acceptable.
{"catalog": "", "events": 919, "guid": "003B900F-E9EE-E111-834F-00259073E3A0", "input_source_class": "PoolSource", "input_type": "primaryFiles", "lfn": "/store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root", "module_label": "source", "pfn": "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root", "runs": {"198271": [749]}},
{"catalog": "", "events": 919, "guid": "E2EEB799-60C6-E111-B84C-0025901D631E", "input_source_class": "PoolSource", "input_type": "secondaryFiles", "lfn": "/store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root", "module_label": "source", "pfn": "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root", "runs": {}}   <----- This parent file contains the run/lumi {"198271": [749]}, so why it was not written here?
]},
"logs": {},
"output": {"analysis": [], "output1": [
{"branch_hash": "536a5ef195df9f8c657c867ef96c6dfa", "catalog": "", "checksums": {"adler32": "cf36c20", "cksum": "1252083290"}, "events": 919, "guid": "D6D238A7-4194-E511-BAD6-00261894396D", "input": ["/store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root", "/store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root", "/store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root", "/store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root"], "inputpfns": ["root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root", "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/041/D48DD19E-0DC4-E111-A80C-BCAEC518FF8D.root", "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/A60FA59A-60C6-E111-83EA-5404A63886BD.root", "root://cmsxrootd-site.fnal.gov//store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root"], "lfn": "", "local_stageout": true, "module_label": "output1", "ouput_module_class": "PoolOutputModule", "pfn": "output1.root", "pset_hash": "7d4851d47bb073facc0d96017716ee10", "runs": {"198271": [749]}, "size": 38138389, "storage_site": "T2_US_Nebraska", "temp_storage_site": "T1_US_FNAL_Disk"}
]},
...

The parent files that were opened/closed without analyzing any lumi were anyway written into the FJR and with the wrong "runs" information. But in this case the upload of the input files metadata didn't fail, because the "runs" information didn't happen to be structurally incorrect as it did happen for the two users in this issue. Another consequence of having parent files in the FJR for which no lumis were analyzed is that 'crab report' will count these files when reporting how many files have been processed:

4 files have been processed   <----- Should be 2 files (the primary file and the parent file; both jobs analyzed the same two files)
3794 events have been read    <----- This number is correct, because the "extra" parents files have {"events": 0} in the FJR.
1897 events have been written
Analyzed luminosity sections written to /afs/cern.ch/work/a/atanasi/crab3/CMSSW_7_3_5_patch2/src/crab_test_parents_2/results/lumiSummary.json
Log file is /afs/cern.ch/work/a/atanasi/crab3/CMSSW_7_3_5_patch2/src/crab_test_parents_2/crab.log

I ran this task with CMSSW_7_3_5_patch2. I ran a similar task with CMSSW_7_4_12_patch4 obtaining the same result.

Given this test, I am pretty much convinced that the solution is to modify the list of input files that CRAB assigns to each job in a way to not include the secondary files that have no lumis to be analyzed. In my example above this translates into removing the first two parent files from the list of input files to the two jobs, leaving only these two input files:

primary input file: "/store/data/Run2012C/SingleMu/AOD/24Aug2012-v1/00000/003B900F-E9EE-E111-834F-00259073E3A0.root"
secondary input (parent) file: "/store/data/Run2012C/SingleMu/RAW/v1/000/198/271/E2EEB799-60C6-E111-B84C-0025901D631E.root"

It could be that a fix in CMSSW is also needed, but I still think we should not put secondary files that have no lumis to analyze in the list of input files of the jobs.

wmtan commented 8 years ago
I just finished deciphering the FJR code.

Unfortunately, the current job report code does not guarantee the accuracy of run or lumi information from secondary input files.

What happens is that whenever a run or a lumi is read from the *primary* input file, the FJR associates the run or lumi with every secondary file that is open at the time.

There is no check whatever that the reported run or lumi is actually in the secondary file. Furthermore, there is no guarantee from the FJR that the secondary file containing the run or lumi is actually open when the run or lumi is read from the primary file.

So, the FJR for a secondary file could, in principle, contain runs or lumis not in the file, and/or not contain runs or lumis that are in the file.

So, as Chris Jones said, we have two alternatives:

1) Don't report run or lumi information for secondary files.

2) Change the reporting of runs and lumis for secondary files so that they are reported when the run or lumi is actually read from that secondary file, rather than from the primary file.

The report call would need to specify the file to which the run/lumi is associated.

Neither is trivial, but 2) is quite a bit more work than 1).

Rough estimate of coding time (not including testing).

1) 0.5 days

2) 3 days (could be more) 
belforte commented 8 years ago

Thanks @wmtan . But since that is being debated in: https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3604.html

no need to repost here. We can simply put the conclusion (which appears we reached by now) from that HN thread.

belforte commented 8 years ago

I changed to a BUG since in light of what @wmtan wrote, we are currently using unreliable information from FJR. No use in second guessing it, we should pitch that info. @AndresTanasijczuk I assigned to you since you already started to work on this, if you do not have the time to complete before leaving, we'll reassign next year.