dmwm / CRABServer

16 stars 38 forks source link

files from same tasks are published in different datasets #5603

Closed belforte closed 7 years ago

belforte commented 7 years ago

see https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/3334/1/2/1.html

is not due to resubmissions job 10 for that task was OK but e.g. 3969 indicates a different has in the output dataset destination in the postjob :-(

belforte commented 7 years ago

or maybe I am confused by two output datasets. Need to be very careful here.

belforte commented 7 years ago

so far:

belforte commented 7 years ago

examples from PostJob.log by the way the task is 171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02

https://cmsweb.cern.ch/crabserver/ui/task/171031_160831%3Asantocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02

GOOD JOB (#10): https://cmsweb.cern.ch/scheddmon/0120/cms574/171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/postjob.10.0.txt

Wed, 01 Nov 2017 00:45:28 CET(+0000):INFO:PostJob ====== Finished to check uploads to ASO database.
Wed, 01 Nov 2017 00:45:28 CET(+0000):INFO:PostJob Defering the execution of the post-job.
Ended TaskManagerBootstrap with code 4
Wed, 01 Nov 2017 01:15:28 CET(+0000):DEBUG:PostJob (Parsed job ad.)
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob Using the cache since it is up to date (last_query=1509495232.29) and it is after we submitted the transfer (aso_start_timestamp=1509493528)
Wed, 01 Nov 2017 01:15:28 CET(+0000):DEBUG:PostJob Using cached ASO results.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob Got statuses: done, done, done; 0.5 hours since transfer submit.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob All transfers finished.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob ====== Finished to monitor ASO transfers.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob ====== Finished to check for ASO transfers.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob ====== Starting upload of output files metadata.
Wed, 01 Nov 2017 01:15:28 CET(+0000):DEBUG:PostJob Uploading output metadata for EXO-RunIISummer15wmLHEGS-01667.root to https://cmsweb.cern.ch/crabserver/prod/filemetadata: [('outlfn', '/store/user/santocch/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_3969.root'), ('outdatasetname', u'/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/santocch-crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_RAWSIMoutput-4cf19b8891b49f6b5b98c3052289aa17/USER'), ('globalTag', 'None'), ('checksumadler32', u'30a7acfd'), ('outsize', 720358813), ('checksumcksum', u'3723168711'), ('outlocation', 'T2_IT_Bari'), ('jobid', '3969'), ('outtmplfn', '/store/temp/user/santocch.68589efd54e6b2f504806adbb10bd291f89ff9c4/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_3969.root'), ('appver', 'CMSSW_7_1_25_patch2'), ('outtype', 'EDM'), ('checksummd5', 'asda'), ('publishdataname', u'crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_RAWSIMoutput-4cf19b8891b49f6b5b98c3052289aa17'), ('taskname', '171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02'), ('acquisitionera', 'null'), ('outtmplocation', u'T2_RU_JINR'), ('events', 1235), ('directstageout', 0), ('outfileruns', '1'), ('outfilelumis', '198422:None,198420:None,198450:None,198401:None,198402:None,198423:None,198435:None,198434:None,198437:None,198436:None,198431:None,198430:None,198419:None,198418:None,198417:None,198416:None,198415:None,198414:None,198413:None,198412:None,198411:None,198410:None,198433:None,198432:None,198403:None,198421:None,198444:None,198445:None,198446:None,198447:None,198440:None,198441:None,198442:None,198443:None,198439:None,198448:None,198449:None,198438:None,198426:None,198427:None,198424:None,198425:None,198404:None,198405:None,198406:None,198407:None,198408:None,198409:None,198428:None,198429:None')]
Wed, 01 Nov 2017 01:15:29 CET(+0000):DEBUG:PostJob Uploading output metadata for EXO-RunIISummer15wmLHEGS-01667_inLHE.root to https://cmsweb.cern.ch/crabserver/prod/filemetadata: [('outlfn', '/store/user/santocch/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_inLHE_3969.root'), ('outdatasetname', u'/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/santocch-crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_LHEoutput-4cf19b8891b49f6b5b98c3052289aa17/USER'), ('globalTag', 'None'), ('checksumadler32', u'1b511b14'), ('outsize', 5442368), ('checksumcksum', u'406707745'), ('outlocation', 'T2_IT_Bari'), ('jobid', '3969'), ('outtmplfn', '/store/temp/user/santocch.68589efd54e6b2f504806adbb10bd291f89ff9c4/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_inLHE_3969.root'), ('appver', 'CMSSW_7_1_25_patch2'), ('outtype', 'EDM'), ('checksummd5', 'asda'), ('publishdataname', u'crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_LHEoutput-4cf19b8891b49f6b5b98c3052289aa17'), ('taskname', '171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02'), ('acquisitionera', 'null'), ('outtmplocation', u'T2_RU_JINR'), ('events', 5000), ('directstageout', 0), ('outfileruns', '1'), ('outfilelumis', '198422:None,198420:None,198450:None,198401:None,198402:None,198423:None,198435:None,198434:None,198437:None,198436:None,198431:None,198430:None,198419:None,198418:None,198417:None,198416:None,198415:None,198414:None,198413:None,198412:None,198411:None,198410:None,198433:None,198432:None,198403:None,198421:None,198444:None,198445:None,198446:None,198447:None,198440:None,198441:None,198442:None,198443:None,198439:None,198448:None,198449:None,198438:None,198426:None,198427:None,198424:None,198425:None,198404:None,198405:None,198406:None,198407:None,198408:None,198409:None,198428:None,198429:None')]
Wed, 01 Nov 2017 01:15:31 CET(+0000):INFO:PostJob ====== Finished upload of output files metadata.

BAD JOB (#3969) https://cmsweb.cern.ch/scheddmon/0120/cms574/171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/postjob.3969.0.txt

Wed, 01 Nov 2017 00:45:28 CET(+0000):INFO:PostJob ====== Finished to check uploads to ASO database.
Wed, 01 Nov 2017 00:45:28 CET(+0000):INFO:PostJob Defering the execution of the post-job.
Ended TaskManagerBootstrap with code 4
Wed, 01 Nov 2017 01:15:28 CET(+0000):DEBUG:PostJob (Parsed job ad.)
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob Using the cache since it is up to date (last_query=1509495232.29) and it is after we submitted the transfer (aso_start_timestamp=1509493528)
Wed, 01 Nov 2017 01:15:28 CET(+0000):DEBUG:PostJob Using cached ASO results.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob Got statuses: done, done, done; 0.5 hours since transfer submit.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob All transfers finished.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob ====== Finished to monitor ASO transfers.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob ====== Finished to check for ASO transfers.
Wed, 01 Nov 2017 01:15:28 CET(+0000):INFO:PostJob ====== Starting upload of output files metadata.
Wed, 01 Nov 2017 01:15:28 CET(+0000):DEBUG:PostJob Uploading output metadata for EXO-RunIISummer15wmLHEGS-01667.root to https://cmsweb.cern.ch/crabserver/prod/filemetadata: [('outlfn', '/store/user/santocch/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_3969.root'), ('outdatasetname', u'/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/santocch-crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_RAWSIMoutput-4cf19b8891b49f6b5b98c3052289aa17/USER'), ('globalTag', 'None'), ('checksumadler32', u'30a7acfd'), ('outsize', 720358813), ('checksumcksum', u'3723168711'), ('outlocation', 'T2_IT_Bari'), ('jobid', '3969'), ('outtmplfn', '/store/temp/user/santocch.68589efd54e6b2f504806adbb10bd291f89ff9c4/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_3969.root'), ('appver', 'CMSSW_7_1_25_patch2'), ('outtype', 'EDM'), ('checksummd5', 'asda'), ('publishdataname', u'crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_RAWSIMoutput-4cf19b8891b49f6b5b98c3052289aa17'), ('taskname', '171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02'), ('acquisitionera', 'null'), ('outtmplocation', u'T2_RU_JINR'), ('events', 1235), ('directstageout', 0), ('outfileruns', '1'), ('outfilelumis', '198422:None,198420:None,198450:None,198401:None,198402:None,198423:None,198435:None,198434:None,198437:None,198436:None,198431:None,198430:None,198419:None,198418:None,198417:None,198416:None,198415:None,198414:None,198413:None,198412:None,198411:None,198410:None,198433:None,198432:None,198403:None,198421:None,198444:None,198445:None,198446:None,198447:None,198440:None,198441:None,198442:None,198443:None,198439:None,198448:None,198449:None,198438:None,198426:None,198427:None,198424:None,198425:None,198404:None,198405:None,198406:None,198407:None,198408:None,198409:None,198428:None,198429:None')]
Wed, 01 Nov 2017 01:15:29 CET(+0000):DEBUG:PostJob Uploading output metadata for EXO-RunIISummer15wmLHEGS-01667_inLHE.root to https://cmsweb.cern.ch/crabserver/prod/filemetadata: [('outlfn', '/store/user/santocch/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_inLHE_3969.root'), ('outdatasetname', u'/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/santocch-crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_LHEoutput-4cf19b8891b49f6b5b98c3052289aa17/USER'), ('globalTag', 'None'), ('checksumadler32', u'1b511b14'), ('outsize', 5442368), ('checksumcksum', u'406707745'), ('outlocation', 'T2_IT_Bari'), ('jobid', '3969'), ('outtmplfn', '/store/temp/user/santocch.68589efd54e6b2f504806adbb10bd291f89ff9c4/DYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX/crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/171031_160831/0003/EXO-RunIISummer15wmLHEGS-01667_inLHE_3969.root'), ('appver', 'CMSSW_7_1_25_patch2'), ('outtype', 'EDM'), ('checksummd5', 'asda'), ('publishdataname', u'crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_LHEoutput-4cf19b8891b49f6b5b98c3052289aa17'), ('taskname', '171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02'), ('acquisitionera', 'null'), ('outtmplocation', u'T2_RU_JINR'), ('events', 5000), ('directstageout', 0), ('outfileruns', '1'), ('outfilelumis', '198422:None,198420:None,198450:None,198401:None,198402:None,198423:None,198435:None,198434:None,198437:None,198436:None,198431:None,198430:None,198419:None,198418:None,198417:None,198416:None,198415:None,198414:None,198413:None,198412:None,198411:None,198410:None,198433:None,198432:None,198403:None,198421:None,198444:None,198445:None,198446:None,198447:None,198440:None,198441:None,198442:None,198443:None,198439:None,198448:None,198449:None,198438:None,198426:None,198427:None,198424:None,198425:None,198404:None,198405:None,198406:None,198407:None,198408:None,198409:None,198428:None,198429:None')]
Wed, 01 Nov 2017 01:15:31 CET(+0000):INFO:PostJob ====== Finished upload of output files metadata.
belforte commented 7 years ago

they actually lookOK to me now. Both job have the same hex hash string for the output dataset 4cf19b8891b49f6b5b98c3052289aa17 but crab status points to e6d8e5892aba7c71313869af36f313aa as in https://cmsweb.cern.ch/das/request?input=%2FDYJetsToLL_012j_Zpt-20ToInf_m10to50_5f_NLO_FXFX%2Fsantocch-crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_LHEoutput-e6d8e5892aba7c71313869af36f313aa%2FUSER&instance=prod%2Fphys03

belforte commented 7 years ago

I am getting a bit lost :-( better check in the schedd. SPOOL_DIR is (on vocms0120 ) /home/grid/cms574/171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/SPOOL_DIR

belforte commented 7 years ago

much better:

[root@vocms0120 SPOOL_DIR]# grep "output metadata" postjob.10.0.txt|grep RAW|tr '(' '\n'|grep publish
'publishdataname', u'crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_RAWSIMoutput-e6d8e5892aba7c71313869af36f313aa'), 
[root@vocms0120 SPOOL_DIR]# grep "output metadata" postjob.3969.0.txt|grep RAW|tr '(' '\n'|grep publish
'publishdataname', u'crab_EXO-RunIISummer15wmLHEGS-01667_generation_02_RAWSIMoutput-4cf19b8891b49f6b5b98c3052289aa17'), 
[root@vocms0120 SPOOL_DIR]# 

so they ARE different

belforte commented 7 years ago

indeed earlier I had pasted postjob log for job 3969 twice, also as job #10... oh well

mmascher commented 7 years ago

That hash value is read from the pset_hash value of the fjr. Indeed it is different there as well. Checking the job out to see if I can spot what caused the difference.

[mmascher@vocms0120 SPOOL_DIR]$ grep pset_hash job_fjr.10.0.json {"log_size": 345951, "task": null, "PrepID": "", "skippedFiles": [], "Campaign": "", "phedex_node": "T2_US_Vanderbilt", "temp_storage_site": "T2_US_Vanderbilt", "exitMsg": "OK", "executed_site": "T2_US_Vanderbilt", "storage_site": "T2_IT_Bari", "steps": {"cmsRun": {"status": 1, "errors": [], "logs": {}, "parameters": {}, "output": {"RAWSIMoutput": [{"runs": {"1": {"458": null, "459": null, "451": null, "452": null, "453": null, "454": null, "455": null, "456": null, "457": null, "498": null, "499": null, "494": null, "495": null, "496": null, "497": null, "490": null, "491": null, "492": null, "493": null, "500": null, "469": null, "468": null, "465": null, "464": null, "467": null, "466": null, "461": null, "460": null, "463": null, "462": null, "489": null, "488": null, "487": null, "486": null, "485": null, "484": null, "483": null, "482": null, "481": null, "480": null, "472": null, "473": null, "470": null, "471": null, "476": null, "477": null, "474": null, "475": null, "478": null, "479": null}}, "temp_storage_site": "T2_US_Vanderbilt", "events": 1222, "ouput_module_class": "PoolOutputModule", "branch_hash": "4480b247f2366fd960a5d0f0d95057bb", "pset_hash": "e6d8e5892aba7c71313869af36f313aa", "lfn": "", "pfn": "EXO-RunIISummer15wmLHEGS-01667.root", "catalog": "", "module_label": "RAWSIMoutput", "local_stageout": true, "checksums": {"adler32": "40612719", "cksum": "432815835"}, "guid": "6C15A670-56BE-E711-B7F0-842B2B6F8595", "storage_site": "T2_IT_Bari", "size": 719346104}], "LHEoutput": [{"runs": {"1": {"458": null, "459": null, "451": null, "452": null, "453": null, "454": null, "455": null, "456": null, "457": null, "498": null, "499": null, "494": null, "495": null, "496": null, "497": null, "490": null, "491": null, "492": null, "493": null, "500": null, "469": null, "468": null, "465": null, "464": null, "467": null, "466": null, "461": null, "460": null, "463": null, "462": null, "489": null, "488": null, "487": null, "486": null, "485": null, "484": null, "483": null, "482": null, "481": null, "480": null, "472": null, "473": null, "470": null, "471": null, "476": null, "477": null, "474": null, "475": null, "478": null, "479": null}}, "lfn": "", "ouput_module_class": "PoolOutputModule", "branch_hash": "baad18a51a4cbab54f85db48954b2717", "pset_hash": "e6d8e5892aba7c71313869af36f313aa", "temp_storage_site": "T2_US_Vanderbilt", "pfn": "EXO-RunIISummer15wmLHEGS-01667_inLHE.root", "catalog": "", "module_label": "LHEoutput", "local_stageout": true, "guid": "D4929370-56BE-E711-B7F0-842B2B6F8595", "checksums": {"adler32": "a517b063", "cksum": "3784311939"}, "storage_site": "T2_IT_Bari", "events": 5000, "size": 5452130}], "analysis": []}, "stop": null, "site": {}, "analysis": {}, "start": null, "cleanup": {}, "input": {}, "performance": {"multicore": {}, "storage": {"writeTotalMB": 691.22199999999998, "readPercentageOps": 1.0, "readAveragekB": 0.48862917668730493, "readTotalMB": 11936.6, "readNumOps": 25015036.0, "readCachePercentageOps": 0.0, "readMBSec": 254.66539511048248, "readMaxMSec": 1845.4200000000001, "readTotalSecs": 0.0, "writeTotalSecs": 0.97112900000000002}, "cpu": {"TotalJobCPU": "38858.7", "AvgEventCPU": "7.72408", "MaxEventTime": "127.427", "AvgEventTime": "10.0758", "MinEventCPU": "0.001999", "TotalEventCPU": "38620.4", "TotalJobTime": "50379", "MinEventTime": "0.00164485", "MaxEventCPU": "126.744"}, "memory": {"PeakValueRss": "764.785", "PeakValueVsize": "1231.62"}}}}, "fallbackFiles": [], "postjob": {"exitMsg": "", "exitCode": 0}, "exitAcronym": "OK", "exitCode": 0, "local_stageout": true, "jobExitCode": 0} [mmascher@vocms0120 SPOOL_DIR]$ grep pset_hash job_fjr.3969.0.json {"log_size": 354054, "task": null, "PrepID": "", "skippedFiles": [], "Campaign": "", "phedex_node": "T2_RU_JINR", "temp_storage_site": "T2_RU_JINR", "exitMsg": "OK", "executed_site": "T2_RU_JINR", "storage_site": "T2_IT_Bari", "steps": {"cmsRun": {"status": 1, "errors": [], "logs": {}, "parameters": {}, "output": {"RAWSIMoutput": [{"runs": {"1": {"198404": null, "198450": null, "198427": null, "198424": null, "198423": null, "198435": null, "198434": null, "198437": null, "198436": null, "198431": null, "198430": null, "198419": null, "198418": null, "198417": null, "198416": null, "198415": null, "198414": null, "198413": null, "198412": null, "198411": null, "198410": null, "198406": null, "198425": null, "198433": null, "198432": null, "198407": null, "198444": null, "198445": null, "198446": null, "198447": null, "198440": null, "198441": null, "198442": null, "198443": null, "198439": null, "198448": null, "198449": null, "198438": null, "198426": null, "198401": null, "198402": null, "198403": null, "198422": null, "198405": null, "198420": null, "198421": null, "198408": null, "198409": null, "198428": null, "198429": null}}, "temp_storage_site": "T2_RU_JINR", "events": 1235, "ouput_module_class": "PoolOutputModule", "branch_hash": "4480b247f2366fd960a5d0f0d95057bb", "pset_hash": "4cf19b8891b49f6b5b98c3052289aa17", "lfn": "", "pfn": "EXO-RunIISummer15wmLHEGS-01667.root", "catalog": "", "module_label": "RAWSIMoutput", "local_stageout": true, "checksums": {"adler32": "30a7acfd", "cksum": "3723168711"}, "guid": "50B05726-5CBE-E711-9929-0CC47AE0F33C", "storage_site": "T2_IT_Bari", "size": 720358813}], "LHEoutput": [{"runs": {"1": {"198404": null, "198450": null, "198427": null, "198424": null, "198423": null, "198435": null, "198434": null, "198437": null, "198436": null, "198431": null, "198430": null, "198419": null, "198418": null, "198417": null, "198416": null, "198415": null, "198414": null, "198413": null, "198412": null, "198411": null, "198410": null, "198406": null, "198425": null, "198433": null, "198432": null, "198407": null, "198444": null, "198445": null, "198446": null, "198447": null, "198440": null, "198441": null, "198442": null, "198443": null, "198439": null, "198448": null, "198449": null, "198438": null, "198426": null, "198401": null, "198402": null, "198403": null, "198422": null, "198405": null, "198420": null, "198421": null, "198408": null, "198409": null, "198428": null, "198429": null}}, "lfn": "", "ouput_module_class": "PoolOutputModule", "branch_hash": "baad18a51a4cbab54f85db48954b2717", "pset_hash": "4cf19b8891b49f6b5b98c3052289aa17", "temp_storage_site": "T2_RU_JINR", "pfn": "EXO-RunIISummer15wmLHEGS-01667_inLHE.root", "catalog": "", "module_label": "LHEoutput", "local_stageout": true, "guid": "DCED5126-5CBE-E711-9929-0CC47AE0F33C", "checksums": {"adler32": "1b511b14", "cksum": "406707745"}, "storage_site": "T2_IT_Bari", "events": 5000, "size": 5442368}], "analysis": []}, "stop": null, "site": {}, "analysis": {}, "start": null, "cleanup": {}, "input": {}, "performance": {"multicore": {}, "storage": {"writeTotalMB": 692.178, "readPercentageOps": 1.0, "readAveragekB": 0.48853512814446909, "readTotalMB": 11890.299999999999, "readNumOps": 24922804.0, "readCachePercentageOps": 0.0, "readMBSec": 699.83284481642352, "readMaxMSec": 7.8873600000000001, "readTotalSecs": 0.0, "writeTotalSecs": 0.56188499999999997}, "cpu": {"TotalJobCPU": "19169.4", "AvgEventCPU": "3.80674", "MaxEventTime": "75.6407", "AvgEventTime": "4.9276", "MinEventCPU": "0.000999", "TotalEventCPU": "19033.7", "TotalJobTime": "24638", "MinEventTime": "0.00121617", "MaxEventCPU": "75.6865"}, "memory": {"PeakValueRss": "875.453", "PeakValueVsize": "1234.43"}}}}, "fallbackFiles": [], "postjob": {"exitMsg": "", "exitCode": 0}, "exitAcronym": "OK", "exitCode": 0, "local_stageout": true, "jobExitCode": 0}

mmascher commented 7 years ago

These hashes are coming from edmProvDump, see the logs here:

https://cmsweb.cern.ch/scheddmon/0120/cms574/171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/job_out.3969.0.txt

https://cmsweb.cern.ch/scheddmon/0120/cms574/171031_160831:santocch_crab_EXO-RunIISummer15wmLHEGS-01667_generation_02/job_out.10.0.txt

Not sure what is wrong that makes the hash different though.. Maybe it is not CRAB3 fault this time?

belforte commented 7 years ago

I find it appealing to bump it back to the user :-)

You changed the provenance information, say thanks to CRAB for having pointed it out !

I am sure he can say "I do not know what EDM provenance is and I do not give a damn..." but, still :-)

belforte commented 7 years ago

maybe Singularity ? that's one thing different between those two jobs, resulting, if nothing else, in two different locations for the directory where the gridpack is located. Maybe this is tracked in full here ? this is a fragment from the pset as in the crab task cache

process.externalLHEProducer = cms.EDProducer("ExternalLHEProducer",
    nEvents = cms.untracked.uint32(160),
    outputFile = cms.string('cmsgrid_final.lhe'),
    scriptName = cms.FileInPath('GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh'),
    numberOfParameters = cms.uint32(1),
    args = cms.vstring('localgridpack')
)

it is intriguing that sites with Singularity would have a constant location: /srv/CMSSW_7_1_25_patch2/src/data/DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz while the others will usully have the pilot name in the directory path, resulting in a different hash per job.

santocch commented 7 years ago

I do not know what EDM provenance is and I do not give a damn...

:-)

Seriously... can you explain in simple words what's wrong with my config file? What I'm doing in the script I'm using (called by crab) is: 1) wget the gridpack from dropbox and put it in $CMSSW_BASE/src/data 2) put the correct path in the PSet.pkl... The commands I use are the following:

wget -O $CMSSW_BASE/src/data/DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz https://www.dropbox.com/s/96zl5kuwo4sttv1/DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz?dl=1
MY_GRIDPACK_LOCATION=$CMSSW_BASE/src/data/DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz
sed -i -e "s|localgridpack|${MY_GRIDPACK_LOCATION}|g" ./PSet.pkl
cmsRun -j FrameworkJobReport.xml PSet.py $MY_GRIDPACK_LOCATION

I spent many hours looking for a solution... there are requests in HN but I didn't get any help and that's the way I found working.

santocch commented 7 years ago

The problem is simple: how to ship to the WN a big files needed by cmssw? See here [0] and here [1] for the discussion. [0] https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1868.html [1] https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/3095/1/2/2/1.html

belforte commented 7 years ago

@santocch Try this in your pset.py change 'args = cms.vstring('localgridpack')' to 'args = cms.untracked.vstring('localgridpack')'

run a couple jobs at non singularity sites (e.g. whitelist Italy) and let's check the edmProvDump pset hash in the stdout can also compare with the current situation

The problem with your config file is that it tells CMSSW that this args is an importatnt configuration parameter whose value can affect the code output, and surely it is, but I guess you only wanted to track the file name, not the full path. An alternative is of course to copy the tar in the PWD so that in the pset you only put the file name !

santocch commented 7 years ago

Not good... CMSSW don't like it... all 3 jobs failed:

== CMSSW: In the configuration, parameter "args" is defined as an untracked vstring. == CMSSW: It should be tracked.

santocch commented 7 years ago

And also your second suggestion (If I read it correctly) don't work... putting just the file name gives the following error (of course the file is in the same test directory where the cmssw config file is and I run cmsRun):

gridpack tarball path = DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz
%MSG-MG5 number of events requested = 160
%MSG-MG5 random seed used for the run = 754933631
tar: DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw-patch/CMSSW_7_1_25_patch2/src/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh: line 37: ./runcmsgrid.sh: No such file or directory
mv: cannot stat `cmsgrid_final.lhe': No such file or directory
----- Begin Fatal Exception 07-Nov-2017 21:40:53 CET-----------------------
An exception of category 'OutputOpenError' occurred while
   [0] Processing run: 1
   [1] Calling beginRun for unscheduled module ExternalLHEProducer/'externalLHEProducer'
Exception Message:
Unable to open script output file cmsgrid_final.lhe.
----- End Fatal Exception -------------------------------------------------
belforte commented 7 years ago

maybe 'args = cms.untracked.vstring('./somfilename')' ? Of course if CMSSW developer made their best to make sure that the gridpack full location is tracked as relevant for the physics.... you may not find a way to beat them and have to bring this up with them. You can find a description of tracked vs. untracked parameters in here https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideAboutPythonConfigFile#Parameters

The idea is that the EDM Provenance makes it sure that configuration was not changed in relevant ways when producing data. And CRAB puts data with different provenance in different datasets exactly because they are expected not to be homogeneous.

Technically CRAB could surely be told to ignore provenance (need some devel. work though) but then the problem is for those users who change a config. file, run again, and w/o knowing it mix things in their output. Clearly here there's not clear technical best, it is a matter of judgement and opinions. I can only suggest to involve EDM developers and whatever you agree on, we will do. In the end the goal is to help users to do the physics and you are ideally placed to have responsibility for such choices as this one.

santocch commented 7 years ago

Just posted [0] a request to the EDM and Framework HY

[0] https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3801.html

belforte commented 7 years ago

In the end, clearly CRAB did what it is expected to do. No bug here. As to how best to support the use case of non-centrally-located gridpacks, let's defer to the debate that @santocch opend in EDM HN (see comment above), my preference is for that to be accomodated inside CMSSW or a custom EDM module, rather than via relaxing the link between dataset name and EDM provenance in CRAB

PerilousApricot commented 7 years ago

Uh, this use case is kinda ... weird? @santocch is trying to run jobs that wget 100MByte/job:

== CMSSW: --2017-10-31 19:53:15--  https://www.dropbox.com/s/96zl5kuwo4sttv1/DYJetsToLL_012j_Zpt-20ToInf_5f_NLO_FXFX_m10to50_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz?dl=1
<snip>
== CMSSW:  48950K .......... .......... .......... .......... .......... 27%  129M 3s

So, either:

1) They're not running a lot of jobs, so they could run it on a regular batch system (w/o crab) 2) They're running a ton of jobs and dropbox will cut them off for too many downloads, then every next job will fail.

PerilousApricot commented 7 years ago

Provenance matters for sure, but if you're doing private monte carlo, I don't think there's a good way to ship a 100MB tarball to every job without someone being upset about it.

belforte commented 7 years ago

+1

PerilousApricot commented 7 years ago

FWIW: we changed scram to not check the "production release" on non-interactive shells because that was DDOS-ing CERN, and that xml file was like 100kb....

santocch commented 7 years ago

The use case is the following:

1) I'd like to produce 10M events (but I could need more) 2) The gridpack is 170MB 3) A single job for the 1st step (LHE and RAWSIM) last more or less between 12 and 18 hours for 5000 events 4) I cant use more events for a single job because the allowed wall time for grid jobs is 24h 5) The matching procedure has an efficiency of more or less 25% so at the end of the 1step I have ~1250 events/job --> I need more or less 8000 jobs to complete the original task

I already sent more or less 15M events (but the efficiency is lower... because I have less events than expected in the final rawsim dataset... and that's what triggered this discussion) and for the moment dropbox doesn't complain... I know I could be cut off by them... any suggestion is again very wellcome

Attilio

PerilousApricot commented 7 years ago

The problem you're hitting in (2), (3), (4), (5) is you're trying to do too many things in too few steps. Split it up into separate steps that parallelize easily:

1) Generate LHE on your favorite batch system. The gridpack will be local and you can run an arbitrary number of jobs for as little time as you want 2) (optional) convert the LHE to EDM files and merge them 3) GEN-SIM-RAW on CRAB 4) RECO on CRAB etc....

I've produced 1M events for 26 different mass points (ISR Stau fastsim) over a weekend this way, and I screwed up halfway through.

belforte commented 7 years ago

24h is the defaul, but you can ask for up to 46h. And if you can run faster by going multicore, that'd be even more welcome. But a factor 2 in number of jobs will not drastically change things. Distribution of gridpacks to grid WN's is a fine topic, but not a CRAB Server one. If you really (maybe Andrew had the right solution here?) need to send a 100MB+ file to each job, there is no good way to do it. Best is to put that file on your favorite T2 and run all (most) jobs there.

PerilousApricot commented 7 years ago

Of course, 48h means a random sneeze on a worker node means you lose a 2 days of work :)

santocch commented 7 years ago

I agree... but when I started I didn't expect all these problems... but I miss a critical step... let's say I produce the LHE locally... and I have 40M LHE events locally in 400 files 100MB each. These files are not published in dbs... isn't the same problem? How can I ship these files to the worked nodes?

PerilousApricot commented 7 years ago

The size of the file isn't the problem -- 100MByte is nothing! (files produced centrally are between 2 and 4GByte) The problem is how to get those bytes from A to B.

As you observed, central production has spent a lot of time optimizing that process via (e.g.) CVMFS. Private MC production (unfortunately) can't take advantage of those optimizations, so you have to do things somewhat differently for the first step.

belforte commented 7 years ago

IIUC, Andrew idea is:

  1. run CRAB at your T2 reading gridpack locally, produce 400 file 100MB each (40GB total) and publish in DBS. THis is very fast.
  2. run CRAB over the grid for the CPU intensive part, reading those files via xrootd and put output in DBS this takes time and many job, so your T2 has to do some decent jobs with xrootd serving, but the overall rate at which data is pulled out is not a problem

Of course the best option is to have central production produce the useful gridpacks :-)

belforte commented 7 years ago

to close the CRAB thread, I have verified that all the jobs which lead to the "one file per dataset" ran on non-Singularity WN's, while all jobs which published in the same dataset ran on Singularity WN. So this definitely confirms that the only thing which happened is that when EDM provenance changed, CRAB put the data in a different dataset.

I hope that Attilio has done a small scale check of the process before sending Nx5000 jobs, so if the ptoblem was not spotted there it likely means that those tasks accidentally ran all on Singularity.

In my view the real problem here is that we had a series of issues due to recent changes in the CRAB code so thought that this could be the indication of another issue. Sorting out things with large tasks takes time. If the problem had been reported in a 5 job task, it would have been much easier.