cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Fatal Root Error: @SUB=TBasket::ReadBasketBuffers #34393

Open haozturk opened 3 years ago

haozturk commented 3 years ago

Hi all,

With one production workflow, we are having the following issue while reading input files:

[a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416
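As a reading aid, the numbers in this message can be cross-checked with a few lines of Python. This is only a sketch: the interpretation of the fields given in the comments is an assumption based on my understanding of ROOT's TBasket, and `parse_basket_error` is a hypothetical helper name, not part of ROOT or CMSSW.

```python
import re

# The message exactly as quoted above.
MESSAGE = ("fNbytes = 15438, fKeylen = 137, fObjlen = 667416, "
           "noutot = 0, nout=0, nin=15301, nbuf=667416")


def parse_basket_error(text):
    """Return the name=value pairs of the message as a dict of ints."""
    pairs = re.findall(r"(\w+)\s*=\s*(\d+)", text)
    return {name: int(value) for name, value in pairs}


fields = parse_basket_error(MESSAGE)

# Assumption: nin is the compressed payload size, i.e. the on-disk size
# of the basket (fNbytes) minus the key header (fKeylen).
compressed_matches = fields["nin"] == fields["fNbytes"] - fields["fKeylen"]

# Assumption: nbuf is the size the payload should have after unzipping,
# which should equal the object length fObjlen.
target_matches = fields["nbuf"] == fields["fObjlen"]

# noutot is the total number of bytes actually produced by unzipping.
nothing_unzipped = fields["noutot"] == 0

print(compressed_matches, target_matches, nothing_unzipped)
```

For the message above, all three checks hold: the sizes recorded for the basket are consistent with each other, yet no bytes were decompressed.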

Workflow: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=ReReco-Run2017E-DoubleMuon-UL2017_MiniAODv2-00002

Job log: https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_ACDC1_Run2017E_DoubleMuon_UL2017_MiniAODv2_210604_153724_982/8021/DataProcessing/2a7c2fd2-3cbc-47b8-ad9b-9a3017657f50-0-2-logArchive/

We discussed the issue on JIRA and Matti suggested opening this issue. https://its.cern.ch/jira/browse/CMSCOMPPR-18887

Many thanks, Hasan

cmsbuild commented 3 years ago

A new Issue was created by @haozturk Hasan Öztürk.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 3 years ago

@pcanal thoughts?

makortel commented 3 years ago

assign core

cmsbuild commented 3 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 3 years ago

A bit more context from the log

Begin processing the 28654th record. Run 304292, Event 2152032250, LumiSection 1561 on stream 6 at 05-Jun-2021 02:42:57.484 CEST
R__unzipLZMA: error 9 in lzma_code
Begin processing the 28655th record. Run 304292, Event 2152271290, LumiSection 1561 on stream 7 at 05-Jun-2021 02:42:57.915 CEST
%MSG-e BuilderPluginException:   RecoTauPiZeroProducer:ak4PFJetsLegacyHPSPiZerosBoosted  05-Jun-2021 02:42:57 CEST Run: 304292 Event: 2152134202
Exception caught in builder plugin s, rethrowing
%MSG
----- Begin Fatal Exception 05-Jun-2021 02:42:57 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
%MSG-e BuilderPluginException:   PFRecoTauChargedHadronProducer:ak4PFJetsRecoTauChargedHadronsBoosted  05-Jun-2021 02:42:58 CEST Run: 304292 Event: 2152134202
Exception caught in builder plugin chargedPFCandidates, rethrowing
%MSG
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'Flag_BadPFMuonFilter'
   [2] Calling method for module BadParticleFilter/'BadPFMuonFilter'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::Track>' with ProductID '2:3564'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 05-Jun-2021 02:42:58 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 304292 lumi: 1561 event: 2152134202 stream: 4
   [1] Running path 'MINIAODoutput_step'
   [2] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module CandIsolatorFromDeposits/'muPFIsoValueCharged04PAT'
   [7] Prefetching for module CandIsoDepositProducer/'muPFIsoDepositChargedPAT'
   [8] Prefetching for module PFCandidateFwdPtrCollectionPdgIdFilter/'pfAllChargedHadronsPFBRECO'
   [9] Prefetching for module TPPFCandidatesOnPFCandidates/'pfNoPileUpIsoPFBRECO'
   [10] Prefetching for module PFPileUp/'pfPileUpIsoPFBRECO'
   [11] Calling method for module PFCandidateFwdPtrProducer/'particleFlowPtrs'
   [12] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [13] Running path 'MINIAODoutput_step'
   [14] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [15] Prefetching for module PATMETSlimmer/'slimmedMETs'
   [16] Prefetching for module CorrectedPATMETProducer/'patPFMetT1TauEnUp'
   [17] Prefetching for module ShiftedParticleMETcorrInputProducer/'shiftedPatMETCorrTauEnUp'
   [18] Prefetching for module PATTauRefSelector/'pfTaus'
   [19] Prefetching for module PATTauSelector/'selectedPatTaus'
   [20] Calling method for module PATTauProducer/'patTaus'
   [21] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [22] Running path 'MINIAODoutput_step'
   [23] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [24] Prefetching for module PATElectronSlimmer/'slimmedLowPtElectrons'
   [25] Prefetching for module PATElectronSelector/'selectedPatLowPtElectrons'
   [26] Prefetching for module PATElectronProducer/'patLowPtElectrons'
   [27] Calling method for module LowPtGsfElectronIDProducer/'lowPtGsfElectronID'
   [28] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [29] Running path 'MINIAODoutput_step'
   [30] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [31] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [32] Calling method for module MuonReducedTrackExtraProducer/'slimmedMuonTrackExtras'
   [33] Processing  Event run: 304292 lumi: 1561 event: 2152669940 stream: 0
   [34] Running path 'MINIAODoutput_step'
   [35] Prefetching for module PoolOutputModule/'MINIAODoutput'
   [36] Calling method for module OniaPhotonConversionProducer/'oniaPhotonCandidates'
   [37] Rethrowing an exception that happened on a different thread.
   [38] Reading branch recoPFCandidates_particleFlow__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 15438, fKeylen = 137, fObjlen = 667416, noutot = 0, nout=0, nin=15301, nbuf=667416

----- End Fatal Exception -------------------------------------------------
kskovpen commented 3 years ago

Sorry, does anyone know if this issue should also be addressed by #33697 ?

makortel commented 3 years ago

> Sorry, does anyone know if this issue should also be addressed by #33697 ?

No. The logs pointed to in the issue description mention 10_6_25, which includes #33697.

haozturk commented 2 years ago

Hi folks, do we have any update on this issue? I keep seeing production job failures occasionally. Not sure how to proceed w/ the corresponding workflows, e.g. https://its.cern.ch/jira/browse/CMSCOMPPR-21389

And a side question: I also see other types of "Fatal Root Error"s such as

Fatal Exception (Exit code: 8021)
An exception of category 'FileReadError' occurred while
[0] Constructing the EventProcessor
[1] Constructing input source of type PoolSource
[2] Reading branch EventAuxiliary
Additional Info:
[a] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffersSync
Storage::readv returned different size result=168532 expected=1096924

Do you think this is related to this issue or should we discuss this in a different thread?

davidlange6 commented 2 years ago

You could either exclude or reproduce the input file that creates this problem in the merge workflow.

haozturk commented 2 years ago

For my education: does this error occur because there is something wrong w/ the input file? If so, what exactly is wrong?

pcanal commented 2 years ago

There is something 'odd' with the file. It is an older file that is not well clustered (it predates the code that strengthened the entry clustering in TTree), and TTreeCache is failing to properly predict which baskets to read.
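To illustrate what "well clustered" can mean, here is a toy model in plain Python (no ROOT involved). It represents a tree as a mapping from branch name to the first entry of each basket, and it assumes, as a simplification, that a file is well clustered when every branch starts a new basket at every cluster boundary. The branch names and entry numbers are made up for illustration, and `misaligned_branches` is a hypothetical helper.

```python
def misaligned_branches(basket_starts, cluster_boundaries):
    """Return, per branch, the cluster boundaries at which no basket starts."""
    bad = {}
    for branch, starts in basket_starts.items():
        missing = sorted(set(cluster_boundaries) - set(starts))
        if missing:
            bad[branch] = missing
    return bad


# Hypothetical layouts: first entry number of each basket, per branch.
well_clustered = {
    "tracks": [0, 100, 200, 300],
    "muons": [0, 100, 150, 200, 300],
}
poorly_clustered = {
    "tracks": [0, 100, 200, 300],
    "muons": [0, 130, 260],
}
boundaries = [0, 100, 200, 300]

aligned_result = misaligned_branches(well_clustered, boundaries)
misaligned_result = misaligned_branches(poorly_clustered, boundaries)

print(aligned_result)
print(misaligned_result)
```

In the second layout the `muons` baskets straddle the cluster boundaries, so a reader that prefetches everything needed for one range of entries cannot simply take one aligned set of baskets per branch. As I understand the comment above, this is the kind of layout in which a prefetching cache has a harder time predicting what to read.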

haozturk commented 2 years ago

Okay, so we should either re-produce it w/ a new release or skip it as David suggested.

davidlange6 commented 2 years ago

It's a problem with an input file, yes. (Likely one of the n)

haozturk commented 2 years ago

Hi folks, I spotted this issue in a few RAW data files in the production of AOD. What would you suggest in this case? For instance:

  1. /store/data/Run2018B/ParkingBPH3/RAW/v1/000/318/877/00000/46A79200-2E7B-E811-A6EC-02163E0176AF.root
    Fatal Exception (Exit code: 8021)
    An exception of category 'FileReadError' occurred while
    [0] Rethrowing an exception that happened on a different thread.
    [1] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
    Additional Info:
    [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
    fNbytes = 4965274, fKeylen = 115, fObjlen = 7466474, noutot = 0, nout=0, nin=4965159, nbuf=7466474
  2. /store/data/Run2018B/ParkingBPH2/RAW/v1/000/318/877/00000/46C862F9-207B-E811-901C-02163E0176AF.root
    Fatal Exception (Exit code: 8021)
    An exception of category 'FileReadError' occurred while
    [0] Rethrowing an exception that happened on a different thread.
    [1] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
    Additional Info:
    [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
    fNbytes = 4494229, fKeylen = 115, fObjlen = 6653028, noutot = 0, nout=0, nin=4494114, nbuf=6653028

makortel commented 2 years ago

@pcanal Is there an easy or straightforward way for me to check if these files suffer from the "older file that is not well clustered and TTreeCache is failing to properly predict which basket to read" issue too?

haozturk commented 2 years ago

Hi folks, we have started to see this issue with more and more RAW files. I wanted to give this issue a gentle bump to understand how to proceed with the processing of such files. A recent example: /store/data/Run2018C/ParkingBPH3/AOD/20Jun2021_UL2018-v1/260005/FA1D9039-7044-7443-887D-68FBDBB732D5.root

haozturk commented 2 years ago

My previous example was an AOD file; sorry for the misstatement. Here is a RAW file which had a similar issue recently: /store/data/Run2018B/MuonEG/RAW/v1/000/318/874/00000/589245CE-197B-E811-BA56-FA163E34467F.root

makortel commented 2 years ago

I tested /store/data/Run2018B/MuonEG/RAW/v1/000/318/874/00000/589245CE-197B-E811-BA56-FA163E34467F.root with cmsDriver.py test -s RAW2DIGI --conditions auto:run2_data --datatier RECO --eventcontent RECO --data --process reRECO --scenario pp --era Run2_2018 --no_exec (adding the file to the generated configuration manually), which AFAICT should be sufficient to read FEDRawDataCollection_rawDataCollector__LHC entirely. I was able to process the full file in CMSSW_10_6_30_patch1 and in CMSSW_12_2_1.

@haozturk I presume you have already retried the processing of this file; does it fail repeatedly? If that is the case, could you give more details (like release, PSet, log file)?

haozturk commented 2 years ago

Hi @makortel yes it was retried multiple times. Please find the details below:

  1. The workflow
  2. Full log for a failed job
  3. PSet
  4. "CMSSWVersion": "CMSSW_10_6_30"

Please let me know if you need more info

makortel commented 2 years ago

Thanks @haozturk, I was in fact able to reproduce the issue with that file. I'll run some more tests.

makortel commented 2 years ago

I was able to reproduce the failure on 1 thread (and skipping events close to the "problematic one", 492405610:307:318874).

makortel commented 2 years ago

Actually the problematic file is /store/data/Run2018B/MuonEG/RAW/v1/000/318/874/00000/52727573-1A7B-E811-8786-02163E0176AF.root. The job processes data from two files, and the first file /store/data/Run2018B/MuonEG/RAW/v1/000/318/874/00000/589245CE-197B-E811-BA56-FA163E34467F.root gets processed fine.

makortel commented 2 years ago

With the right file, even my simple RAW2DIGI test fails in the same way in 10_6_30, 10_6_30_patch1, 12_2_1, and 12_4_0_pre2.

makortel commented 2 years ago

The file fails similarly in 10_1_7, which was used in data taking at the time (on HLT at least; I'm not sure about Tier0 repacking, but DAS reports 10_1_X as the release for the corresponding dataset).

@pcanal Does this mean the file has been corrupted or something?

I was able to process the file locally by dropping the LuminosityBlock 307.
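For reference, dropping a LuminosityBlock can be expressed in the job configuration. The sketch below only builds the configuration line as text, so it runs without a CMSSW environment. It assumes the `lumisToSkip` parameter of PoolSource and the usual `run:lumi` range syntax; `lumis_to_skip_line` is a hypothetical helper, and run 318874 is taken from the file path and the event ID quoted above.

```python
def lumis_to_skip_line(run_lumi_pairs):
    """Build a config line asking the source to skip the given (run, lumi) pairs."""
    ranges = ", ".join(f"'{run}:{lumi}'" for run, lumi in run_lumi_pairs)
    return (
        "process.source.lumisToSkip = "
        f"cms.untracked.VLuminosityBlockRange({ranges})"
    )


# LuminosityBlock 307 of run 318874, as discussed in this thread.
line = lumis_to_skip_line([(318874, 307)])
print(line)
```

The printed line would be appended to a configuration in which `process` and `cms` are already defined; whether this matches what was done in the local test above is not stated in the thread.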

haozturk commented 1 year ago

Hi all, this issue still persists in production. Can we do something about it?

makortel commented 1 year ago

Do the logs contain R__unzipLZMA: error 9 in lzma_code? As far as I can tell, the input file is corrupted in such cases.
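A check like this can be scripted over job logs. The following is a minimal sketch, assuming the log is available as plain text; `lzma_error_codes` is a hypothetical helper, and the sample lines are copied from the log excerpt earlier in this thread.

```python
import re

# Matches the decompression error line as it appears in the job log.
LZMA_ERROR = re.compile(r"R__unzipLZMA: error (\d+) in lzma_code")


def lzma_error_codes(log_text):
    """Return the error codes of all R__unzipLZMA lines found in the log."""
    return [int(code) for code in LZMA_ERROR.findall(log_text)]


sample_log = "\n".join([
    "Begin processing the 28654th record. Run 304292, Event 2152032250, "
    "LumiSection 1561 on stream 6 at 05-Jun-2021 02:42:57.484 CEST",
    "R__unzipLZMA: error 9 in lzma_code",
    "Begin processing the 28655th record. Run 304292, Event 2152271290, "
    "LumiSection 1561 on stream 7 at 05-Jun-2021 02:42:57.915 CEST",
])

codes = lzma_error_codes(sample_log)
print(codes)
```

An empty list means the marker is absent from that log, which by itself does not rule out a corrupted input file.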

Can you exclude the "offending" LuminosityBlock? (that helped earlier in a local test) The alternative would be to exclude full files. (assuming job resubmissions, or use of different copies of the input files, if available, have not helped)

Would a specific exit code for decompression failures help operations?

How often and with what kind of input files do these failures occur?

haozturk commented 1 year ago

> Do the logs contain R__unzipLZMA: error 9 in lzma_code? As far as I can tell, the input file is corrupted in such cases.

No. I haven't checked all the affected workflows, but the ones I checked don't have it.

Can you exclude the "offending" LuminosityBlock? (that helped https://github.com/cms-sw/cmssw/issues/34393#issuecomment-1086438156) The alternative would be to exclude full files. (assuming job resubmissions, or use of different copies of the input files, if available, have not helped)

This sounds like something concerning central services. Perhaps @amaltaro @todor-ivanov can comment.

Would a specific exit code for decompression failures help operations?

Absolutely! We know that we can recover some 8021 FileRead errors, but not these.

Btw, this is not the only "FatalRootError" that we see under exit code 8021, see [1,2]. Should we include all FatalRootError's under a new exitCode or have a separate one for each one? From the operations perspective, all such fatal root errors aren't recoverable since something is 'wrong' w/ the files themselves.

[1] https://github.com/cms-sw/cmssw/issues/34686 [2] https://github.com/cms-sw/cmssw/issues/33361

How often and with what kind of input files do these failures occur?

It gets our attention when it's affecting ReReco workflows or causing significant stats loss in MC workflows. I see maybe a few workflows getting affected each week. Some weeks I don't see any issue. Since there is no separate exitCode for this issue, it's hard to give good stats.

I cannot see a pattern in the input file types. Here are a few:

/store/data/Run2022C/JetMET/MINIAOD/PromptReco-v1/000/357/271/00000/89f4be34-fd7e-4402-bfaf-b3963377239a.root
/store/mc/RunIISummer20UL18RECO/NMSSM_XToYHTo2Z2BTo2L2J2B_MX-1500_MY-800_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/60000/341E1E4B-FC35-3741-8379-796DFE61E248.root
/store/unmerged/RunIISummer20UL18MiniAODv2/NMSSM_XToYHTo2Z2BTo2L2J2B_MX-4000_MY-70_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/60000/EB68BF54-694F-D849-A7E6-E05CA0DA8AE4.root

makortel commented 1 year ago

Thinking a bit more, I think a new exit code probably would not help much (what I had in mind was something along the lines of "decompression failure", but in a sense that would only be a subcategory of a generic read error). When reading data, we don't know whether the data stream becomes corrupt because the file itself is corrupt or because of some transport layer in between.

Only if the same read error pattern repeats does it become more likely that the file itself is corrupt. In general, confirming that a file is corrupt requires reading it.

I tried out the files

/store/data/Run2022C/JetMET/MINIAOD/PromptReco-v1/000/357/271/00000/89f4be34-fd7e-4402-bfaf-b3963377239a.root
/store/mc/RunIISummer20UL18RECO/NMSSM_XToYHTo2Z2BTo2L2J2B_MX-1500_MY-800_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/60000/341E1E4B-FC35-3741-8379-796DFE61E248.root
/store/unmerged/RunIISummer20UL18MiniAODv2/NMSSM_XToYHTo2Z2BTo2L2J2B_MX-4000_MY-70_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/60000/EB68BF54-694F-D849-A7E6-E05CA0DA8AE4.root

by first copying them to local disk with xrdcp (via the global redirector), and then running a simple job in 12_4_10 (just the PoolSource with process.source.delayReadingEventProducts = cms.untracked.bool(False) to force the job to read everything; at least the job was able to fail with a truly corrupted file). In all three cases the job succeeded. In the first two cases the adler32 checksum is the same as in DAS (third file is not in DAS so I can't check).
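The checksum comparison on a local copy can be sketched in a few lines of Python. This is a minimal sketch, assuming the catalogue reports adler32 as a hex string; `adler32_of_file` and `matches_catalogue` are illustrative names, not part of any CMS tool.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as an 8-digit lowercase hex string."""
    checksum = 1  # adler32 starting value defined by zlib
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the file is never fully in memory.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def matches_catalogue(path, expected_hex):
    """Compare the local checksum with a hex value copied from the catalogue.

    Assumption: the catalogue value may be upper case or lack leading zeros.
    """
    return adler32_of_file(path) == expected_hex.strip().lower().zfill(8)
```

A matching checksum only says that the local copy equals what the catalogue recorded; it does not rule out a file that was already damaged when the checksum was registered, which is why the read test is still needed.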

So in these three cases the file itself seems to be ok (at least on the site the global redirector pointed to), and the read errors are likely caused by something going wrong during transport (or some other site-specific error), or by bugs in (mostly earlier) software.

Is there any systematic pattern within sites?
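To look for such a pattern, one could tally the failed jobs by file and by site. The following is only a sketch under stated assumptions: the input is a list of (file, site) pairs extracted from job logs by some other means, and the file and site names below are placeholders, not real failures.

```python
from collections import defaultdict


def summarize_failures(records, min_sites=2):
    """Group read failures by file and by site.

    records: iterable of (lfn, site) pairs, one per failed job.
    Returns (suspect_files, failures_per_site), where suspect_files are the
    files that failed at min_sites or more distinct sites.
    """
    sites_by_file = defaultdict(set)
    failures_per_site = defaultdict(int)
    for lfn, site in records:
        sites_by_file[lfn].add(site)
        failures_per_site[site] += 1
    suspect_files = sorted(
        lfn for lfn, sites in sites_by_file.items() if len(sites) >= min_sites
    )
    return suspect_files, dict(failures_per_site)


# Placeholder records for illustration only.
records = [
    ("/store/example/a.root", "T2_XX_SiteA"),
    ("/store/example/a.root", "T2_YY_SiteB"),
    ("/store/example/b.root", "T2_XX_SiteA"),
    ("/store/example/c.root", "T2_XX_SiteA"),
]
suspect_files, failures_per_site = summarize_failures(records)
print(suspect_files)
print(failures_per_site)
```

A file that fails at several sites points towards the file itself (or a shared replica), while many different files failing at one site points towards that site or its transport layer.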

makortel commented 1 year ago

By the way, there are several reports of (possibly) corrupted MiniAOD files in the Computing Tools CMS Talk

Maybe there would be some synergies in figuring out whether a given file on a given storage element is corrupt?

todor-ivanov commented 1 year ago

Hi @haozturk @makortel

Sorry for the late reply here; I just noticed you tagged us a few days ago.

Can you exclude the "offending" LuminosityBlock? ...

This sounds like something concerning central services. Perhaps @amaltaro @todor-ivanov can comment.

One can create a LumiList at request creation time. But this needs to be a full list using one of the formats mentioned here. I am not aware of a way to exclude a single luminosity block. So the full set of lumis described in the LumiList should be split in two pieces - before and after the offending one.
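The split around the offending lumi can be sketched as follows. This assumes the compact JSON layout that maps a run number (as a string) to a list of [first, last] lumi ranges; `exclude_lumi` is a hypothetical helper, and the range 1-400 is an illustrative input, not the real content of run 318874.

```python
import json


def exclude_lumi(lumi_list, run, bad_lumi):
    """Return a copy of lumi_list with bad_lumi removed from the given run.

    Any range containing bad_lumi is split into the part before it and the
    part after it; all other ranges and runs are copied unchanged.
    """
    run = str(run)
    result = {}
    for r, ranges in lumi_list.items():
        if r != run:
            result[r] = [list(rng) for rng in ranges]
            continue
        kept = []
        for first, last in ranges:
            if bad_lumi < first or bad_lumi > last:
                kept.append([first, last])
                continue
            if first < bad_lumi:
                kept.append([first, bad_lumi - 1])
            if bad_lumi < last:
                kept.append([bad_lumi + 1, last])
        if kept:  # drop the run entirely if nothing is left
            result[r] = kept
    return result


# Illustrative input: pretend the run covers lumis 1-400 and 307 is bad.
print(json.dumps(exclude_lumi({"318874": [[1, 400]]}, 318874, 307)))
# → {"318874": [[1, 306], [308, 400]]}
```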

todor-ivanov commented 1 year ago

And just to continue on that chain of thoughts.

There is no way for us to skip a file from a data block. We can create lists of Allowed/NotAllowed data blocks or runs during submission and assignment time, but not lists of files. So if a file is proven bad and we are sure this is not an infrastructure issue (as Matti correctly points out, this could easily turn out to be the case), then the file needs to be invalidated for good; we have no mechanism to skip anything on a file basis. @amaltaro please correct me here if I am wrong.