crogan / KUEWKinoAnalysis

KU EWKino Analysis

Errors when running Build Fit Input #8

Closed caleb-james-smith closed 1 year ago

caleb-james-smith commented 2 years ago

Last week, when running Build Fit Input on condor at CMS LPC, I hit a recurring issue that affected many condor jobs across multiple submissions. It is related to a server error when loading large signal ROOT files (e.g. over 30 GB in size). These files are loaded in every job, and a large fraction of jobs hit the same server error, which then leads to a segmentation fault.

I ran the following on CMS LPC to submit condor jobs:

./BuildFitInputCondor.x -maxN 1 ++bkg +proc T2bW ++cat -year 2017 -lumi 137 --connect -path root://xrootd.unl.edu//store/user/zflowers/crogan/ -o test_BuildFitInput/

Example error message:

tar: write error
WARNING: In non-interactive mode release checks e.g. deprecated releases, production architectures are disabled.
Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: [3011] No servers are available to read the file.

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================

Example prints from log to check for open file:

is root://xrootd.unl.edu//store/user/zflowers/crogan/Summer16_102X_SMS/SMS-TSlepSlep_TuneCUETP8M1_13TeV-madgraphMLM-pythia8_Summer16_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Summer16_102X_SMS/SMS-TSlepSlep_TuneCP2_13TeV-madgraphMLM-pythia8_ext_Summer16_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Summer16_102X_SMS/SMS-TSlepSlep_mSlep-500To1300_TuneCUETP8M1_13TeV-madgraphMLM-pythia8_Summer16_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-T2bW_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-T2bW_X05_dM-10to80_genHT-160_genMET-80_mWMin-0p1_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-T2bW_X05_dM-10to80_2Lfilter_mWMin-0p1_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-T2tt_dM-10to80_genHT-160_genMET-80_mWMin-0p1_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-T2tt_dM-6to8_genHT-160_genMET-80_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-T2tt_mStop-400to1200_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 1
is root://xrootd.unl.edu//store/user/zflowers/crogan/Fall17_102X_SMS/SMS-TChiWZ_ZToLL_TuneCP2_13TeV-madgraphMLM-pythia8_Fall17_102X.root open? 

In this case, the segmentation fault occurred on the last file in the list (the TChiWZ sample), for which no open status was printed.

I avoided this error by adding a "SKIP_SIGNAL" flag that skips the InitSMS() statements for all signals, so that only background is processed. This works for background-only running, but a different fix is needed to run signal.

https://github.com/crogan/KUEWKinoAnalysis/blob/friday/src/SampleTool.cc#L237

https://github.com/crogan/KUEWKinoAnalysis/blob/friday/src/SampleTool.cc#L353
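
One possible direction for running signal, given here as an untested sketch and not as what SampleTool.cc currently does: retry the remote open a few times and check the result before using it, so that a transient [3011] server error ends in a clean failure instead of a segmentation fault. The helper name OpenWithRetry, its parameters, and the doubling delay are all hypothetical; the helper is written generically so it does not depend on ROOT.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not part of KUEWKinoAnalysis).
// Calls `open_once` up to `max_attempts` times and returns the first
// non-null pointer it produces. Between failed attempts it sleeps,
// doubling the delay each time. Returns nullptr if every attempt fails,
// so the caller must check the result before dereferencing it.
template <typename Opener>
auto OpenWithRetry(Opener open_once,
                   int max_attempts,
                   std::chrono::milliseconds initial_delay)
    -> decltype(open_once()) {
  std::chrono::milliseconds delay = initial_delay;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    auto handle = open_once();
    if (handle != nullptr) {
      return handle;  // opened successfully
    }
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(delay);  // wait before the next try
      delay *= 2;
    }
  }
  return nullptr;  // all attempts failed
}
```

With ROOT, the callable could wrap TFile::Open(url) and treat both a null pointer and a file for which IsZombie() is true as a failed attempt (deleting the zombie and returning nullptr). The caller would then skip or abort on a nullptr result instead of reading from an unopened file. Whether retries actually help depends on whether the xrootd error is transient, which has not been checked here.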

caleb-james-smith commented 1 year ago

This is a known feature: by default, InitSMS() is run for all signals, and the respective signal sample files need to exist. Skipping them with the SKIP_SIGNAL workaround is fine for now.