CoffeaTeam / coffea-casa

Repository with configuration setup of a prototype of analysis facility - "coffea-casa"
BSD 3-Clause "New" or "Revised" License

Debugging KilledWorker exceptions appearing at scale #310

Open alexander-held opened 2 years ago

alexander-held commented 2 years ago

When running the CMS Open Data ttbar analysis on the UChicago coffea-casa instance over the full set of input files with a pure coffea setup, RuntimeError exceptions start appearing, typically about halfway through the pre-processing stage:

KilledWorker: ('automatic_retries-5a5ee0b8-ee99-4ff5-8534-7e2373a71078-19743', <WorkerState 'tls://c006.af.uchicago.edu:36499', name: htcondor--194595.0--, status: closed, memory: 0, processing: 74>)

RuntimeError: Work item FileMeta(https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/WJetsToLNu_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext2-v1/60002/1C9F04FB-72DB-E511-AFF7-0CC47A4D9A70.root:events) caused a KilledWorker exception (likely a segfault or out-of-memory issue)

The filename changes between repeated runs, so the failures do not seem to be tied to a specific problematic input file. Here is another example:

RuntimeError: Work item FileMeta(https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/ST_tW_top_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/70000/8E03C8E8-C7B8-E511-8A04-00259029E84C.root:events) caused a KilledWorker exception (likely a segfault or out-of-memory issue)

I am not sure how best to debug this further and would be happy to try out some suggestions.
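One way to check more systematically whether the failures follow particular input files is to collect the RuntimeError messages from several runs and compare the file names they mention. The following is only a minimal sketch: `failed_file` and `recurring_files` are hypothetical helper names, and the regular expression assumes the `FileMeta(<url>:<tree>)` message shape shown above.

```python
import re
from collections import Counter

# Assumed message shape: "... FileMeta(<url ending in .root>:<treename>) caused ..."
_FILEMETA = re.compile(r"FileMeta\((?P<url>\S+?\.root):(?P<tree>\w+)\)")


def failed_file(message):
    """Return the input file URL named in an error message, or None if absent."""
    match = _FILEMETA.search(message)
    return match.group("url") if match else None


def recurring_files(runs):
    """Return the files that failed in more than one run.

    `runs` is a list with one entry per run, each entry being the list of
    error messages collected from that run.
    """
    counts = Counter()
    for messages in runs:
        # Use a set so a file is counted at most once per run.
        counts.update({f for f in map(failed_file, messages) if f})
    return sorted(f for f, n in counts.items() if n > 1)
```

An empty result from `recurring_files` over several runs would support the observation that no single input file is responsible.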

alexander-held commented 2 years ago

A slightly different error appears with N_FILES_MAX_PER_SAMPLE = 100:

KilledWorker: ('TtbarAnalysis-522cecba9f095e893a09583293a1b218', <WorkerState 'tls://c029.af.uchicago.edu:35509', name: htcondor--194839.0--, status: closed, memory: 0, processing: 5>)

RuntimeError: Work item WorkItem(dataset='ttbar__nominal', filename='https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/TT_TuneCUETP8M1_13TeV-powheg-pythia8/MINIAODSIM//PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1/00000/9C747AED-4BC2-E511-BF56-AC853D9DACD3.root', treename='events', entrystart=0, entrystop=49051, fileuuid=b' \xdd\xfeB\xb8\xec\x11\xec\x98#\x02B\xac\x13\x00\x0e', usermeta={'xsec': 729.84, 'process': 'ttbar', 'nevts': 4370893, 'variation': 'nominal'}) caused a KilledWorker exception (likely a segfault or out-of-memory issue)
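To see whether the failures cluster on particular worker nodes or task types, the KilledWorker lines can be parsed and tallied. This is a rough sketch only: `parse_killed_worker` and `failures_per_host` are hypothetical helpers, the regular expression assumes the exact line format pasted above, and taking the text before the first `-` in the task key as the task name is a heuristic.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed line shape, copied from the messages above:
# KilledWorker: ('<key>', <WorkerState '<address>', name: ..., status: ..., memory: N, processing: N>)
_KILLED = re.compile(
    r"KilledWorker: \('(?P<key>[^']+)', "
    r"<WorkerState '(?P<address>[^']+)', name: (?P<name>[^,]+), "
    r"status: (?P<status>\w+), memory: (?P<memory>\d+), "
    r"processing: (?P<processing>\d+)>\)"
)


def parse_killed_worker(line):
    """Split a KilledWorker line into its fields, or return None if it does not match."""
    match = _KILLED.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["host"] = urlsplit(info["address"]).hostname
    # Heuristic: the task name is the part of the key before the first "-".
    info["task"] = info["key"].split("-", 1)[0]
    info["memory"] = int(info["memory"])
    info["processing"] = int(info["processing"])
    return info


def failures_per_host(lines):
    """Count how many KilledWorker lines name each worker host."""
    parsed = (parse_killed_worker(line) for line in lines)
    return Counter(p["host"] for p in parsed if p)
```

Applied to the two lines quoted in this thread, the `task` field would separate the `automatic_retries` failure from the `TtbarAnalysis` one, and `failures_per_host` would show whether one node accounts for most of the failures.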