cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT Farm crashes in run 378366~378369 #44541

Open wonpoint4 opened 6 months ago

wonpoint4 commented 6 months ago

Report the large numbers of GPU-related HLT crashes yesterday (elog)

Here's a recipe to reproduce the crashes (tested with CMSSW_14_0_3 on lxplus8-gpu):

#!/bin/bash -ex

hltGetConfiguration adg:/cdaq/cosmic/commissioning2024/v1.1.0/HLT/V2 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240325_run378367/files/run378367_ls0016_index000315_fu-c2b05-11-01_pid2219084.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Here's another way to reproduce the crashes.

# log in to an online GPU development machine (or lxplus8-gpu) and create a CMSSW area for 14.0.2
cmsrel CMSSW_14_0_2
cd CMSSW_14_0_2/src
cmsenv
# copy the HLT configuration that reproduces the crash and run it
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378366 > hlt_run378366.py
cat after_menu.py >> hlt_run378366.py ### See after_menu.py below
mkdir run378366
cmsRun hlt_run378366.py &> run378366.log

vi after_menu.py

from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream',
    runNumber = 378366
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run378366/run378366_ls0001_index000000_fu-c2b03-05-01_pid1739399.raw',
    )
)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

cmsbuild commented 6 months ago

cms-bot internal usage

cmsbuild commented 6 months ago

A new Issue was created by @wonpoint4.

@antoniovilela, @smuzaffar, @rappoccio, @Dr15Jones, @sextonkennedy, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 6 months ago

assign hlt, heterogeneous

cmsbuild commented 6 months ago

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 6 months ago

Running the reproducer with CUDA_LAUNCH_BLOCKING=1 shows

terminate called after throwing an instance of 'std::runtime_error'
  what():
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

#3  0x00007f2d11fbf720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f2d18272acf in raise () from /lib64/libc.so.6
#6  0x00007f2d18245ea5 in abort () from /lib64/libc.so.6
#7  0x00007f2d18c4ea49 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00007f2d18c5a06a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007f2d18c590d9 in __cxa_call_terminate (ue_header=0x7f2c68e82820) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00007f2d18c597f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f2c69ff8380) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#11 0x00007f2d1881f864 in _Unwind_RaiseException_Phase2 (exc=0x7f2c68e82820, context=0x7f2c69ff8380, frames_p=0x7f2c69ff8288) at ../../../libgcc/unwind.inc:64
#12 0x00007f2d188202bd in _Unwind_Resume (exc=0x7f2c68e82820) at ../../../libgcc/unwind.inc:242
#13 0x00007f2d0e2c2f5c in cms::cuda::free_device(int, void*) [clone .cold] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libHeterogeneousCoreCUDAUtilities.so
#14 0x00007f2ca620e028 in HBHERecHitProducerGPU::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) [clone .cold] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/pluginRecoLocalCaloHcalRecProducers.so
#15 0x00007f2d1ada1959 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#16 0x00007f2d1ada8099 in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#17 0x00007f2d1ad7b412 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#18 0x00007f2d1ad7b596 in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f2d1ad18b0f in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_2/lib/el8_amd64_gcc12/libFWCoreFramework.so

FYI @cms-sw/hcal-dpg-l2

abdoulline commented 6 months ago

The problem was caused by a change in the HCAL HB/HE raw data, namely the position of the trigger TS (= SOI, "Sample Of Interest") in the 8-TS digi array. The change was made on Sunday night (March 24) and was originally planned only for local LED runs, but (unwantedly) stayed in subsequent GRs... It has now been reverted to the nominal configuration.

Thanks to the clarification from @mariadalfonso (who's in the US for a workshop), HCAL@GPU does assume both a fixed number of TS (8) and a fixed SOI (the 4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return from the US.

mmusich commented 6 months ago

So, an additional protection/warning will be added to HCAL@GPU upon Maria's return back from US

For the record in neighboring runs there have been also crashes in the online DQM, see e.g.: 378366. It would be interesting to know if that's due to the same kind of change (in that case a protection in the CPU code might be needed as well).

abdoulline commented 6 months ago

@mmusich yes, the origin of the DQM crashes is the same. It (the SOI move) revealed a lack of protection in one of the HCAL reco components (the signal time fit in MAHI) added at the end of 2022. It has been tracked down to a couple of "suboptimal" lines. A protection/workaround is being discussed.

fwyzard commented 6 months ago

... if and when we have a full Alpaka implementation of the HCAL reconstruction, we will have a single code base to maintain :)

kakwok commented 6 months ago

I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations

syuvivida commented 6 months ago

Hi @abdoulline @lwang046 Is there an estimate of when the hcalreco DQM client (and maybe the other HCAL client as well?) will be updated? Thanks!!

Eiko for DQM-DC

abdoulline commented 6 months ago

Hi @syuvivida I suppose it shouldn't be a major issue/showstopper (as it wasn't in 2023), now that the HCAL digi format is back to the regular one after the aforementioned accident. It's rather a question of implementing additional protection, right?
The HCAL reconstruction convener, @igv4321, has been contacted (the "hcalreco" in question is used everywhere, not only in DQM).

syuvivida commented 6 months ago

Hi @abdoulline indeed, I was referring to adding the protection in the hcalreco client, sorry for not being explicit earlier. It is not a major issue now, but it would be nice to have the code in place before things are forgotten (as many new things may appear when the 13.6 TeV collisions arrive). Thanks!!

Eiko

abdoulline commented 6 months ago

@syuvivida
sure, we'll report to this open issue (to eventually ask for its closure).

saumyaphor4252 commented 6 months ago

@abdoulline We are now also seeing some failures in T0 Prompt processing jobs with similar symptoms. See

abdoulline commented 6 months ago

@saumyaphor4252 yes, it was kind of predictable, unfortunately... I'm afraid all the runs in the range 378361-378467 (the first "regular" HCAL settings were back in 378468) are affected. If we exclude the runs that don't have HCAL in global, it's 378361-378432. Can those be excluded/invalidated, as the HCAL digi settings/configuration were "non-standard" anyway?

@igv4321 FYI

abdoulline commented 5 months ago

Just to add explicitly @mariadalfonso

missirol commented 4 months ago

@cms-sw/hcal-dpg-l2

The problem has been caused by a change in the HCAL HB/HE raw data, i.e. a position of the trigger TS (=SOI for "Sample Of Interest") in 8-TS Digi array, which was done on Sunday night (March 24) and was originally planned only for local LED runs, but (unwantedly) stayed in subsequent GRs... Now it's reverted back to the nominal configuration.

Thanks to the clarification of @mariadalfonso (who's in US for a Workshop) HCAL@GPU does imply both fixed number of TS (8) and SOI (4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return back from US.

Will this be done for the CUDA implementation ?

missirol commented 4 months ago

@kakwok

I'll make sure the Alpaka implementation has some protection against different SOI/TS configurations

Is this included in https://github.com/cms-sw/cmssw/pull/44910 ? (if so, where ? just out of curiosity)

kakwok commented 4 months ago

The problem has been caused by a change in the HCAL HB/HE raw data, i.e. a position of the trigger TS (=SOI for "Sample Of Interest") in 8-TS Digi array, which was done on Sunday night (March 24) and was originally planned only for local LED runs, but (unwantedly) stayed in subsequent GRs... Now it's reverted back to the nominal configuration.

Thanks to the clarification of @mariadalfonso (who's in US for a Workshop) HCAL@GPU does imply both fixed number of TS (8) and SOI (4th TS). So, an additional protection/warning will be added to HCAL@GPU upon Maria's return back from US.

Hi @missirol, thanks for bringing this up, it's not included in #44910 yet. The issue seems to be a misconfiguration that MAHI does not currently support. I need more information about

Maybe @abdoulline or @mariadalfonso will have some idea about these questions? Then we can discuss whether to include these changes in #44910.

abdoulline commented 4 months ago

@kakwok @mariadalfonso To my knowledge, MAHI (as is) cannot cope with a moved/changed (other than == 3) SOI position. So, we're talking (just) about letting MAHI die gracefully (instead of provoking a segfault) with an appropriate LogError. MAHI input is a QIE11DigiCollection with QIE11DataFrame constituents, having:

bool soi()
https://cmssdt.cern.ch/lxr/source/DataFormats/HcalDigi/interface/QIE11DataFrame.h#0044
which is used for calculating

int presamples()
https://cmssdt.cern.ch/lxr/source/DataFormats/HcalDigi/interface/QIE11DataFrame.h#0079

Normally presamples == 3. Otherwise it is bad data originating from a misconfigured HCAL (as it was back on March 24-25), which shouldn't happen.

fwyzard commented 4 months ago

@abdoulline thanks for the comments and suggestions.

IMHO there are various options that would work better than the current failure mode:

The LogError is fine - even though nobody will likely see it.

abdoulline commented 4 months ago

@fwyzard I agree - it's fair enough to detect an unexpected SOI shift in the unpacking step, before reconstruction. This would need a configurable parameter for the expected SOI (to compare against). The same holds for the "entrance" of the reconstruction.

But the source of the problem was a general HCAL misconfiguration, and (I'd think) it had better be spotted and fixed asap rather than be mitigated somehow on the fly?

An empty collection of RecHits would mean a large part of HCAL is out. It would severely alter most of the triggers, I suppose.

M0 is a very poor replacement for MAHI in HE (not only the absence of PU mitigation, it could also induce an energy scale difference), and it uses TS window limits from the DB (so they would need to be re-adjusted on the fly...).

Now (if it's not just about stopping the jobs) this issue may need to be discussed in HCAL DPG. 🤔

fwyzard commented 4 months ago

But the source of the problem was a general HCAL misconfiguration and (I'd think) it better be spotted and fixed asap rather than to be mitigated somehow on the fly?

I agree, but crashing the whole HLT farm is not the right way to detect the problem.

I'm happy with any solution that makes it clear the data is bad, but does not require cleaning up about 200 HLT nodes.

mariadalfonso commented 4 months ago

Was this again a Phase Scan? For these technical runs we should have another sequence, i.e. the CPU version.

MAHI on CPU can cope with a shift and also with an extended number of time slices (i.e. from 8 to 10), but the GPU-CUDA implementation is all kind of frozen. Since it is being rewritten, we should solve this directly there.

abdoulline commented 4 months ago

Hi Maria
@mariadalfonso

no, there have been no new instances of the issue since March 24-25 (the HCAL misconfig). It was just a return to the pending subject...

So, the goal is (1) to not stop the HLT farm, and (2) to detect the problem (make it known) asap if it happens, so that HCAL can be reconfigured asap.

abdoulline commented 4 months ago

@kakwok

just would like to draw your attention to Maria's suggestion:

Mahi on CPU can cope with shift and also extended number of timeslices i.e. from 8 to 10, but for the GPU-CUDA implementation is all kind of frozen. Since is being rewritten we should solve this directly there.

kakwok commented 4 months ago

@abdoulline The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future.

mmusich commented 2 months ago

The current PR is already very big. I would prefer to implement functional changes after integrating the current PR. This will make the validation and integration much easier. But let's keep this improvement in mind for the (near) future.

just for the record, mahi @ alpaka still crashes:

#!/bin/bash -ex

# List of run numbers
runs=(378366 378369)

# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  # export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration /online/collisions/2024/2e34/v1.4/HLT/V2 \
    --globaltag ${global_tag} \
    --data \
    --eras Run3 \
    --l1-emulator uGT \
    --l1 L1Menu_Collisions2024_v1_3_0_xml \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

results in:

Thread 1 (Thread 0x7fe5a92e5640 (LWP 1447698) "cmsRun"):
#0  0x00007fe5a9ec0ac1 in poll () from /lib64/libc.so.6
#1  0x00007fe5a20660cf in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe5a201a1ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fe5a201a370 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe4f1e0072e in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 3ul>, unsigned int, alpaka_serial_sync::hcal::reconstruction::mahi::Kernel_prep_pulseMatrices_sameNumberOfSamples, float*, float*, float*, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, float*, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, signed char*, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecoParamWithPulseShapeT<alpaka::DevCpu>::ConstView const&, float const&, float const&, float const&, bool const&, float const&, float const&, float const&>::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#6  0x00007fe4f1e09388 in alpaka_serial_sync::hcal::reconstruction::runMahiAsync(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase0DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalPhase1DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecHitSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, true>, hcal::HcalMahiConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalSiPMCharacteristicsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, hcal::HcalRecoParamWithPulseShapeT<alpaka::DevCpu>::ConstView const&, hcal::HcalMahiPulseOffsetsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, alpaka_serial_sync::hcal::reconstruction::ConfigParameters const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#7  0x00007fe4f1ddde29 in alpaka_serial_sync::HBHERecHitProducerPortable::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#8  0x00007fe4f1de03d3 in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducersPluginsPortableSerialSync.so
#9  0x00007fe5ac93b4cf in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fe5ac91fc6c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007fe5ac8a7f69 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007fe5ac8a84d5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe5aca5a1d8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so
#14 0x00007fe5ab051281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fe5ac82942b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fe5ac83325d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fe5ac8337c1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fe5ab03d9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#22 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#23 0x000000000040517c in main ()

Current Modules:

Module: alpaka_serial_sync::HBHERecHitProducerPortable:hltHbheRecoSoASerialSync (crashed)

A fatal system signal has occurred: segmentation violation

@kakwok any plans about this?

kakwok commented 2 months ago

Has there been any change of the HCAL configuration for the number of TS in the digi recently?

On Wed, Jul 24, 2024, 20:15 Marco Musich @.***> wrote:


ace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322

15 tbb::detail::r1::task_dispatcher::local_wait_for_all (waiter=..., t=, this=0x7fe5a7cdbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_10

pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458

16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1

-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168

17 0x00007fe5ac82942b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so

18 http://cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so#18 0x00007fe5ac83325d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so

19 http://cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so#19 0x00007fe5ac8337c1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so

20 http://cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_12_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so#20 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()

21 0x00007fe5ab03d9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8

_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688

22 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()

23 0x000000000040517c in main ()

Current Modules:

Module: alpaka_serial_sync::HBHERecHitProducerPortable:hltHbheRecoSoASerialSync (crashed)

A fatal system signal has occurred: segmentation violation
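The crashed module is the CPU (serial-sync) HCAL rechit producer, and the frame in the mahi reconstruction kernel is consistent with indexing more time samples (TS) per digi than the input actually carries. As an illustration only, a generic guard of the kind discussed in this thread could compare the TS count in the digis against what the pulse fit expects and raise a catchable error instead of reading out of bounds; `validate_digis` below is a hypothetical helper, not an actual CMSSW interface:

```python
def validate_digis(samples_per_digi: int, expected_samples: int) -> None:
    """Hypothetical protection: refuse to reconstruct digis that carry
    fewer time samples (TS) than the pulse fit expects, instead of
    letting the kernel index past the end of the ADC buffer."""
    if samples_per_digi < expected_samples:
        raise ValueError(
            f"digi carries {samples_per_digi} TS, "
            f"reconstruction expects {expected_samples}"
        )

validate_digis(8, 8)      # consistent configuration: passes silently
try:
    validate_digis(4, 8)  # misconfigured run: rejected instead of crashing
except ValueError as err:
    print("rejected:", err)
```

With a check of this kind, a misconfigured TS setting would surface as a framework exception naming the offending module rather than a segmentation violation.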

@kakwok any plans about this?


mmusich commented 2 months ago

> Has there been any change of Hcal configuration for number of TS in the digi recently?

I don't know, but just to be clear this is using old data (from run 378366~378369) back in March. I think the agreement was to try to protect it once we have mahi @ alpaka in release.

kakwok commented 2 months ago

Ah ok, then it's expected. We concluded that was a configuration error, and agreed that protection will be added in the next iteration.


mmusich commented 2 months ago

> will be added in the next iteration.

The question is about the plan (timeline) for the next iteration.