cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[ARM] Assertion failure in gpuVertexFinder in 11634.24 #37820

Open makortel opened 2 years ago

makortel commented 2 years ago

Workflow 11634.24 step 2 has been failing on el8_aarch64_gcc10 at least since CMSSW_12_4_X_2022-04-28-2300 with

cmsRun: /data/cmsbuild/jenkins_a/workspace/build-any-ib/w/tmp/BUILDROOT/28e97d506f1bae1e45437cea84c399e8/opt/cmssw/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:73: void gpuVertexFinder::fitVertices(gpuVertexFinder::ZVertices*, gpuVertexFinder::WorkSpace*, float): Assertion `wv[i] > 0.f' failed.

Thread 1 (Thread 0x400040a8b730 (LWP 2277007) "cmsRun"):
#3  0x00004000430d8600 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000400040e52c74 in raise () from /lib64/libc.so.6
#6  0x0000400040e4096c in abort () from /lib64/libc.so.6
#7  0x0000400040e4c4c4 in __assert_fail_base () from /lib64/libc.so.6
#8  0x0000400040e4c530 in __assert_fail () from /lib64/libc.so.6
#9  0x00004000f25e0edc in gpuVertexFinder::Producer::make(TrackSoAHeterogeneousT<32768> const*, float, float) const () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#10 0x00004000f25d6100 in PixelVertexProducerCUDA::produceOnCPU(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#11 0x000040003ef70c50 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#12 0x000040003ef6798c in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#13 0x000040003eec3a28 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#14 0x000040003eec3d74 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#15 0x000040003eec653c in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#16 0x000040003f456ed8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/nweek-02731/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-04-2300/lib/el8_aarch64_gcc10/libFWCoreConcurrency.so

Current Modules:
Module: PixelVertexProducerCUDA:hltPixelVerticesSoA@cpu (crashed)
Module: L1TGlobalProducer:hltGtStage2ObjectMap
Module: EcalRawToDigi:hltEcalDigisLegacy
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc10/CMSSW_12_4_X_2022-05-04-2300/pyRelValMatrixLogs/run/11634.24_TTbar_14TeV+2021_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+RecoNano+HARVESTNano+ALCA/step2_TTbar_14TeV+2021_0T+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+RecoNano+HARVESTNano+ALCA.log
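
For readers unfamiliar with the failing check, here is a minimal sketch of the kind of condition `wv[i] > 0.f` guards. It assumes that `wv[i]` holds the summed track weights of vertex `i` in an inverse-variance weighted mean; that is an assumption about the role of `wv`, not a copy of gpuFitVertices.h. The function name `fit_vertices`, its arguments, and the convention that a negative index marks an unassigned track are all hypothetical.

```python
def fit_vertices(z, ez2, assignment, n_clusters):
    """Weighted-mean z position per cluster (illustrative sketch only).

    z          -- track z positions
    ez2        -- squared uncertainties on z (assumed non-zero)
    assignment -- cluster index per track; negative = unassigned (hypothetical convention)
    """
    zv = [0.0] * n_clusters  # weighted sum of z, later the fitted position
    wv = [0.0] * n_clusters  # sum of weights per cluster
    for zi, e2, iv in zip(z, ez2, assignment):
        if iv < 0:
            continue  # skip unassigned tracks
        w = 1.0 / e2
        zv[iv] += zi * w
        wv[iv] += w
    for i in range(n_clusters):
        # Analogue of `assert(wv[i] > 0.f)`: the comparison is false both for
        # a cluster that received no tracks (wv == 0) and for a NaN weight sum.
        if not wv[i] > 0.0:
            raise AssertionError(f"wv[{i}] > 0 failed (wv = {wv[i]})")
        zv[i] /= wv[i]
    return zv, wv
```

In this sketch the check can fail in two ways: a cluster index that no track was assigned to, or a NaN entering the weight sum. Whether either of these is what happens on aarch64 is not established in this thread.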

makortel commented 2 years ago

assign reconstruction, heterogeneous

cmsbuild commented 2 years ago

New categories assigned: heterogeneous,reconstruction

@jpata,@slava77,@fwyzard,@clacaputo,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

FYI @VinInn @AdrianoDee

fwyzard commented 2 years ago

This is a CPU-only workflow, right ?

makortel commented 2 years ago

> This is a CPU-only workflow, right ?

I think so. At least the SwitchProducer is using the @cpu case.

jpata commented 2 years ago

type tracking

mmusich commented 2 years ago

I think the type here should be tracking and not trk (vertexing is under tracking)

aandvalenzuela commented 1 year ago

Hello, just to keep track of this issue :) This assertion failure is still present in the current release cycle:

cmsRun: /data/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/f4101ca38f0ff520e5922918c7986929/opt/cmssw/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-02-19-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:70: void gpuVertexFinder::fitVertices(gpuVertexFinder::VtxSoAView&, gpuVertexFinder::WsSoAView&, float): Assertion `wv[i] > 0.f' failed.

See the most recent stack trace. It is also present in the LTO IBs, since we now build those for ARM.

fwyzard commented 3 months ago

I assumed this stopped failing once the HLT menu for this workflow was moved to the "fake" menu ?

makortel commented 3 months ago

> I assumed this stopped failing once the HLT menu for this workflow was moved to the "fake" menu ?

Quite possible. On a quick look I didn't see this particular error in the IBs of the past two weeks, but I also don't recall how frequent the failure was.

mmusich commented 3 months ago

> I assumed this stopped failing once the HLT menu for this workflow was moved to the "fake" menu ?

I guess we can make it reappear real quick by allowing 2024 here:

https://github.com/cms-sw/cmssw/blob/643935aa315faaa679fb06b97a8cf25f3713ef1d/Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py#L2188

fwyzard commented 3 months ago

Do you think 12834.402 should also trigger the issue ? I can try running that by hand on lxplus-arm (ARM Neoverse-N1) to check.

fwyzard commented 3 months ago

12834.402 does not seem to reproduce the issue, or at least not easily: I've run its step2 over 20 times on 100 events without problems on lxplus-arm.
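
For anyone who wants to repeat this kind of check, a rough sketch of a repeat-and-count loop follows. It is not the procedure actually used above; the helper name `count_failures` is hypothetical, and the command to run (for example the step2 cmsRun invocation) has to be supplied by the user.

```python
import subprocess
import sys


def count_failures(cmd, repeats):
    """Run `cmd` (a list of arguments) `repeats` times.

    Returns the number of runs that ended with a non-zero exit status.
    A process killed by a signal (such as the abort raised by a failed
    assert) has a negative return code and is counted as a failure.
    """
    failures = 0
    for _ in range(repeats):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Example with a placeholder command that always succeeds;
# replace it with the real step2 command line.
placeholder_cmd = [sys.executable, "-c", "pass"]
n_failed = count_failures(placeholder_cmd, 3)
```

Because an intermittent failure may need many attempts to show up, a count of zero from a loop like this only bounds the failure rate; it does not prove the problem is gone.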

mmusich commented 3 months ago

> Do you think 12834.402 should also trigger the issue ?

> 12834.402 does not seem to reproduce the issue,

I don't know if it is relevant, but the original workflow 11634.24 forces the magnetic field to be 0T.

makortel commented 2 months ago

Given all the changes (CUDA-to-Alpaka, related fixes in the Alpaka code, HLT menu updates) maybe we have reached the time to close this issue?