cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

Move away from variable-length array #44937

Open srimanob opened 5 months ago

srimanob commented 5 months ago

This is a follow up of https://github.com/cms-sw/cmssw/issues/44306 where we see the crash which cen be solved by moving away from variable-length array.

          > I've only checked the L1Trigger/TrackFinding* packages by recompiling them with the `-Werror=vla` flag, but there seem to be no more instances of this particular problem there.

Just for fun, here is a table of all variable-length arrays in L1Trigger in CMSSW_14_1_0_pre3. I leave it to the experts of other subpackages to fix them, but hopefully this is a useful starting point.

File name Line number Name of offending array
L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc 1004 useFit
L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc 122 useFitSL1
L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc 125 useFitSL3
L1Trigger/L1TCaloLayer1/src/UCTRegion.cc 132 activeTower
L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc 264 idxMu
L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc 265 muPtSorted
L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc 333 idxEg
L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc 334 egPtSorted
L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc 364 idxTau
L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc 365 tauPtSorted
L1Trigger/L1TGlobal/src/CorrCondition.cc 369 InvDeltaRSqLUT
L1Trigger/L1TGlobal/src/CorrCondition.cc 370 temp_InvDeltaRSq
L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc 253 isSeed
L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc 372 toRemove
L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc 44 isSeed
L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc 150 epbins_default
L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc 196 epbins
L1Trigger/L1TTrackMatch/plugins/L1TrackJetProducer.cc 140 epbins_default
L1Trigger/L1TTrackMatch/plugins/L1TrackJetProducer.cc 179 epbins
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h 290 work
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h 304 halfsorted
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h 304 work
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h 333 tomerge
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h 113 OutTmp
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h 128 outTmp2
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h 67 out2
L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h 70 out3
L1Trigger/Phase2L1ParticleFlow/src/L1TCorrelatorLayer1PatternFileWriter.cc 325 ret
L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc 194 dupMap
L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc 202 noMerge

Originally posted by @aehart in https://github.com/cms-sw/cmssw/issues/44306#issuecomment-2101264168

cmsbuild commented 5 months ago

cms-bot internal usage

cmsbuild commented 5 months ago

A new Issue was created by @srimanob.

@sextonkennedy, @rappoccio, @makortel, @antoniovilela, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 5 months ago

I'd suspect the reason for the failure reported in https://github.com/cms-sw/cmssw/issues/44306 was the array in stack became large, rather than the array being variable-length itself.

As mentioned in https://github.com/cms-sw/cmssw/issues/44306#issuecomment-2100666666, VLA is a non-standard extension. From the past I recall e.g. tracking code uses https://github.com/cms-sw/cmssw/blob/2169b5684eb29db8aa2e60eded3eedfc3479df80/CommonTools/Utils/interface/DynArray.h#L4-L5 for performance reasons (but I can't tell on the top of my head how big the impact of moving to e.g. std::vector would be there).

Is the scope of this issue to remove VLA in L1T code, or everywhere in CMSSW?

srimanob commented 5 months ago

Hi @makortel Thanks for the comment. My initial thought is how risk we are on this VLA, as this is just a pop-up issues in L1T module. It seems we will never know if we are at boundary of failure until it fails. Do we have cons for migration?

VinInn commented 5 months ago

Do we have cons for migration? YES. performance.

Just avoid those very large arrays in first place.

To be safe one can use the DynArray and then at run time initialize either with a pointer to a VLA or with one to the heap depending to to size of the allocation required...

makortel commented 5 months ago

It seems we will never know if we are at boundary of failure until it fails.

This is correct for stack overflows in general. While the problem of large arrays in stack is strictly speaking orthogonal to the question of whether we should allow VLA to be used or not, the dynamic nature makes potentially large VLAs somewhat more difficult to catch in code review, PR tests, and IB tests compared to compile-time-defined arrays.

dan131riley commented 5 months ago

There's an additional issue that large VLAs that cause problems with multi-threaded processes may not show up in single-threaded PR tests. I haven't checked if we impose the same stack size limit in both cases.

makortel commented 5 months ago

I haven't checked if we impose the same stack size limit in both cases.

All TBB threads should use the same stack size https://github.com/cms-sw/cmssw/blob/22f751f0592542ae94a797e3fc22294ddc4626cf/FWCore/Framework/bin/cmsRun.cpp#L237-L241 https://github.com/cms-sw/cmssw/blob/22f751f0592542ae94a797e3fc22294ddc4626cf/FWCore/Concurrency/src/setNThreads.cc#L20 https://github.com/cms-sw/cmssw/blob/22f751f0592542ae94a797e3fc22294ddc4626cf/FWCore/Concurrency/src/ThreadsController.cc#L20