cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in 14_0_0 #44306

Open srimanob opened 4 months ago

srimanob commented 4 months ago

I observe a failure in L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation' in recent 14_0_0 NoPU relvals; see https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_FlatPt_15_3000HS_14__STD_2026D98_noPU_240226_205544_77

Fatal Exception (Exit code: 8022)
An exception of category 'FatalRootError' occurred while
[0] Processing Event run: 1 lumi: 432 event: 43164 stream: 2
[1] Running path 'L1TrackTrigger_step'
[2] Calling method for module L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation'
Additional Info:
[a] Fatal Root Error: @SUB=operator=(const TMatrixT &)
matrices not compatible

To reproduce the issue with CMSSW_14_0_0: Input file = /eos/cms/store/relval/CMSSW_14_0_0/RelValQCD_FlatPt_15_3000HS_14/GEN-SIM/140X_mcRun4_realistic_v1_STD_2026D98_noPU-v1/2580000/1af0b992-5804-40e9-911f-933e5c413f97.root

and cmsDriver on step2: cmsDriver.py step2 -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9 --python DigiTrigger_2026D98.py --no_exec --filein file:step1.root --fileout file:step2.root --nThreads 8 --nStreams 1 --customise_commands "process.source.lumisToProcess = cms.untracked.VLuminosityBlockRange('1:432-1:432') \n process.source.eventsToProcess = cms.untracked.VEventRange('1:43160-1:43169')"

CMSSW_14_0_0 already includes a (temporary) fix for track jet eta, https://github.com/cms-sw/cmssw/pull/43922; see the release notes at https://github.com/cms-sw/cmssw/releases/CMSSW_14_0_0

Note that there are also failures in PU relvals, for example in
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0DisplacedSUSY_14TeV__STD_2026D98_PU_240302_001633_312
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_Pt15To7000_Flat_14__STD_2026D98_PU_240302_001955_88
https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_14_0_0QCD_Pt_1800_2400_14__STD_2026D98_PU_240302_001939_1360
with the error

[0] Processing Event run: 1 lumi: 45 event: 2235 stream: 0
[1] Running path 'L1TrackTrigger_step'
[2] Calling method for module L1FPGATrackProducer/'l1tTTTracksFromExtendedTrackletEmulation'
Additional Info:
[a] Fatal Root Error: @SUB=TDecompLU::DecomposeLUCrout
matrix is singular

I have not yet reproduced the PU error, as that requires mixing samples.

cmsbuild commented 4 months ago

cms-bot internal usage

cmsbuild commented 4 months ago

A new Issue was created by @srimanob.

@rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

srimanob commented 4 months ago

assign l1

cmsbuild commented 4 months ago

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

aloeliger commented 4 months ago

@BenjaminRS

BenjaminRS commented 4 months ago

@Jingyan95 - can you have a look at this please? Is it somehow related to https://github.com/cms-sw/cmssw/issues/41357 ?

makortel commented 4 months ago

assign upgrade

cmsbuild commented 4 months ago

New categories assigned: upgrade

@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks

BenjaminRS commented 4 months ago

I should point out that I believe this is the Tracker group's code rather than L1Trigger

aehart commented 4 months ago

I was able to fix the first exception in #44427.

If there is a recipe for the exception in PU relvals, I can try to debug that one as well.

srimanob commented 3 months ago

Hi @aehart, thanks. For PU, it is a bit difficult to start from GEN-SIM because the mixing is random. The way forward seems to be to re-run L1 on top of RAW (which skipped L1 the first time, so the issue is still there).

aehart commented 3 months ago

I was able to reproduce the exception seen in PU relvals by copying the PSet.pkl from one of the job logs. I traced this to a numerical stability issue, which I fixed in #44471.

Once that is merged, I think this issue is resolved, as far as I can see.

srimanob commented 2 months ago

We still see the issue in CMSSW_14_1_0_pre3, where https://github.com/cms-sw/cmssw/pull/44471 was merged (see the release log). See the reports in https://github.com/cms-sw/cmssw/pull/44471#issuecomment-2096772326 and https://github.com/cms-sw/cmssw/pull/44471#issuecomment-2096810504

srimanob commented 2 months ago

Note that something looks very strange to me. We don't see this issue at all in 14_0_6, while we see many failed jobs in 14_1_0_pre3. The only issue I see in 14_0_6 is the TripleMU_i84 NULL pointer, which I have raised with L1T separately. From my check, I don't see this L1FPGATrackProducer/l1tTTTracksFrom.. failure there at all.

skinnari commented 2 months ago

@srimanob is there a recipe for how to reproduce the crash? it is difficult for us to debug otherwise.

a note on the releases -- there are some possibly relevant PRs that were included in 14_1_0_pre3, that are not in 14_0_6.

srimanob commented 2 months ago

Hi @skinnari

Here is the recipe,

cmsrel CMSSW_14_1_0_pre3
cd CMSSW_14_1_0_pre3/src/
cmsenv
cmsDriver.py step2 -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T33 --datatier GEN-SIM-DIGI-RAW -n -1 --eventcontent FEVTDEBUGHLT --geometry Extended2026D110 --era Phase2C17I13M9 --python step2.py --no_exec --filein file:step1.root --fileout file:step2.root --nThreads 8 --nStreams 2
ln -s /eos/cms/store/group/offcomp_upgrade-sw/srimanob/L1T/1410pre3-debug/step1-13.root ./step1.root
cmsRun step2.py

srimanob commented 2 months ago

With a private production, I confirm that the crash seems to appear only in 14_1; I don't see it when I produce the sample with 14_0_6.

aehart commented 2 months ago

I was able to reproduce the crash locally in 14_1_0_pre3, and with debugging symbols, the backtrace points to this line: https://github.com/cms-sw/cmssw/blob/43944b8074465f15907d5f89d6d24e2cb1f6bc86/L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc#L197

I can't see how this line could be the cause though, so I guess there is some kind of memory mismanagement somewhere else that is the actual cause. I will keep playing with it…

Dr15Jones commented 2 months ago

> I can't see how this line could be the cause though, so I guess there is some kind of memory mismanagement somewhere else that is the actual cause. I will keep playing with it…

The line

https://github.com/cms-sw/cmssw/blob/43944b8074465f15907d5f89d6d24e2cb1f6bc86/L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc#L194

makes use of a variable-length array (VLA), which is NOT supported by the C++ standard (though many compilers, including GCC and Clang, accept it as an extension). The problem is that this can use a lot of stack memory and can exceed the space allowed for the stack. Try switching to a dynamic container to see if that solves the problem.
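As a rough illustration of that suggestion, here is a minimal sketch of a run-time-sized square table of flags stored in a `std::vector` instead of on the stack. `SquareFlags` and `countSet` are hypothetical names invented for this example and are not taken from PurgeDuplicate.cc; the assumption is only that the problematic array is a square table of booleans. With `n = 3000`, for instance, the stack version would need about 9 MB, which is more than the default stack limit on many Linux systems (often around 8 MB).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a run-time-sized square table of flags.
// The non-standard form would be:  bool table[n][n];  (stack storage)
class SquareFlags {
public:
  // One flat, zero-initialised heap buffer holding n * n entries.
  explicit SquareFlags(std::size_t n) : n_(n), data_(n * n, '\0') {}

  bool get(std::size_t i, std::size_t j) const {
    assert(i < n_ && j < n_);
    return data_[i * n_ + j] != 0;
  }

  void set(std::size_t i, std::size_t j, bool value) {
    assert(i < n_ && j < n_);
    data_[i * n_ + j] = value ? 1 : 0;
  }

  std::size_t size() const { return n_; }

private:
  std::size_t n_;
  std::vector<char> data_;  // char avoids the bit-packed std::vector<bool>
};

// Number of entries currently set to true.
std::size_t countSet(const SquareFlags& flags) {
  std::size_t total = 0;
  for (std::size_t i = 0; i < flags.size(); ++i)
    for (std::size_t j = 0; j < flags.size(); ++j)
      if (flags.get(i, j))
        ++total;
  return total;
}
```

A nested `std::vector<std::vector<bool>>` would work as well; the flat buffer is used here only because it keeps the table in a single allocation.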

aehart commented 2 months ago

> Try switching to a dynamic container to see if that solves the problem.

This seems to be a good suggestion. I switched dupMap (and also noMerge) from C-style arrays to vectors: https://github.com/cms-sw/cmssw/compare/CMSSW_14_1_0_pre3...aehart:cmssw:2ecd340123eb2efb73108d92bf8a799d4563362a

With this, the job that previously crashed is able to run to completion.

If this seems like a reasonable fix, I can open a PR right away.

srimanob commented 2 months ago

Thanks very much @Dr15Jones @aehart for the suggestion and the test. Do you somehow understand why it does not happen in 14_0? Are we just about at the limit in 14_1 due to some modules?

(1) However, it seems to be on the safe side to backport this to 14_0, right? (2) Is this the only place that uses a variable-sized array in L1T code? Could this be reviewed and fixed overall? @aloeliger @epalencia

Thx.

aehart commented 2 months ago

> Do you somehow understand why it does not happen in 14_0? Are we just about at the limit in 14_1 due to some modules?

That's my guess, although it could also be related to removing the bins used in the PurgeDuplicate class (https://github.com/cms-sw/cmssw/commit/f68c199d572b0c64cc969833398ef1268b248d31). The value of numStublists would be smaller in each of the bins we had before, so these problematic arrays would also be smaller. I haven't tested that this is why we don't see this problem in 14_0, but I think it makes sense.
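The size argument can be made concrete with a small sketch. It assumes, purely for illustration, that the array is a square table of `bool` with one row and one column per stublist, and that the stublists were spread evenly over the bins; the function names and the figures in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed by a hypothetical n x n table of bool flags.
std::size_t squareTableBytes(std::size_t n) {
  return n * n * sizeof(bool);
}

// Bytes needed by the table of a single bin when n stublists are spread
// evenly over `bins` bins (rounded up), each bin owning its own table.
std::size_t perBinTableBytes(std::size_t n, std::size_t bins) {
  assert(bins > 0);
  const std::size_t perBin = (n + bins - 1) / bins;
  return squareTableBytes(perBin);
}
```

With 4000 stublists and one-byte `bool`, `squareTableBytes(4000)` is 16,000,000 bytes, whereas `perBinTableBytes(4000, 8)` is 250,000 bytes. Because the size grows quadratically, merging the bins into one table would increase the stack demand far more than linearly.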

> (1) However, it seems to be on the safe side to backport this to 14_0, right?

There's no harm in backporting this to 14_0, so I can do that as well.

> (2) Is this the only place that uses a variable-sized array in L1T code? Could this be reviewed and fixed overall?

I've only checked the L1Trigger/TrackFinding* packages by recompiling them with the -Werror=vla flag, but there seem to be no more instances of this particular problem there.
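For readers unfamiliar with the flag: GCC's `-Wvla` warns about any variable-length array, and `-Werror=vla` turns that warning into a compile error. The sketch below shows the kind of line the flag rejects (left as a comment) next to a standard replacement. `markRepeats` is a hypothetical helper written for this example, not code from CMSSW.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flags every entry of `ids` that repeats an earlier one.
std::vector<bool> markRepeats(const std::vector<int>& ids) {
  // bool repeated[ids.size()];  // VLA: rejected when compiling with -Werror=vla
  std::vector<bool> repeated(ids.size(), false);  // standard, heap-allocated
  for (std::size_t i = 1; i < ids.size(); ++i) {
    for (std::size_t j = 0; j < i; ++j) {
      if (ids[i] == ids[j]) {
        repeated[i] = true;
        break;
      }
    }
  }
  assert(repeated.size() == ids.size());
  return repeated;
}
```

Returning the vector by value also removes the need for the caller to know the size in advance.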

aehart commented 2 months ago

> I've only checked the L1Trigger/TrackFinding* packages by recompiling them with the -Werror=vla flag, but there seem to be no more instances of this particular problem there.

Just for fun, here is a table of all variable-length arrays in L1Trigger in CMSSW_14_1_0_pre3. I leave it to the experts of other subpackages to fix them, but hopefully this is a useful starting point.

| File name | Line number | Name of offending array |
| --- | --- | --- |
| L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc | 1004 | useFit |
| L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc | 122 | useFitSL1 |
| L1Trigger/DTTriggerPhase2/src/MuonPathAssociator.cc | 125 | useFitSL3 |
| L1Trigger/L1TCaloLayer1/src/UCTRegion.cc | 132 | activeTower |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 264 | idxMu |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 265 | muPtSorted |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 333 | idxEg |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 334 | egPtSorted |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 364 | idxTau |
| L1Trigger/L1TGlobal/plugins/GenToInputProducer.cc | 365 | tauPtSorted |
| L1Trigger/L1TGlobal/src/CorrCondition.cc | 369 | InvDeltaRSqLUT |
| L1Trigger/L1TGlobal/src/CorrCondition.cc | 370 | temp_InvDeltaRSq |
| L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc | 253 | isSeed |
| L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc | 372 | toRemove |
| L1Trigger/L1THGCal/src/backend/HGCalClusteringImpl.cc | 44 | isSeed |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc | 150 | epbins_default |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc | 196 | epbins |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetProducer.cc | 140 | epbins_default |
| L1Trigger/L1TTrackMatch/plugins/L1TrackJetProducer.cc | 179 | epbins |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 290 | work |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 304 | halfsorted |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 304 | work |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_hybrid_sort_ref.h | 333 | tomerge |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 113 | OutTmp |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 128 | outTmp2 |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 67 | out2 |
| L1Trigger/Phase2L1ParticleFlow/interface/common/bitonic_sort_ref.h | 70 | out3 |
| L1Trigger/Phase2L1ParticleFlow/src/L1TCorrelatorLayer1PatternFileWriter.cc | 325 | ret |
| L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc | 194 | dupMap |
| L1Trigger/TrackFindingTracklet/src/PurgeDuplicate.cc | 202 | noMerge |

srimanob commented 2 months ago

Thanks @aehart. I opened the issue https://github.com/cms-sw/cmssw/issues/44937 to follow up, so we can close this one once there are no more crashes in relvals (i.e. in the next pre-release).