Open slava77 opened 2 days ago
A new Issue was created by @slava77.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@aloeliger @artlbv FYI
@quinnanm
I can also take a look at this when I get some time, frankly I've been meaning to for a while.
> In my test `L1TGlobalProducer::produce` takes 9.7% of the time; in the full menu it's apparently around 2.5%
in a more "realistic" setup it's more like sub-percent
(details from https://github.com/cms-sw/cmssw/pull/45631#discussion_r1704704497)
assign l1
type performance-improvements
New categories assigned: l1
@aloeliger,@epalencia you have been requested to review this Pull request/Issue and eventually sign? Thanks
in a more "realistic" setup it's more like sub-percent
My 2.5% comes from a more recent test (but on the GRun menu) by @bdanzi. The odd part there is that, e.g., in the first iteration `L1TGlobalProducer` takes under 1% in the first 6 jobs, but around 6% in the last two.
> My 2.5% comes from a more recent test (but on the GRun menu) by @bdanzi
I'd rather trust the manual measurement more than what comes from the timing server, tbh.
essentially all of the cost is in `dlopen` and `dlclose` calls.
https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/v1.1.3/src/hls4ml/emulator.cc#L12-L22
`create_model` has orders of magnitude lower cost; perhaps just having the .so file stay loaded from the constructor or beginJob is enough.
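For illustration, a minimal sketch of that idea, assuming the hls4mlEmulatorExtras API linked above (`hls4mlEmulator::ModelLoader`, where `load_model()` does the `dlopen` and the destructor does the `dlclose`); the `CachedAxoModel` class and the `modelVersion` name are hypothetical:

```cpp
#include <memory>
#include <string>

#include "hls4ml/emulator.h"  // hls4mlEmulator::ModelLoader / Model

// Current pattern: a ModelLoader is created and destroyed around every
// evaluation, so each call pays for dlopen (in load_model) and dlclose
// (in ~ModelLoader).
void evaluatePerCall(const std::string& modelVersion) {
  hls4mlEmulator::ModelLoader loader(modelVersion);
  std::shared_ptr<hls4mlEmulator::Model> model = loader.load_model();  // dlopen + create_model
  // ... use model->predict() ...
}  // loader goes out of scope -> dlclose

// Cached pattern: the loader (and thus the mapped .so) lives as long as its
// owner, e.g. a condition or producer constructed once per job/stream.
class CachedAxoModel {
public:
  explicit CachedAxoModel(const std::string& modelVersion)
      : loader_(modelVersion), model_(loader_.load_model()) {}  // dlopen once

  hls4mlEmulator::Model& model() { return *model_; }

private:
  hls4mlEmulator::ModelLoader loader_;            // keeps the shared library mapped
  std::shared_ptr<hls4mlEmulator::Model> model_;  // reused across evaluations
};
```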
> IIUC, the load and destruction of the model happens 10 times per event; I don't see any dependence on the current event variables.
The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible that would help of course and in principle the model(s) are known at the beginning of the job as they are fixed in the L1 menu.
IMO it would be good to have a common approach to loading HLS4ML models within L1/CMSSW, as it seems to be done differently elsewhere, e.g. here: https://github.com/cms-sw/cmssw/blob/CMSSW_14_2_X/L1Trigger/Phase2L1ParticleFlow/interface/JetId.h#L50-L53
The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible that would help of course and in principle the model(s) are known at the beginning of the job as they are fixed in the L1 menu.
Given that the interface of `hls4mlEmulator::Model` is not const, the best one can do is to call `ModelLoader::load_model()` once per module per stream (in the constructor of an `edm::stream` module, or using `StreamCache` with an `edm::global` module, or in the constructor of an `edm::one` module).
This is how https://github.com/cms-sw/cmssw/blob/92333e3acb2db4776c5772fbd71e3ef325a9b46e/L1Trigger/Phase2L1ParticleFlow/interface/JetId.h#L50-L53 is used in https://github.com/cms-sw/cmssw/blob/92333e3acb2db4776c5772fbd71e3ef325a9b46e/L1Trigger/Phase2L1ParticleFlow/plugins/L1BJetProducer.cc#L51-L52
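For concreteness, a minimal sketch of the per-stream option described above, assuming the hls4mlEmulatorExtras API; the module name and the `modelVersion` parameter are illustrative, not the actual AXOL1TL code:

```cpp
#include <memory>
#include <string>

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "hls4ml/emulator.h"

class AXOL1TLStreamProducerSketch : public edm::stream::EDProducer<> {
public:
  explicit AXOL1TLStreamProducerSketch(edm::ParameterSet const& cfg)
      : loader_(cfg.getParameter<std::string>("modelVersion")),
        model_(loader_.load_model()) {}  // dlopen happens once per stream, here

  void produce(edm::Event&, edm::EventSetup const&) override {
    // Reuse model_ for every threshold and BX of this event; since
    // hls4mlEmulator::Model is not const, each stream keeps its own instance.
  }

private:
  hls4mlEmulator::ModelLoader loader_;            // keeps the .so loaded until the stream is destroyed
  std::shared_ptr<hls4mlEmulator::Model> model_;  // per-stream model instance
};
```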
(Now, if the `hls4mlEmulator::Model` interface were const and thread-efficient, we could avoid the per-stream copies of the `hls4mlEmulator::Model`.)
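If the interface ever becomes const and thread-safe, a single instance could in principle be shared across streams through a `GlobalCache`; a purely hypothetical sketch (all names illustrative, and not viable today precisely because `Model` is not const):

```cpp
#include <memory>
#include <string>

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "hls4ml/emulator.h"

// Job-wide cache: one ModelLoader (one dlopen) and one model for all streams.
struct AxoModelCache {
  explicit AxoModelCache(std::string const& modelVersion)
      : loader(modelVersion), model(loader.load_model()) {}
  hls4mlEmulator::ModelLoader loader;
  std::shared_ptr<hls4mlEmulator::Model> model;
};

class AXOL1TLGlobalProducerSketch : public edm::global::EDProducer<edm::GlobalCache<AxoModelCache>> {
public:
  AXOL1TLGlobalProducerSketch(edm::ParameterSet const&, AxoModelCache const*) {}

  static std::unique_ptr<AxoModelCache> initializeGlobalCache(edm::ParameterSet const& cfg) {
    return std::make_unique<AxoModelCache>(cfg.getParameter<std::string>("modelVersion"));
  }

  static void globalEndJob(AxoModelCache*) {}

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    // globalCache()->model would be used concurrently by all streams here,
    // which is only safe with a const, thread-efficient Model interface.
  }
};
```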
Looking at a profile of HLT in 14_1_X with callgrind, on MC running only `MC_ReducedIterativeTracking_v22`, I see that 78% of `L1TGlobalProducer::produce` is spent in `l1t::AXOL1TLCondition::evaluateCondition`
https://github.com/cms-sw/cmssw/blob/CMSSW_14_1_0_pre5/L1Trigger/L1TGlobal/src/AXOL1TLCondition.cc#L100

In my test `L1TGlobalProducer::produce` takes 9.7% of the time; in the full menu it's apparently around ~~2.5%~~ 0.9% (updated to 0.9%, see notes below).

Of all the time spent in `l1t::AXOL1TLCondition::evaluateCondition`:

- `hls4mlEmulator::ModelLoader::load_model()` is 54%
- `hls4mlEmulator::ModelLoader::~ModelLoader()` is 30%
- `GTADModel_emulator_v4::predict()` is 15%

IIUC, the load and destruction of the model happens 10 times per event; I don't see any dependence on the current event variables. Some kind of caching may be useful to get HLT to run a bit faster (seems like ~~1.5%~~ (updated) 0.6% or so).