missirol opened this issue 10 months ago
cms-bot internal usage
A new Issue was created by @missirol Marino Missiroli.
@sextonkennedy, @smuzaffar, @rappoccio, @Dr15Jones, @makortel, @antoniovilela can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign daq, dqm, hlt
New categories assigned: daq,dqm,hlt
@Martin-Grunewald,@mmusich,@emeschi,@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini,@smorovic you have been requested to review this Pull request/Issue and eventually sign/assign. Thanks
In the short term, we plan to cache the output of each process in hltd and in the merger service which handles HLT output (i.e. at every stage of fastHadd merging), and in this way always merge a full set of process files, to avoid missing statistics when a fraction of jobs doesn't produce output in a lumisection. The latest file written out by a process will be cached, on the assumption that its statistics are the most up to date, even if lumisections aren't closed in the same order they are opened in the HLT job.
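A minimal sketch of the caching scheme just described, assuming a per-process cache keyed by process id (all names here, such as `on_file_written` and `files_to_merge`, are hypothetical; this is not actual hltd code):

```python
# Hypothetical sketch of per-process output caching (not actual hltd code):
# keep the latest DQMHistograms file seen from each HLT process, and merge
# a full set of files for every lumisection, falling back to the cached
# file for processes that produced no output in that LS.

latest_file = {}  # process id -> path of its most recently written file

def on_file_written(process_id, path):
    """Record the latest DQMHistograms file written by a process."""
    latest_file[process_id] = path

def files_to_merge(files_this_ls):
    """Return one file per known process for the current lumisection.

    files_this_ls: {process_id: path} of files actually produced in this LS.
    Processes missing from it contribute their cached (stale) file instead.
    """
    merged_set = dict(latest_file)    # start from the cached files
    merged_set.update(files_this_ls)  # prefer fresh output where available
    return sorted(merged_set.values())
```

With this fallback, a lumisection in which only some processes produced output is still merged from a full set of files, at the cost of slightly stale statistics for the silent processes.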
> The latest file written out by a process will be cached, assuming statistics is most up to date for the last file even if lumisections aren't closed in the same order they are opened in the HLT job.
"latest" as in "most recent in time" or as in "highest lumisection number" ?
(I think we should use the "highest lumisection number" processed by each job)
I assumed that the most recent output from a job will have the most up-to-date statistics. When LS N+1 is closed before LS N, won't the version of the histograms written for N be filled for both N+1 and N? If that is true, then taking the most recent output will be more complete.
Orthogonal to which way is correct: it's not a big difference, and it's less serious than the current problem. The worst case is that a small number of lumisections (towards the end) remain incomplete. Actually, the framework (still) must cycle all streams through all lumis, and in practice they will get closed in the same order as they are opened, so it should be the same...
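For illustration, the two readings of "latest" discussed above can be sketched as follows (the file-name pattern is an assumption for the example, not the actual FU naming convention):

```python
import re

def by_highest_ls(files):
    """Pick the file with the highest lumisection number in its name.

    files: list of (name, mtime) pairs.
    """
    def ls_number(entry):
        m = re.search(r"_ls(\d+)_", entry[0])
        return int(m.group(1)) if m else -1
    return max(files, key=ls_number)[0]

def by_most_recent(files):
    """Pick the file with the most recent modification time."""
    return max(files, key=lambda entry: entry[1])[0]

# The two choices differ only when lumisections close out of order,
# e.g. LS3 written after LS4:
files = [("run1_ls4_DQMHistograms.pb", 100.0),
         ("run1_ls3_DQMHistograms.pb", 101.0)]
```

In practice, as noted above, streams close lumisections in the order they are opened, so the two choices almost always coincide.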
The impact of the interplay between the current DQM and merger systems can easily be spotted in the DQM plots from the first throughput measurement with 1200b done on 14/04/2024.
The second plot shows only partial statistics, corresponding to a loss of about 13% of the FUs (or BUs).
I understand that @smorovic has implemented a workaround in the micro-merger step, where the latest DQM plots from each job are kept and used if no plots are produced for a given lumisection.
In this case, the remaining loss in statistics seems to be due to the mini- and macro-merger steps.
Coming back (late) to this, I understood the following.
Earlier this year, between Jan-22 and Jan-29, DAQ (Srecko) updated the micro-merger step to cache the DQMHistograms files produced by the different CMSSW jobs on each FU [1] (I'm not sure when this update was deployed online; maybe it happened as part of CMSONS-15114).
On Jun-6 (2024), DAQ also updated the mini- and macro-merger steps in order to cache DQMHistograms files from all FUs/RUBUs [2].
This issue was also discussed in CMSONS-15074.
This effectively solves the issue. I checked a few recent runs, and I could not find evidence of DQM histograms with missing entries (I looked at the rate of "Calibration" events, which is constant in collisions runs, and it was one of the plots where the issue was easier to see).
In principle, this issue could be closed, unless experts prefer to use it to discuss any further improvements.
[1] From DAQ weekly reports on Jan-29, 2024.
Prepared fastHadd caching in hltd (for the next version):
- Always merge last produced output (last LS with events) of each CMSSW process (including crashed jobs) to have complete statistics in runs ending with low rate.
- Currently only micro-merging level is covered - any FU having events in a LS will produce output
[2] See DAQ weekly reports on Jun-6, 2024.
Mergers updated to cache DQMHistogram files from all FUs/RUBUs
A few weeks ago, @sanuvarghese (TSG/STEAM) noticed that some of the HLT-related plots in the online DQM GUI showed incorrect values. […] `hlt_clientPB`, which processes `.pb` files named `*DQMHistograms*` produced at HLT. […] `.pb` files. An example of such a plugin is `TriggerRatesMonitor`. The `FastTimerService` is another example of a module running inside the HLT and producing DQM outputs included in said `.pb` file.

[…] `*DQMHistograms*` files for those last LSs are based only on a small subset of HLT processes, and DQM ends up using this to produce the final DQM outputs (for all LSs). This means that, when the run stops, the final DQM plots may contain only the entries from a small subset of HLT processes. If this is the case, the same DQM plots were probably displaying the correct counts while that run was ongoing (as the 'merging step' was using the `*DQMHistograms*` outputs of all HLT processes, not just a subset of processes).

This problem affects all the DQM histograms produced via the `DQMHistograms` stream. These include histograms/profiles used to monitor the CPU performance of the HLT (e.g. timing and throughput) in the online DQM.

Below, some notes shared by @fwyzard, and (based on checks he did) […]
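As a toy illustration of the effect (all numbers are assumed, not taken from a real run): if only a fraction of HLT processes contribute files to the final merge, the merged counts scale down by roughly the missing fraction.

```python
def merge_counts(per_process_counts, contributing):
    """Sum the per-process event counts of the processes that produced a file."""
    return sum(per_process_counts[p] for p in contributing)

n_processes = 200
per_process = {p: 50 for p in range(n_processes)}  # 50 events per process

full = merge_counts(per_process, range(n_processes))          # all processes
partial = merge_counts(per_process, range(n_processes // 8))  # a small subset
# partial / full == 0.125: the merged plot shows only 12.5% of the true counts
```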
DAQ, DQM and HLT are already in contact to find a solution. I'm opening this issue for documentation purposes, and in case it helps to discuss technical aspects of the problem.
FYI: @smorovic @cms-sw/hlt-l2