cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Partial stats in outputs of online-DQM client `hlt_clientPB` after a run ends #43710

Open missirol opened 8 months ago

missirol commented 8 months ago

A few weeks ago, @sanuvarghese (TSG/STEAM) noticed that some of the HLT-related plots in the online DQM GUI showed incorrect values.

This problem affects all the DQM histograms produced via the DQMHistograms stream. These include histograms/profiles used to monitor the CPU performance of the HLT (e.g. timing and throughput) in the online DQM.

Below are some notes shared by @fwyzard.

When the DQMHistograms stream was redesigned during Run-2, the behaviour that we were aiming for was to

  • fill the histograms inside the HLT jobs
  • every lumisection, the DQMFileSaver would write the content of all DQM histograms to a file (in protobuf format, but this is a detail) and reset the histograms in memory
  • the files would be merged across the whole farm and sent to the DQM jobs
  • a DQM job would read the lumisection worth of histograms and add it to the histograms in memory

This way, if some lumisections were missing or arrived out of order, the bulk of the histograms would still be correct. If a lumisection arrived much later than the others (for example, because of a long-running event at HLT), its content would simply go into the right place. And, even though we never anticipated it, it would have prevented the problem we are seeing today :-/
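
For illustration, here is a minimal, self-contained sketch of the snapshot-and-accumulate behaviour described above. It does not use the actual CMSSW/DQM classes (DQMFileSaver, the protobuf files, fastHadd); the types and function names are hypothetical, and the point is only the add-rather-than-replace semantics on the receiving side.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Histogram = std::vector<long long>;               // bin contents only, for brevity
using HistogramSet = std::map<std::string, Histogram>;  // histogram name -> bin contents

// HLT side: copy out the current contents and reset them (one snapshot per lumisection).
HistogramSet snapshotAndReset(HistogramSet& live) {
  HistogramSet out = live;
  for (auto& entry : live)
    std::fill(entry.second.begin(), entry.second.end(), 0LL);
  return out;
}

// DQM side: add a received snapshot to the in-memory totals (never replace them).
void accumulate(HistogramSet& totals, HistogramSet const& snapshot) {
  for (auto const& entry : snapshot) {
    Histogram& t = totals[entry.first];
    if (t.size() < entry.second.size())
      t.resize(entry.second.size(), 0LL);
    for (std::size_t i = 0; i < entry.second.size(); ++i)
      t[i] += entry.second[i];
  }
}

int main() {
  HistogramSet live{{"hltThroughput", {0, 0, 0}}};
  HistogramSet totals;

  // Two lumisections worth of entries, possibly delivered to the DQM out of order.
  live["hltThroughput"] = {5, 2, 1};
  HistogramSet ls2 = snapshotAndReset(live);
  live["hltThroughput"] = {4, 3, 0};
  HistogramSet ls1 = snapshotAndReset(live);

  accumulate(totals, ls2);  // the order of arrival does not matter:
  accumulate(totals, ls1);  // each snapshot just adds its own counts

  for (long long c : totals["hltThroughput"])
    std::cout << c << ' ';  // prints: 9 5 1
  std::cout << '\n';
}
```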

In addition, based on checks he did, @fwyzard reported:

[..]

  • the .pb files that HLT and the merger system send to the DQM during the run are correct
  • the final plots stored in the DQM GUI are somewhat incomplete.

Our interpretation is that the final plots are those that were received for the last lumisection.

When the run is stopped with a fill still ongoing, the L1T rate is still pretty high, and most or all HLT processes do receive some data for that last lumisection. In this case, (almost) all HLT processes will produce a DQMHistograms file, and the (almost) full statistics will be available in the merged .pb file, and so in the DQM GUI.

When the run is stopped after the beams have been dumped, the L1T rate is going to be very low. [..] As the HLT nominally has 1600 jobs, only a small fraction of them will process any data during the last lumisection and produce a .pb file. The merger will pick up only those files, so the merged .pb file will contain only a small part of the original DQM histogram counts. When this .pb file is read by the GUI, its content will replace that of the intermediate histograms, and the end result will be histograms with only partial statistics (except for the last lumisection itself, which will be complete by construction).
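For illustration, with made-up numbers: if 1600 HLT processes contributed roughly equally to the histograms during the run, but only 100 of them receive events in the last lumisection, then only those 100 write a .pb file for it. The merged .pb file for that lumisection covers roughly 100/1600 ≈ 6% of the processes, and once its content replaces the intermediate histograms in the GUI, roughly 94% of the accumulated statistics is lost, while the last lumisection itself remains complete.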

DAQ, DQM and HLT are already in contact to find a solution. I'm opening this issue for documentation purposes, and in case it helps to discuss technical aspects of the problem.

FYI: @smorovic @cms-sw/hlt-l2

cmsbuild commented 8 months ago

cms-bot internal usage

cmsbuild commented 8 months ago

A new Issue was created by @missirol Marino Missiroli.

@sextonkennedy, @smuzaffar, @rappoccio, @Dr15Jones, @makortel, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 8 months ago

assign daq, dqm, hlt

cmsbuild commented 8 months ago

New categories assigned: daq,dqm,hlt

@Martin-Grunewald, @mmusich, @emeschi, @rvenditti, @syuvivida, @tjavaid, @nothingface0, @antoniovagnerini, @smorovic you have been requested to review this Pull request/Issue and eventually sign. Thanks

smorovic commented 8 months ago

In the short term, we plan to cache the output of each process in hltd and in the merger service that handles HLT output (i.e. at every stage of fastHadd merging), so that a full set of process files is always merged, avoiding missing statistics when a fraction of jobs does not produce output in a lumisection. The latest file written out by a process will be cached, on the assumption that the last file has the most up-to-date statistics even if lumisections are not closed in the same order they are opened in the HLT job.
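
A minimal sketch of this caching idea, with hypothetical names and types (this is not the actual hltd/merger code): the merger remembers the most recent DQM file from each process, and when assembling the files to merge for a lumisection it falls back to the cached file for every process that produced no output.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical descriptor of one per-lumisection DQM (.pb) file from one HLT process.
struct DQMFile {
  int lumisection = 0;
  std::string path;
};

class CachingMerger {
public:
  // Called whenever an HLT process delivers a DQM file for a lumisection.
  void onFileReceived(std::string const& processId, DQMFile const& file) {
    received_[file.lumisection][processId] = file;
    cache_[processId] = file;  // remember the latest output of each process
  }

  // Files to merge for one lumisection: real outputs where available,
  // otherwise the cached latest output of each known process.
  std::vector<DQMFile> filesToMerge(int lumisection) const {
    std::vector<DQMFile> out;
    auto const ls = received_.find(lumisection);
    for (auto const& entry : cache_) {
      std::string const& processId = entry.first;
      if (ls != received_.end() && ls->second.count(processId) != 0)
        out.push_back(ls->second.at(processId));
      else
        out.push_back(entry.second);  // fall back to the cached file for this process
    }
    return out;
  }

private:
  std::map<int, std::map<std::string, DQMFile>> received_;  // LS -> process -> file
  std::map<std::string, DQMFile> cache_;                    // process -> latest file seen
};
```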

fwyzard commented 8 months ago

The latest file written out by a process will be cached, on the assumption that the last file has the most up-to-date statistics even if lumisections are not closed in the same order they are opened in the HLT job.

"latest" as in "most recent in time" or as in "highest lumisection number" ?

fwyzard commented 8 months ago

(I think we should use the "highest lumisection number" processed by each job)

smorovic commented 8 months ago

I assumed that the most recent output from a job will have more up-to-date statistics. When N+1 is closed before N, won't the N version of the histograms be filled for both N+1 and N? If so, taking the most recent output will be more complete.

Orthogonal to which way is correct, it's not a big difference, and either way it is less serious than the current problem: the worst case is that a small number of lumisections (towards the end) remain incomplete. Actually, the framework (still) has to cycle all streams through all lumisections, and in practice they will get closed in the same order as they are opened, so it should be the same...
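
To make the difference between the two options concrete, here is a sketch of the two cache-update policies being discussed (hypothetical names and types, not the real hltd code); as noted above, when lumisections are closed in order the two choices coincide.

```cpp
#include <map>
#include <string>

// Same hypothetical per-process DQM file descriptor as in the sketch above.
struct DQMFile {
  int lumisection = 0;
  std::string path;
};

using Cache = std::map<std::string, DQMFile>;  // process id -> cached file

// "Most recent in time": the last file written by a process always wins,
// regardless of its lumisection number.
void updateCacheMostRecent(Cache& cache, std::string const& processId, DQMFile const& file) {
  cache[processId] = file;
}

// "Highest lumisection number": only overwrite when the new file belongs to a later LS,
// so an out-of-order, lower-numbered lumisection never replaces the cached entry.
void updateCacheHighestLS(Cache& cache, std::string const& processId, DQMFile const& file) {
  auto it = cache.find(processId);
  if (it == cache.end() || file.lumisection > it->second.lumisection)
    cache[processId] = file;
}
```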

fwyzard commented 5 months ago

The impact of the interplay between the current DQM and merger systems can easily be spotted in the DQM plots from the first throughput measurement with 1200b done on 14/04/2024.

DQM plot observed online while the run was ongoing:

[plot: "Throughput (retired)"]

same DQM plot observed online after the end of the run and the final merge:

[plot: same histogram after the final merge]

link to the online GUI

The second plot shows only partial statistics, corresponding to a loss of about 13% of the FUs (or BUs).

fwyzard commented 5 months ago

I understand that @smorovic has implemented a workaround in the micro-merger step, where the latest DQM plots from each job are kept and used if no plots are produced for a given lumisection.

In this case, the remaining loss in statistics seems to be due to the mini- and macro-merger steps.

missirol commented 3 weeks ago

Coming back (late) to this, I understood the following: caching of the latest fastHadd output of each HLT process was introduced in hltd at the micro-merging level [1], and the mergers were subsequently updated to cache the DQMHistogram files from all FUs/RUBUs [2].

This effectively solves the issue. I checked a few recent runs, and I could not find evidence of DQM histograms with missing entries (I looked at the rate of "Calibration" events, which is constant in collision runs, and it was one of the plots where the issue was easiest to see).

In principle, this issue could be closed, unless experts prefer to use it to discuss any further improvements.


[1] From DAQ weekly reports on Jan-29, 2024.

Prepared fastHadd caching in hltd (for the next version):

  • Always merge last produced output (last LS with events) of each CMSSW process (including crashed jobs) to have complete statistics in runs ending with low rate.
  • Currently only micro-merging level is covered - any FU having events in a LS will produce output

[2] See DAQ weekly reports on Jun-6, 2024.

Mergers updated to cache DQMHistogram files from all FUs/RUBUs

mmusich commented 3 weeks ago

+hlt