cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Partial stats in outputs of online-DQM client `hlt_clientPB` after a run ends #43710

Open missirol opened 8 months ago

missirol commented 8 months ago

A few weeks ago, @sanuvarghese (TSG/STEAM) noticed that some of the HLT-related plots in the online DQM GUI showed incorrect values.

This problem affects all the DQM histograms produced via the DQMHistograms stream. These include histograms/profiles used to monitor the CPU performance of the HLT (e.g. timing and throughput) in the online DQM.

Below are some notes shared by @fwyzard.

When the DQMHistograms stream was redesigned during Run-2, the behaviour that we were aiming for was to

  • fill the histograms inside the HLT jobs
  • every lumisection, the DQMFileSaver would write the content of all DQM histograms to a file (in protobuf format, but this is a detail) and reset the histograms in memory
  • the files would be merged across the whole farm and sent to the DQM jobs
  • a DQM job would read the lumisection worth of histograms and add it to the histograms in memory

This way, if some lumisections were missing or arrived out of order, the bulk of the histograms would still be correct. If a lumisection arrived much later than the others (for example, because of a long-running event at HLT), its content would simply go into the right place. And, even though we never anticipated it, it would have prevented the problem we are seeing today :-/
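
For illustration, here is a minimal, self-contained sketch of the snapshot-and-accumulate behaviour described above. It does not use the actual CMSSW/DQM classes (DQMFileSaver, the protobuf files, fastHadd); the types and function names are hypothetical, and the point is only the add-rather-than-replace semantics on the receiving side.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Histogram = std::vector<long long>;               // bin contents only, for brevity
using HistogramSet = std::map<std::string, Histogram>;  // histogram name -> bin contents

// HLT side: copy out the current contents and reset them (one snapshot per lumisection).
HistogramSet snapshotAndReset(HistogramSet& live) {
  HistogramSet out = live;
  for (auto& entry : live)
    std::fill(entry.second.begin(), entry.second.end(), 0LL);
  return out;
}

// DQM side: add a received snapshot to the in-memory totals (never replace them).
void accumulate(HistogramSet& totals, HistogramSet const& snapshot) {
  for (auto const& entry : snapshot) {
    Histogram& t = totals[entry.first];
    if (t.size() < entry.second.size())
      t.resize(entry.second.size(), 0LL);
    for (std::size_t i = 0; i < entry.second.size(); ++i)
      t[i] += entry.second[i];
  }
}

int main() {
  HistogramSet live{{"hltThroughput", {0, 0, 0}}};
  HistogramSet totals;

  // Two lumisections worth of entries, possibly delivered to the DQM out of order.
  live["hltThroughput"] = {5, 2, 1};
  HistogramSet ls2 = snapshotAndReset(live);
  live["hltThroughput"] = {4, 3, 0};
  HistogramSet ls1 = snapshotAndReset(live);

  accumulate(totals, ls2);  // the order of arrival does not matter:
  accumulate(totals, ls1);  // each snapshot just adds its own counts

  for (long long c : totals["hltThroughput"])
    std::cout << c << ' ';  // prints: 9 5 1
  std::cout << '\n';
}
```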

In addition, based on checks he did, @fwyzard reported:

[..]

  • the .pb files that HLT and the merger system send to the DQM during the run are correct
  • the final plots stored in the DQM GUI are somewhat incomplete.

Our interpretation is that the final plots are those that were received for the last lumisection.

When the run is stopped with a fill still ongoing, the L1T rate is still pretty high, and most or all HLT processes do receive some data for that last lumisection. In this case, (almost) all HLT processes will produce a DQMHistograms file, and the (almost) full statistics will be available in the merged .pb file, and so in the DQM GUI.

When the run is stopped after the beams have been dumped, the L1T rate is going to be very low. [..] As the HLT nominally has 1600 jobs, only a small fraction of them will process any data during the last lumisection and produce a .pb file. The merger will pick up only those files, so the merged .pb file will contain only a small part of the original DQM histogram counts. When this .pb file is read by the GUI, its content will replace that of the intermediate histograms, and the end result will be histograms with only partial statistics (except for the last lumisection itself, which will be complete by construction).
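For illustration, with made-up numbers: if 1600 HLT processes contributed roughly equally to the histograms during the run, but only 100 of them receive events in the last lumisection, then only those 100 write a .pb file for it. The merged .pb file for that lumisection covers roughly 100/1600 ≈ 6% of the processes, and once its content replaces the intermediate histograms in the GUI, roughly 94% of the accumulated statistics is lost, while the last lumisection itself remains complete.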

DAQ, DQM and HLT are already in contact to find a solution. I'm opening this issue for documentation purposes, and in case it helps to discuss technical aspects of the problem.

FYI: @smorovic @cms-sw/hlt-l2

cmsbuild commented 8 months ago

cms-bot internal usage

cmsbuild commented 8 months ago

A new Issue was created by @missirol Marino Missiroli.

@sextonkennedy, @smuzaffar, @rappoccio, @Dr15Jones, @makortel, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 8 months ago

assign daq, dqm, hlt

cmsbuild commented 8 months ago

New categories assigned: daq,dqm,hlt

@Martin-Grunewald, @mmusich, @emeschi, @rvenditti, @syuvivida, @tjavaid, @nothingface0, @antoniovagnerini, @smorovic you have been requested to review this Pull request/Issue and eventually sign. Thanks

smorovic commented 8 months ago

In the short term, we plan to cache the output of each process in hltd and in the merger service that handles HLT output (i.e. at every stage of fastHadd merging), so that a full set of process files is always merged, avoiding missing statistics when a fraction of jobs does not produce output in a lumisection. The latest file written out by a process will be cached, on the assumption that the last file has the most up-to-date statistics even if lumisections are not closed in the same order they are opened in the HLT job.
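
A minimal sketch of this caching idea, with hypothetical names and types (this is not the actual hltd/merger code): the merger remembers the most recent DQM file from each process, and when assembling the files to merge for a lumisection it falls back to the cached file for every process that produced no output.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical descriptor of one per-lumisection DQM (.pb) file from one HLT process.
struct DQMFile {
  int lumisection = 0;
  std::string path;
};

class CachingMerger {
public:
  // Called whenever an HLT process delivers a DQM file for a lumisection.
  void onFileReceived(std::string const& processId, DQMFile const& file) {
    received_[file.lumisection][processId] = file;
    cache_[processId] = file;  // remember the latest output of each process
  }

  // Files to merge for one lumisection: real outputs where available,
  // otherwise the cached latest output of each known process.
  std::vector<DQMFile> filesToMerge(int lumisection) const {
    std::vector<DQMFile> out;
    auto const ls = received_.find(lumisection);
    for (auto const& entry : cache_) {
      std::string const& processId = entry.first;
      if (ls != received_.end() && ls->second.count(processId) != 0)
        out.push_back(ls->second.at(processId));
      else
        out.push_back(entry.second);  // fall back to the cached file for this process
    }
    return out;
  }

private:
  std::map<int, std::map<std::string, DQMFile>> received_;  // LS -> process -> file
  std::map<std::string, DQMFile> cache_;                    // process -> latest file seen
};
```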

fwyzard commented 8 months ago

The latest file written out by a process will be cached, on the assumption that the last file has the most up-to-date statistics even if lumisections are not closed in the same order they are opened in the HLT job.

"latest" as in "most recent in time" or as in "highest lumisection number" ?

fwyzard commented 8 months ago

(I think we should use the "highest lumisection number" processed by each job)

smorovic commented 8 months ago

I assumed that the most recent output from a job will have more up-to-date statistics. When N+1 is closed before N, won't the N version of the histograms be filled for both N+1 and N? If so, taking the most recent output will be more complete.

Orthogonal to which way is correct, it's not a big difference, and either way it is less serious than the current problem: the worst case is that a small number of lumisections (towards the end) remain incomplete. Actually, the framework (still) has to cycle all streams through all lumisections, and in practice they will get closed in the same order as they are opened, so it should be the same...
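
To make the difference between the two options concrete, here is a sketch of the two cache-update policies being discussed (hypothetical names and types, not the real hltd code); as noted above, when lumisections are closed in order the two choices coincide.

```cpp
#include <map>
#include <string>

// Same hypothetical per-process DQM file descriptor as in the sketch above.
struct DQMFile {
  int lumisection = 0;
  std::string path;
};

using Cache = std::map<std::string, DQMFile>;  // process id -> cached file

// "Most recent in time": the last file written by a process always wins,
// regardless of its lumisection number.
void updateCacheMostRecent(Cache& cache, std::string const& processId, DQMFile const& file) {
  cache[processId] = file;
}

// "Highest lumisection number": only overwrite when the new file belongs to a later LS,
// so an out-of-order, lower-numbered lumisection never replaces the cached entry.
void updateCacheHighestLS(Cache& cache, std::string const& processId, DQMFile const& file) {
  auto it = cache.find(processId);
  if (it == cache.end() || file.lumisection > it->second.lumisection)
    cache[processId] = file;
}
```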

fwyzard commented 5 months ago

The impact of the interplay between the current DQM and merger systems can easily be spotted in the DQM plots from the first throughput measurement with 1200b done on 14/04/2024.

DQM plot observed online while the run was ongoing:

[plot: "Throughput (retired)"]

same DQM plot observed online after the end of the run and the final merge:

[plot: same histogram after the final merge]

link to the online GUI

The second plot shows only partial statistics, corresponding to a loss of about 13% of the FUs (or BUs).

fwyzard commented 5 months ago

I understand that @smorovic has implemented a workaround in the micro-merger step, where the latest DQM plots from each job are kept and used if no plots are produced for a given lumisection.

In this case, the remaining loss in statistics seems to be due to the mini- and macro-merger steps.

missirol commented 3 weeks ago

Coming back (late) to this, I understood the following: caching of the latest fastHadd output of each HLT process was introduced in hltd at the micro-merging level [1], and the mergers were subsequently updated to cache the DQMHistogram files from all FUs/RUBUs [2].

This effectively solves the issue. I checked a few recent runs, and I could not find evidence of DQM histograms with missing entries (I looked at the rate of "Calibration" events, which is constant in collision runs, and it was one of the plots where the issue was easiest to see).

In principle, this issue could be closed, unless experts prefer to use it to discuss any further improvements.


[1] From DAQ weekly reports on Jan-29, 2024.

Prepared fastHadd caching in hltd (for the next version):

  • Always merge last produced output (last LS with events) of each CMSSW process (including crashed jobs) to have complete statistics in runs ending with low rate.
  • Currently only micro-merging level is covered - any FU having events in a LS will produce output

[2] See DAQ weekly reports on Jun-6, 2024.

Mergers updated to cache DQMHistogram files from all FUs/RUBUs

mmusich commented 3 weeks ago

+hlt