cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.32k forks source link

Irreproducible histogram entries in L1TStage2CaloLayer2 #24279

Open Dr15Jones opened 6 years ago

Dr15Jones commented 6 years ago

So I was looking in detail at the results from L1TStage2CaloLayer2. The histograms that are showing differences are booked in this directory: https://github.com/cms-sw/cmssw/blob/8af7aff587d8f0cec517e6606d0ae4153fd20668/DQM/L1TMonitor/src/L1TStage2CaloLayer2.cc#L78

and are filled within this if block: https://github.com/cms-sw/cmssw/blob/8af7aff587d8f0cec517e6606d0ae4153fd20668/DQM/L1TMonitor/src/L1TStage2CaloLayer2.cc#L239

The comparison tests say only TausPhi, TausEtEtaPhi and TausOcc differ by 1 entry, while TausEta and TausQual histograms are filled that the same time but do not show any difference in entries. I interpret this to mean that we do have the exact same number of Taus in the two jobs. Now the three histograms showing differences all use Phi in their binning. In we had a Phi that was shifted to be out of range of the histogram binning, that would explain the histograms not depending on Phi in the two job show the same entries while the ones use Phi differ by one.

I believe @slava77 has said that some of the L1 plots are known to have differences based on changes in underflow/overflow, so this is probably one of those cases.

[originally comment in #24270 ]

cmsbuild commented 6 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 6 years ago

Looking deeper at workflow 10224, I see the histograms arew generated in step3 by the module labelled l1tStage2CaloLayer2. Dumping that module gives

  >>> print process.l1tStage2CaloLayer2.stage2CaloLayer2TauSource
  cms.InputTag("caloStage2Digis","Tau")

Then looking at caloStage2Digis we see

  >>> print process.caloStage2Digis.dumpPython()
  cms.EDProducer("L1TRawToDigi",
      FWId = cms.uint32(0),
      FWOverride = cms.bool(False),
      FedIds = cms.vint32(1360, 1366),
      MinFeds = cms.uint32(1),
      Setup = cms.string('stage2::CaloSetup'),
      TMTCheck = cms.bool(True)
  )

This appears to use plugins to do the actual creating of the data to be put into the event. It appears the Taus have their hardware Phi set here:

https://github.com/cms-sw/cmssw/blob/78c11d459eefc961686b4b122d2e2850b0cb3ab2/EventFilter/L1TRawToDigi/plugins/implementations_stage2/TauUnpacker.cc#L62

the value of raw_data comes from the object passed into the function:

https://github.com/cms-sw/cmssw/blob/78c11d459eefc961686b4b122d2e2850b0cb3ab2/EventFilter/L1TRawToDigi/plugins/implementations_stage2/TauUnpacker.cc#L44
Dr15Jones commented 6 years ago

The allowed values for the Phi histogram is 0 10 143: https://github.com/cms-sw/cmssw/blob/8af7aff587d8f0cec517e6606d0ae4153fd20668/DQM/L1TMonitor/src/L1TStage2CaloLayer2.cc#L82

From the data encoding, phi can be from 0 to 255. Therefore if the high bit was someone set on, then we could get an overflow. However, looking at the value of the 'missing' Phi, it appears to be ~134 which is a value that already has the 8th bit set.

Dr15Jones commented 6 years ago

image

Dr15Jones commented 6 years ago

assign l1

cmsbuild commented 6 years ago

New categories assigned: l1

@rekovic,@nsmith-,@thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 6 years ago

From the best I can tell, here is where the packing is done https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/EventFilter/L1TRawToDigi/plugins/implementations_stage2/TauPacker.cc#L24

fabiocos commented 6 years ago

@Dr15Jones I think that the variations observed in https://github.com/cms-sw/cmssw/pull/24274#issuecomment-412595608 falls in the same category

fabiocos commented 6 years ago

@rekovic @thomreis there has been any follow-up on this issue? We got reports of non reproducibility even recently...

rekovic commented 6 years ago

@Dr15Jones @fabiocos We are looking into this. attn @thomreis @bundocka

fabiocos commented 6 years ago

@rekovic @thomreis this is now hitting also in the test of unrelated code as TBB: https://github.com/cms-sw/cmsdist/pull/4415

thomreis commented 6 years ago

I have tried to run WF 136.788 in CMSSW_10_3_0_pre6 with and without PR #24837 and get exactly the same number of entries (759) in both cases. Also in agreement with the blue spectrum in [1]. I don't know if the blue one is with or without 24837. So it looks as if the error is not reproducible by just running the matrix.

[1] https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_3_X_2018-10-09-1100+24837/28840/136.788_RunSinglePh2017B+RunSinglePh2017B+HLTDR2_2017+RECODR2_2017reHLT_skimSinglePh_Prompt+HARVEST2017/L1T_L1TStage2CaloLayer2_NonIsolated-Tau.html

kpedro88 commented 6 years ago

There is probably uninitialized memory somewhere. I recommend running valgrind. NB, you should use cmsRunGlibC instead of cmsRun with valgrind.