cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

reproducibility problems in hiSelectedPixelVertex RECO and in DQM fits in 102X #23105

Closed: slava77 closed this issue 5 years ago.

slava77 commented 6 years ago

There are somewhat random changes in hiSelectedPixelVertex in reco in the HI workflows. This started showing up in 20-40% of jenkins tests from CMSSW_10_2_X_2018-04-26-2300 onwards (8 out of about 35 tests have this issue, while at least 60 tests before this IB do not have it). This coincides with the merge of https://github.com/cms-sw/cmsdist/pull/3957, which advanced ROOT from 6.10.09 to 6.12.07.

HI vertex reco uses either hiPixelAdaptiveVertex (which comes from hiPixelMedianVertex, which is similar to the PrimaryVertexProducer used in the pp reco) or HIPixelMedianVtxProducer, which has some minuit-based fitting. I suspect that the problem is in RecoHI/HiTracking/plugins/HIPixelMedianVtxProducer.cc
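For orientation, the minuit-backed piece is essentially a ROOT Gaussian fit to a 1D histogram of cluster z positions around the median; the following is an illustrative sketch only, not the actual producer code (histogram, range and starting values are invented):

// Illustrative sketch, not the actual HIPixelMedianVtxProducer code:
// a single-Gaussian ROOT fit around a median z, minimized by Minuit.
#include "TF1.h"
#include "TH1F.h"

void fitAroundMedian(TH1F& histo, double med) {
  TF1 f1("f1", "gaus", -0.5, 0.5);                // invented range
  f1.SetParameters(histo.GetMaximum(), 0., 0.05); // invented start values
  histo.Fit(&f1, "QN");                           // quiet, don't store/draw
  double z  = med + f1.GetParameter(1);           // fitted vertex z
  double dz = f1.GetParError(1);                  // and its uncertainty
  (void)z; (void)dz;
}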

@Dr15Jones @dan131riley please take a look at HIPixelMedianVtxProducer to see if it has some known features that may have changed with the new ROOT.

cmsbuild commented 6 years ago

A new Issue was created by @slava77 Slava Krutelyov.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

slava77 commented 6 years ago

assign reconstruction,core

cmsbuild commented 6 years ago

New categories assigned: core,reconstruction

@Dr15Jones,@smuzaffar,@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 6 years ago

@smuzaffar @slava77 by Jenkins tests, do you mean the RelVals or tests done for pull request? I had thought the pull request tests were always run single threaded.

slava77 commented 6 years ago

On 5/1/18 2:14 PM, Chris Jones wrote:

@smuzaffar @slava77 by Jenkins tests, do you mean the RelVals or tests done for pull request? I had thought the pull request tests were always run single threaded.

This is for tests of pull requests, showing up in tests of unrelated PRs. I think that these are all single threaded.

makortel commented 6 years ago

Since "minuit" and "fitting" were mentioned, let me note that e.g. in #23065 (some of?) the *_Mean and *_Sigma histograms produced by DQMGenericClient start to show differences as well (with their input histograms staying the same). See here https://github.com/cms-sw/cmssw/pull/23065#issuecomment-385767088 and onwards for more discussion there.

Just wondering whether the root cause of these irreproducibilities could be the same as for hiSelectedPixelVertex.

slava77 commented 6 years ago

@makortel I confirm there are also issues with the reproducibility of DQM plots. This is happening in many workflows, which makes checking PRs much more inconvenient.

dan131riley commented 6 years ago

The only substantive minuit change that I see is https://github.com/cms-sw/root/commit/5efb8d6442f61b66a7020b3adc4f2280a5bdee7d#diff-3e4e3008064c935331e2a484cd9157a1 but that shouldn't make any difference for single-threaded jobs, and looking at the logs confirms that these jobs were single-threaded.

Expanding scope to all of root/math, I still don't see any changes that I would expect to affect single-threaded jobs.

Dr15Jones commented 6 years ago

So the DQMGenericClient histograms appear to come from here https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/DQMServices/ClientConfig/plugins/DQMGenericClient.cc#L865

Dr15Jones commented 6 years ago

So it appears the other fit @slava77 was thinking about is

https://github.com/cms-sw/cmssw/blob/4f8dcb180b08a30fae793c4c8b99bcb399e045c6/RecoHI/HiTracking/plugins/HIPixelMedianVtxProducer.cc#L164

Dr15Jones commented 6 years ago

@slava77 @makortel do either of you have an exact workflow showing the problem? I'm thinking a run of valgrind is in order.

slava77 commented 6 years ago

140.53 step2 is the place where it happens for the reco values.

DQM diffs in jenkins show up in many places (in 25 out of 29 running workflows in one of the cases that I checked)

makortel commented 6 years ago

@Dr15Jones for the DQM e.g. workflow 10024.0 HARVESTING step.

dan131riley commented 6 years ago

@slava77 in what sense are the changes "somewhat random"? Are there different changes from run to run, or are they consistent but with no obvious pattern? Comparing, for example

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-04-30-1100+23065/26340/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017/Tracking_TrackFromPV_generalFromPVPt09_trackingParticleRecoAsssociation.html

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-01-1100+23106/26350/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017/Tracking_TrackFromPV_generalFromPVPt09_trackingParticleRecoAsssociation.html

the differences in the failing plots look identical to me, which has implications for what kind of change we are hunting.

makortel commented 6 years ago

@dan131riley Earlier these histograms had exactly the same values, unless any of the input histograms to these fits changed. In that case there have also been numerical changes in the fit results of unrelated histograms (because of some internal state of Minuit, as far as I've understood).

Dr15Jones commented 6 years ago

@pcanal FYI

Dr15Jones commented 6 years ago

One thing I noticed: in both places that seem to be showing differences, neither checks whether the fit failed.

dan131riley commented 6 years ago

Between the versions there are substantial changes to the mathcore Fitter and hist TF1 classes, mostly for IMT and vectorization. I can certainly imagine those could change what gets left in the results of failed fits.

dan131riley commented 6 years ago

That theory isn't looking good...I modified HIPixelMedianVtxProducer with

auto status = histo.Fit(&f1,"QN");
if (status) {
  edm::LogSystem("HIPixelMedianVtxProducer") << "Fit failed " << med + f1.GetParameter(1) << " +- " << f1.GetParError(1) << " cm";
}

and ran 140.53 step2 in CMSSW_10_2_0_pre1 and CMSSW_10_2_0_pre1_ROOT612. I got two fit failures (on the same event) in each, and they all printed the same values (to the resolution of the printout) in both releases:

Fit failed 7.6079 +- 0.145983 cm

Next step, print the result on every event and see if they ever differ, or if this is just the wrong tree.

dan131riley commented 6 years ago

Looking at all fits and boosting the precision I get one difference:

-Fit result 6.041964245417202228 +- 0.001913964442836382784 cm
+Fit result 6.041964245416591162 +- 0.001913964442221971485 cm

I have no idea if that's large enough to produce the differences @slava77 saw.

slava77 commented 6 years ago

On 5/2/18 11:12 AM, Dan Riley wrote:

Looking at all fits and boosting the precision I get one difference:

-Fit result 6.041964245417202228 +- 0.001913964442836382784 cm
+Fit result 6.041964245416591162 +- 0.001913964442221971485 cm

I have no idea if that's large enough to produce the differences @slava77 saw.

The comparison is rather strict, so I think that this difference can trigger a reported difference in the DQM plots.

The change in hiSelectedPixelVertex is much more significant in the end. I don't know if this is just a butterfly effect or if in this case the result changes a lot from the start.

dan131riley commented 6 years ago

The DQMGenericClient fit is actually from here:

https://github.com/cms-sw/cmssw/blob/fcaf82171989353b11cafb15c3f20e6a6bf4dd70/DQMServices/ClientConfig/src/FitSlicesYTool.cc#L14

from here:

https://github.com/cms-sw/cmssw/blob/fcaf82171989353b11cafb15c3f20e6a6bf4dd70/DQMServices/ClientConfig/plugins/DQMGenericClient.cc#L718
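For anyone mapping the names: TH2::FitSlicesY fits a Gaussian to the y-projection of each x bin and writes one TH1 per fit parameter into the current directory, which (as far as I understand) is where the *_Mean and *_Sigma contents come from. A standalone ROOT sketch, not the CMSSW code:

// Standalone sketch of what TH2::FitSlicesY produces (not the CMSSW code).
#include <string>
#include "TDirectory.h"
#include "TH1D.h"
#include "TH2F.h"

void slices(TH2F& h2) {
  h2.FitSlicesY();  // default: Gaussian fit to each x bin's y-projection
  // ROOT stores one TH1D per fit parameter in the current directory:
  std::string n = h2.GetName();
  auto* mean  = (TH1D*)gDirectory->Get((n + "_1").c_str());  // parameter 1: mean vs x
  auto* sigma = (TH1D*)gDirectory->Get((n + "_2").c_str());  // parameter 2: sigma vs x
  (void)mean; (void)sigma;
}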

slava77 commented 6 years ago

Curiously, in https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23119/26369/ there are no changes in either the DQM or the HI. OTOH, the same IB but a different PR https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23065/26360/ has differences in both the HI and the DQM plots in many workflows (not related to the PR).

It looks like the origin of the change is not completely random and is likely a feature of the CMSSW (local) build with perhaps library memory re-mapping making a consistent difference or no difference.

dan131riley commented 6 years ago

On May 3, 2018, at 12:16, Slava Krutelyov notifications@github.com wrote:

It looks like the origin of the change is not completely random and is likely a feature of the CMSSW (local) build with perhaps library memory re-mapping making a consistent difference or no difference.

It's not entirely consistent, even...compare

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23065/26360/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23120/26370/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/

One has 158 failures, the other 181. The difference is 23 additional failures in JetMET/Jet/CleanedslimmedJetsPuppi, and these failures, in

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23120/26370/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/JetMET_Jet_CleanedslimmedJetsPuppi.html

are not mean & sigma like all the others. Guess valgrind is in order.

slava77 commented 6 years ago

On 5/3/18 10:02 AM, Dan Riley wrote:

One has 158 failures, the other 181. The difference is 23 additional failures in JetMET/Jet/CleanedslimmedJetsPuppi, and these failures, in

https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23120/26370/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/JetMET_Jet_CleanedslimmedJetsPuppi.html

This is part of the PR changes in #23120; the "Puppi" plots are expected to change here.

are not mean & sigma like all the others. Guess valgrind is in order.

smuzaffar commented 6 years ago

We can run the comparison for the same IB twice to see if it generates random results.

dan131riley commented 6 years ago

Running valgrind on step4 gives thousands of uninitialized values used with tracebacks like

==406690==    by 0x1BFE5AA3: Comp2RefChi2::setMessage()
==406690==    by 0x1BFED16D: MonitorElement::runQTests()
==406690==    by 0x1BFF7458: DQMStore::runQTests()
==406690==    by 0x3845FFC0: QualityTester::performTests()
==406690==    by 0x4B9254A: edm::EDAnalyzer::doEndLuminosityBlock(edm::LuminosityBlockPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*)

because a Comp2RefChi2 is getting used without being initialized. Modifying https://github.com/cms-sw/cmssw/blob/e6ba533090d6818a2cb8be31be56d5fb21802d26/DQMServices/Core/interface/QTest.h#L209 to initialize to 0 makes all of them go away but doesn't change the results, and there are no further hints in the valgrind memcheck results.
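For reference, the change tried above is just an in-class initializer on the flagged member; a sketch with placeholder names only (the actual fields live in DQMServices/Core/interface/QTest.h):

// Placeholder sketch; member names are invented, not the real QTest.h ones.
class Comp2RefChi2Sketch {
private:
  double chi2_ = 0.;  // in-class initializer: silences the valgrind report
  int    ndof_ = 0;   // but, as noted above, does not change the results
};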

Using ROOT 6.12 on step4 (always with the same step3 input) I reproduce the comparison failures with a 6.10 baseline, but I don't get the same differences as the PR tests. I do get consistent differences running on a single system/architecture, but I get different changes running on different systems/architectures. For example, comparing a run with 6.12 on a Skylake machine vs. a Haswell system I get comparison failures in the same places but they are numerically different and not always in the same bins. Don't know yet if it's different on every system or tied to the architecture. I don't see that kind of variation on 6.10.

The changes are all too small to discern on the comparison graphs, but still disturbing.

dan131riley commented 6 years ago

Adding

if (startDir == "Tracking/TrackFromPV/generalFromPVPt09_trackingParticleRecoAsssociation" && namePrefix == "h_dxypulleta") {
  srcME->getTH2F()->Print("all");
  meanME->getTH1()->Print("all");
  sigmaME->getTH1()->Print("all");
}

to DQMGenericClient::computeResolution and running on two different systems, srcME is identical to the precision of the printouts, while meanME and sigmaME are indeed different:

-TH1.Print Name  = h_dxypulleta_Mean, Entries= 60, Total sum= 20.963
+TH1.Print Name  = h_dxypulleta_Mean, Entries= 60, Total sum= 22.5034

-TH1.Print Name  = h_dxypulleta_Sigma, Entries= 60, Total sum= 277.407
+TH1.Print Name  = h_dxypulleta_Sigma, Entries= 60, Total sum= 277.383

and there are lots of differences in individual bins, some pretty significant:

- fSumw[16]=19.2041, x=-1.45, error=37.0525
- fSumw[17]=8.21385, x=-1.35, error=29.8819
+ fSumw[16]=19.21, x=-1.45, error=34.4841
+ fSumw[17]=8.2056, x=-1.35, error=25.2786

Running twice on the same machine I get no differences. Next step is to package up srcME and see if I can make a small test case.

dan131riley commented 6 years ago

The inconsistencies between different platforms go away if I either change FitSlicesYTool to do

h->FitSlicesY(nullptr, 0, -1, 0, "QNR SERIAL");

instead of

h->FitSlicesY();

or if I add to the config

process.add_(cms.Service("InitRootHandlers", EnableIMT = cms.untracked.bool(False)))

to turn off IMT. This is curious, since these jobs are single-threaded. InitRootHandlers does call ROOT::EnableImplicitMT() regardless of whether the job is multi-threaded (@Dr15Jones is this a bug or a feature?). I don't see extra threads being created, but apparently just having IMT on changes the fit evaluation in system-dependent ways.

My test program does not, so far, reproduce this behavior.
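For reference, the ROOT-level switch that the InitRootHandlers parameter controls is the global IMT flag; a standalone sketch, not the CMSSW service code:

// Standalone sketch of the global ROOT IMT switch (not the CMSSW service code).
#include "TROOT.h"

void setImt(bool enable) {
  if (enable)
    ROOT::EnableImplicitMT();   // what InitRootHandlers does today, even for 1 thread
  else
    ROOT::DisableImplicitMT();  // what EnableIMT = cms.untracked.bool(False) requests
  bool on = ROOT::IsImplicitMTEnabled();
  (void)on;
}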

dan131riley commented 6 years ago

For non-threaded jobs, cmsRun does set the number of TBB threads to 1, so there aren't additional threads created. IMT does get used in ROOT::Fit::FitUtil::EvaluateChi2 to create a bunch of tasks evaluating the chi2, and this is new in 6.12. It's at least plausible that, even with only one thread, the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors. I don't see any random numbers being used in the part that gets broken into TBB tasks (and I would expect much larger differences if there were). It's possible that we have some pathological cases where the order of summation makes a difference.
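The summation-order effect is easy to see in isolation; here is a standalone illustration (nothing ROOT-specific, the numbers are invented) that the same values summed with different chunk sizes can differ in the last bits:

// Standalone illustration: chunked partial sums, as an IMT-style split would
// produce, need not be bit-identical for different chunk sizes.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

double chunkedSum(const std::vector<double>& v, std::size_t chunk) {
  double total = 0.;
  for (std::size_t i = 0; i < v.size(); i += chunk) {
    double partial = 0.;  // each "task" accumulates its own partial sum
    for (std::size_t j = i; j < std::min(i + chunk, v.size()); ++j)
      partial += v[j];
    total += partial;     // partial sums combined at the end
  }
  return total;
}

int main() {
  std::vector<double> v;
  for (int i = 0; i < 10000; ++i)
    v.push_back(std::sin(i) * std::pow(10., i % 12));  // wide dynamic range
  std::printf("chunk=128 : %.17g\nchunk=1000: %.17g\n",
              chunkedSum(v, 128), chunkedSum(v, 1000));
  return 0;
}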

I would think we'd want to turn off IMT for the PR tests, possibly turn it off by default when we aren't multi-threaded. Setting the "SERIAL" fit option is also a possibility, especially given that we know the default MINUIT minimizer can get into trouble due to its reliance on thread local storage.

makortel commented 6 years ago

I think the "SERIAL" fit option would be good enough for both cases (I doubt the threading would bring any benefit in practice, and for DQMGenericClient HARVESTING is supposed to be run single-threaded anyway).

Dr15Jones commented 6 years ago

I also believe that "SERIAL" is the best way to go.

dan131riley commented 6 years ago

Adding "SERIAL" to the fit in HIPixelMedianVtxProducer also resolves the one difference I saw there, so I'll make a PR with both of those.
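Presumably the patch amounts to adding the option string to the existing call; a sketch in terms of the instrumented snippet earlier in this thread:

// Sketch based on the earlier snippet: "SERIAL" keeps the chi2 evaluation
// off the IMT path, so the result no longer depends on the machine's core count.
auto status = histo.Fit(&f1, "QN SERIAL");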

Dr15Jones commented 6 years ago

the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors

If a parallel_for is being used by ROOT for this code then I would definitely bet that TBB would 'chunk' up the processing range of the for loop differently based on the number of cores on the machine.

makortel commented 6 years ago

There may be further cases of fits (or similar irreproducibilities), see #23160.

Dr15Jones commented 6 years ago

@dan131riley ROOT does chunk based on the number of CPUs. The chunk size for the integration is computed here https://github.com/root-project/root/blob/5b25608c138fa826cb8bca0bebe26d3774b02179/math/mathcore/src/FitUtil.cxx#L393

where the implementation of setAutomaticChunking is here

https://github.com/root-project/root/blob/5b25608c138fa826cb8bca0bebe26d3774b02179/math/mathcore/src/FitUtil.cxx#L1756

So different chunk sizes means different parts being summed together in different orders.

dan131riley commented 6 years ago

@Dr15Jones the ROOT use of the number of CPUs for chunking explains it, it didn't really make sense that TBB would use the number of physical CPUs and not the configured thread limit. PR #23230 submitted.

dan131riley commented 6 years ago

@makortel the 140.53 diffs are actually what led @slava77 to open this issue, and the hope is that the HIPixelMedianVtxProducer patch will resolve it (it does fix the only difference I saw in the step2 vertexing, but I didn't run the entire workflow to verify).

slava77 commented 6 years ago

I was looking at diffs in #23258 made with CMSSW_10_2_X_2018-05-23-1100 (after #23230 was merged). It looks like fits in the following are still affected by non-reproducibility

pcanal commented 6 years ago

@dan131riley In the JIRA ticket (https://sft.its.cern.ch/jira/browse/ROOT-9422) @lmoneta mentions (and asks):

A small numerical difference is expected, not a large one. It is possible that in some cases the fit converges and in others not, if the fit is rather unstable. Do you have a reproducible test showing a large difference?

dan131riley commented 6 years ago

@pcanal I was not able to create a reproducible test case in the time budgeted for it. I suspect we're seeing an amplification of small differences, where we get small differences in one step that are amplified in a subsequent step, but I haven't traced through the operations to see how plausible that is. These are also low statistics, so they could be ill conditioned. So at this point I don't think we have convincing evidence of a ROOT issue. (I see you already fixed the automatic chunking to use the IMT pool size.)

slava77 commented 5 years ago

This is probably ready to close now that the cases mentioned in https://github.com/cms-sw/cmssw/issues/23105#issuecomment-391592135 have been addressed.