cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Non-reproducibility in JetMET/{Jet,MET}Validation histograms in phase2 workflows #39754

Open makortel opened 1 year ago

makortel commented 1 year ago

It seems that we have non-reproducibility in some JetMET/{Jet,MET}Validation histograms that are visible in PR tests. So far seen (at least) in

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign dqm

FYI @cms-sw/jetmet-pog-l2

cmsbuild commented 1 year ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

smuzaffar commented 1 year ago

These were first seen in https://github.com/cms-sw/cmssw/pull/39699 (see logs), so could the change in #39699 be responsible for this?

makortel commented 1 year ago

An earlier test in https://github.com/cms-sw/cmssw/pull/39699#issuecomment-1278855112 reports only 6 DQM histograms with comparison differences, which would suggest that #39699 is not responsible for the differences (or at least that the answer is less clear).

On the other hand, these differences seem to occur randomly and not very frequently, so it could be that the PR responsible for this has clean comparisons in its tests.

perrotta commented 1 year ago

This got somehow fixed, since the same histos now reproduce nicely. Can this get closed?

makortel commented 1 year ago

Sure

makortel commented 1 year ago

Seems that we are again seeing these

makortel commented 1 year ago

Documenting here: in https://github.com/cms-sw/cmssw/pull/41019#issuecomment-1463003532, workflow 20834.0 shows differences in

JetMET/METValidation/slimmedMETsPuppi/{METResolution_GenMETTrue_InMETBins, METUnc_ElectronEnDown, METUnc_ElectronEnUp}
JetMET/METValidation/PfMetT0pcT1/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/PfMetT1/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/pfMet/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/pfMetT0pc/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/slimmedMETs/METResolution_GenMETTrue_InMETBins
JetMET/Jet/CleanedslimmedJetsAK8/Pt_profile
ParticleFlow/slimmedMETValidation/CompWithPFMET/{profileRMS_delta_set_VS_set_,profile_delta_set_VS_set_}

Also 20834.75, 20834.76, 20896.0, 20900.0, 21034.999, and 23234.0 show differences

(also https://github.com/cms-sw/cmssw/pull/41016#issuecomment-1462972599 can be related)

makortel commented 1 year ago

In https://github.com/cms-sw/cmssw/pull/41328#issuecomment-1509411454 there are also many differences across many JetMET folders in workflows 23234.0, 23634.0, 23634.911, 23696.0, 23700.0, 23834.999.

Curiously the baseline was run on Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (Broadwell) and the PR test on Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (Cascade Lake).

makortel commented 1 year ago

assign upgrade

cmsbuild commented 1 year ago

New categories assigned: upgrade

@AdrianoDee,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

assign reconstruction, simulation

cmsbuild commented 1 year ago

New categories assigned: reconstruction,simulation

@mdhildreth,@mandrenguyen,@clacaputo,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

I looked in a bit more detail at the differences in https://github.com/cms-sw/cmssw/pull/42123#issuecomment-1611881110. I noticed in this case

Could some TensorFlow / ONNX ML model be somehow sensitive to the use of AVX-512 instructions? (we have seen similar behavior with some ML models before)
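For illustration only (this is not a diagnosis of this issue): one generic way a different vector width can change numerical results is by changing the order in which floating-point values are accumulated, since floating-point addition is not associative. Below is a minimal Python sketch under that assumption. `lane_sum` is a hypothetical stand-in for a vectorized reduction, and the input values are contrived to make the effect visible.

```python
def sequential_sum(values):
    """Add the values one after another, as a scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes):
    """Hypothetical model of a SIMD reduction: element i is accumulated
    in lane i % lanes, and the per-lane partial sums are added at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)


# Contrived input: 1.0 is below the rounding granularity of 1e16 in
# double precision, so whether it survives depends on the summation order.
values = [1e16, 1.0, -1e16, 1.0]

if __name__ == "__main__":
    print("scalar order :", sequential_sum(values))
    print("2-lane order :", lane_sum(values, 2))
```

The two orders give different sums for the same input. A difference of this kind in an ML model's output can move a value across a bin edge or a selection threshold, which is one way such an effect could end up visible in a histogram comparison.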

makortel commented 1 year ago

In https://github.com/cms-sw/cmssw/pull/42507#issuecomment-1670824559

missirol commented 1 year ago

https://github.com/cms-sw/cmssw/pull/42540#issuecomment-1674376843 and https://github.com/cms-sw/cmssw/pull/42534#issuecomment-1673646530 are probably examples of this issue (I do not know how to find the specs of the machines used for the tests).

mmusich commented 1 year ago

> (I do not know how to find the specs of the machines used for the tests).

I would really be interested to know how to do that as well!

makortel commented 1 year ago

> > (I do not know how to find the specs of the machines used for the tests).
>
> I would really be interested to know how to do that as well!

You can look at the end of the framework job report XML file (JobReport<N>.xml) of e.g. any step of any matrix workflow (as they are all run on the same machine, it doesn't matter which one). There is something along the lines of

<PerformanceSummary Metric="SystemCPU">
  <Metric Name="CPUModels" Value="Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz"/>

that tells the CPU model.
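A minimal Python sketch of extracting that value with a script, assuming the job report is well-formed XML and contains the `PerformanceSummary` / `Metric` elements shown above. The function name, the `Report` root element, and the `SAMPLE` document are made up for illustration; a real job report has more content around this fragment.

```python
import xml.etree.ElementTree as ET


def cpu_models_from_job_report(xml_text):
    """Return the Value of every CPUModels metric found inside a
    <PerformanceSummary Metric="SystemCPU"> element."""
    root = ET.fromstring(xml_text)
    models = []
    # iter() searches at any depth, so the exact nesting of
    # PerformanceSummary inside the report does not matter here.
    for summary in root.iter("PerformanceSummary"):
        if summary.get("Metric") != "SystemCPU":
            continue
        for metric in summary.iter("Metric"):
            if metric.get("Name") == "CPUModels":
                models.append(metric.get("Value"))
    return models


# Illustrative stand-in for a job report; only the inner fragment
# follows the snippet quoted above, the root element is an assumption.
SAMPLE = """<Report>
  <PerformanceSummary Metric="SystemCPU">
    <Metric Name="CPUModels" Value="Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz"/>
  </PerformanceSummary>
</Report>"""

if __name__ == "__main__":
    for model in cpu_models_from_job_report(SAMPLE):
        print(model)
```

For a file on disk, the same function can be called on the file's contents, e.g. `cpu_models_from_job_report(open(path).read())` with `path` pointing at one of the JobReport files.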

In https://github.com/cms-sw/cmssw/pull/42540#issuecomment-1674376843

In https://github.com/cms-sw/cmssw/pull/42534#issuecomment-1673646530

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42554#issuecomment-1675809497.

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42512#issuecomment-1678611490.

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42610#issuecomment-1685128180 :

missirol commented 1 year ago

Another example in https://github.com/cms-sw/cmssw/pull/42707#issuecomment-1703882846 :