Open makortel opened 1 year ago
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
assign dqm
FYI @cms-sw/jetmet-pog-l2
New categories assigned: dqm
@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
These were first seen in https://github.com/cms-sw/cmssw/pull/39699 (see logs), so could the change in #39699 be responsible for this?
An earlier test in https://github.com/cms-sw/cmssw/pull/39699#issuecomment-1278855112 reports only 6 DQM histograms with comparison differences, which would suggest that #39699 is not responsible for the differences (or at least the answer is less clear).
On the other hand, the occurrence of these differences seems to be random and not very frequent, so it could be that the PR responsible for this has clean comparisons in its tests.
This got somehow fixed, since the same histos now reproduce nicely. Can this get closed?
Sure
Seems that we are again seeing these
Documenting here: in https://github.com/cms-sw/cmssw/pull/41019#issuecomment-1463003532, workflow 20834.0 shows differences in
JetMET/METValidation/slimmedMETsPuppi/{METResolution_GenMETTrue_InMETBins, METUnc_ElectronEnDown, METUnc_ElectronEnUp}
JetMET/METValidation/PfMetT0pcT1/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/PfMetT1/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/pfMet/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/pfMetT0pc/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/slimmedMETs/METResolution_GenMETTrue_InMETBins
JetMET/Jet/CleanedslimmedJetsAK8/Pt_profile
ParticleFlow/slimmedMETValidation/CompWithPFMET/{profileRMS_delta_set_VS_set_,profile_delta_set_VS_set_}
Also 20834.75, 20834.76, 20896.0, 20900.0, 21034.999, and 23234.0 show differences
(also https://github.com/cms-sw/cmssw/pull/41016#issuecomment-1462972599 can be related)
In https://github.com/cms-sw/cmssw/pull/41328#issuecomment-1509411454 there are also many differences in many JetMET folders in workflows 23234.0, 23634.0, 23634.911, 23696.0, 23700.0, 23834.999.
Curiously the baseline was run on Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (Broadwell) and the PR test on Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (Cascade Lake).
assign upgrade
New categories assigned: upgrade
@AdrianoDee,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks
assign reconstruction, simulation
New categories assigned: reconstruction,simulation
@mdhildreth,@mandrenguyen,@clacaputo,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks
I looked in a bit more detail at the differences in https://github.com/cms-sw/cmssw/pull/42123#issuecomment-1611881110. I noticed that the CPUs in this case were
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
Intel(R) Xeon(R) Gold 5218 (Cascade Lake)
Could some TensorFlow / ONNX ML model be somehow sensitive to the use of AVX-512 instructions? (we have seen similar behavior with some ML models before)
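One way to check whether a given test node exposes AVX-512 at all is to look at the CPU feature flags that Linux reports in /proc/cpuinfo. Broadwell parts such as the E5-2683 v4 do not have AVX-512, while Cascade Lake parts do. The following is only a minimal sketch: the helper name `avx512_flags` is hypothetical, and it assumes the usual Linux `flags : ...` line format.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted avx512* feature flags found in /proc/cpuinfo-style text.

    Assumption: the text contains lines of the form "flags : fpu vme ...",
    as written by Linux on x86 machines.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" lines list CPU features; skip everything else.
        if sep and key.strip() == "flags":
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return sorted(flags)
```

On a test machine one could pass the contents of /proc/cpuinfo to this function; an empty list would indicate that no AVX-512 flag is reported. This only shows what the hardware offers, not whether a particular ML backend actually uses those instructions.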
In https://github.com/cms-sw/cmssw/pull/42507#issuecomment-1670824559
Intel(R) Xeon(R) Gold 5218 (Cascade Lake)
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
https://github.com/cms-sw/cmssw/pull/42540#issuecomment-1674376843 and https://github.com/cms-sw/cmssw/pull/42534#issuecomment-1673646530 are probably examples of this issue (I do not know how to find the specs of the machines used for the tests).
(I do not know how to find the specs of the machines used for the tests).
I would really be interested to know how to do that as well!
You can look at the end of the framework job report XML file (JobReport<N>.xml) of e.g. any step of any matrix workflow (as they are all run on the same machine, it doesn't matter which one). There is something along the lines of
<PerformanceSummary Metric="SystemCPU">
<Metric Name="CPUModels" Value="Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz"/>
that tells the CPU model.
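For comparing many PR tests, the lookup described above could be scripted. The following is a rough sketch, not an official tool: the function name `cpu_models` and the wrapper elements in `SAMPLE` are assumptions for illustration, and only the `PerformanceSummary` and `Metric` elements are taken from the fragment quoted above.

```python
import xml.etree.ElementTree as ET


def cpu_models(xml_text):
    """Return the CPUModels values found in a framework job report.

    Searches the whole tree, so it does not depend on where the
    PerformanceSummary element is nested (the nesting is an assumption).
    """
    root = ET.fromstring(xml_text)
    models = []
    for summary in root.iter("PerformanceSummary"):
        if summary.get("Metric") != "SystemCPU":
            continue
        for metric in summary.iter("Metric"):
            if metric.get("Name") == "CPUModels":
                models.append(metric.get("Value"))
    return models


# Hypothetical minimal report; a real JobReport<N>.xml contains much more.
SAMPLE = """<FrameworkJobReport>
  <PerformanceReport>
    <PerformanceSummary Metric="SystemCPU">
      <Metric Name="CPUModels" Value="Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz"/>
    </PerformanceSummary>
  </PerformanceReport>
</FrameworkJobReport>"""

print(cpu_models(SAMPLE))
```

For a real file, the text read from JobReport<N>.xml would be passed in instead of `SAMPLE`.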
In https://github.com/cms-sw/cmssw/pull/42540#issuecomment-1674376843
Intel(R) Xeon(R) Gold 5218 CPU (Cascade Lake)
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
In https://github.com/cms-sw/cmssw/pull/42534#issuecomment-1673646530
Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
Another example in https://github.com/cms-sw/cmssw/pull/42554#issuecomment-1675809497.
Intel(R) Xeon(R) Gold 5218 CPU (Cascade Lake)
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
Another example in https://github.com/cms-sw/cmssw/pull/42512#issuecomment-1678611490.
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
Another example in https://github.com/cms-sw/cmssw/pull/42610#issuecomment-1685128180 :
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
Another example in https://github.com/cms-sw/cmssw/pull/42707#issuecomment-1703882846 :
Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
It seems that we have non-reproducibility in some JetMET/{Jet,MET}Validation histograms that are visible in PR tests. So far seen (at least) in