Open makortel opened 3 years ago
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, xpog
New categories assigned: xpog,reconstruction
@slava77,@fgolf,@mariadalfonso,@perrotta,@jpata,@gouskos you have been requested to review this Pull request/Issue and eventually sign? Thanks
Noticed first in cms-sw/cms-bot#1456 (comment)
more recently in https://github.com/cms-sw/cmssw/pull/32622#issuecomment-757984277
@swozniewski @mbluj
Clicking through your messages, the difference always looks the same (unless the image was copied). Do I understand correctly and does it fit your observations so far that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference?
@kandrosov @lwezenbe fyi and in case you have any ideas spontaneously
Do I understand correctly and does it fit your observations so far that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference?
I believe we have so far seen the difference twice (cms-sw/cms-bot#1456 (comment) and https://github.com/cms-sw/cmssw/pull/32622#issuecomment-757984277), and indeed the difference looks the same in both (the images were not copied between the two PRs). It could be that there are only two possible outcomes, but with so few occurrences it is hard to say.
Looking at the logs for https://github.com/cms-sw/cmssw/pull/32622#issuecomment-757984277, the
tensorflow/core/platform/cpu_feature_guard.cc...
message, which IIRC is an informational note about the CPU capabilities not being fully utilized, shows a difference:
SSE4.1 SSE4.2 AVX AVX2 FMA
SSE4.1 SSE4.2 AVX
SSE4.1 SSE4.2 AVX AVX2 FMA
@smuzaffar @mrodozov do you know if there is a way to find out which nodes were used to run the runTheMatrix jobs (and what their architectures were)?
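For reference, one way to check which of these instruction sets a given Linux node advertises is to read the `flags` line of `/proc/cpuinfo`. The following is only a minimal sketch: the helper names `cpu_flags` and `simd_summary` and the sample text are made up for illustration, and it assumes the usual lowercase kernel spellings (`sse4_1`, `sse4_2`, `avx`, `avx2`, `fma`). It says nothing about how the Jenkins nodes are actually inventoried.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" entry is enough: all cores normally report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_summary(flags):
    """Report presence of the instruction sets discussed in this thread."""
    wanted = ["sse4_1", "sse4_2", "avx", "avx2", "fma"]
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # Linux only
    if os.path.exists(path):
        with open(path) as handle:
            print(simd_summary(cpu_flags(handle.read())))
    else:
        # Hypothetical sample resembling a node without AVX2/FMA.
        sample = "processor\t: 0\nflags\t\t: fpu sse4_1 sse4_2 avx\n"
        print(simd_summary(cpu_flags(sample)))
```

Running this on both the baseline node and the PR test node would show directly whether they differ in AVX2/FMA support.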
@slava77, both baseline and PR tests ran on cmsbuild machines:
cmsbuild77, and its baseline CMSSW_11_3_X_2021-01-08-1100 was run on cmsbuild75 ( https://cmssdt.cern.ch/jenkins/job/ib-run-baseline/20134/ )
cmsbuild03 ( https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/12066/ ), and its baseline CMSSW_11_3_X_2021-01-10-2300 was run on cmsbuild01 ( https://cmssdt.cern.ch/jenkins/job/ib-run-baseline/20153/ )
All these VMs are identical and support SSE4.1 SSE4.2 AVX AVX2 FMA. If the issue is with TF then it could be the VM where the TF external was built. We do have an old physical machine, vocms0315, without AVX2, and that might have been used to build TF.
Do we see the differences if we run multiple times using the same IB?
All these VMs are identical and support SSE4.1 SSE4.2 AVX AVX2 FMA. If the issue is with TF then it could be the VM where the TF external was built. We do have an old physical machine, vocms0315, without AVX2, and that might have been used to build TF.
It's odd, I thought that the warning message from TF was showing the flags present on the node where it's executed. The warning for the 08-1100 set of tests showed that the PR test node did not have AVX2 FMA. I'm not sure I understand how the TF build can affect anything, since it's supposedly the same in the PR and baseline cases with differences made using 08-1100.
Here is another example https://github.com/cms-sw/cmssw/pull/32782#issuecomment-771978588.
If still relevant, the TF warning line is
2021-02-02 19:41:40.193960: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
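To compare such lines between a baseline log and a PR test log, a small script can pull out the instruction list and diff it. This is just a sketch under assumptions: the function names are made up, the regular expression is tuned to the wording of the line quoted above, and the second log line in the example is a constructed variant using the shorter `SSE4.1 SSE4.2 AVX` list quoted earlier in the thread, not a line copied from an actual log.

```python
import re

# Matches the cpu_feature_guard message and captures everything after the
# colon that ends the "... CPU instructions ...:" phrase.
_TF_LINE = re.compile(
    r"cpu_feature_guard\.cc:\d+\].*?CPU instructions[^:]*:\s*(.*)$"
)


def tf_reported_instructions(log_line):
    """Return the instruction names listed in a cpu_feature_guard log line."""
    match = _TF_LINE.search(log_line)
    return match.group(1).split() if match else []


def only_in_first(first, second):
    """Instructions listed in `first` but not in `second`, sorted by name."""
    return sorted(set(first) - set(second))


baseline_line = (
    "2021-02-02 19:41:40.193960: I tensorflow/core/platform/"
    "cpu_feature_guard.cc:142] This TensorFlow binary is optimized with "
    "oneAPI Deep Neural Network Library (oneDNN)to use the following CPU "
    "instructions in performance-critical operations: "
    "SSE4.1 SSE4.2 AVX AVX2 FMA"
)
# Constructed for illustration only: same message with the shorter list.
pr_line = baseline_line.replace(" AVX2 FMA", "")

print(only_in_first(tf_reported_instructions(baseline_line),
                    tf_reported_instructions(pr_line)))
```

A non-empty result would flag a comparison where the two jobs did not report the same instruction set.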
AddressSanitizer reports a stack-buffer-overflow in DeepTauId::fillGrids() (see #32837), could that be the cause for this non-reproducibility? (answer: "no")
Here is another example https://github.com/cms-sw/cmssw/pull/32947#issuecomment-781730246
@swozniewski @kandrosov @lwezenbe @mbluj is there a better understanding of this ?
I'm not aware of any news from the TauPOG side about this. From this thread and the linked ones, the consensus seems to be that the issue is related to dependencies between TF and the hardware, so I didn't feel we could do much about it.
Which hardware differences lead to different outcomes through TF? Can this be reproduced "stand-alone" (i.e. without CMSSW)?
See https://github.com/cms-sw/cmssw/issues/33180 and https://github.com/cms-sw/cmssw/issues/33442 if it helps, although they are not strictly related to this workflow.
there is one more case in #33706 https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-87155f/15037/summary.html
curiously, this time the change shows up also in particleNetMD, which is based on ONNX. Based on jenkins details the baseline here was running on cmsbuild73 (SSE4.1 SSE4.2 AVX AVX2 FMA), vs the PR test on cms-vocms0315 (SSE4.1 SSE4.2 AVX).
@hqucms is the ONNX in this case making a distinction between AVX and AVX2/FMA and running different methods?
is the ONNX in this case making a distinction between AVX and AVX2/FMA and running different methods?
@slava77 Yes, ONNXRuntime has different kernels for AVX and AVX2.
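As a generic illustration of why a code path using FMA can give a slightly different floating-point result from one without it (this is not ONNXRuntime code, and the function names are made up): a fused multiply-add rounds once, while a separate multiply and add round twice. The sketch below emulates the single rounding with exact rational arithmetic, since `math.fma` is only available in recent Python versions.

```python
from fractions import Fraction


def mul_add_two_roundings(a, b, c):
    """a*b is rounded to a double, then the sum is rounded again (no FMA)."""
    return a * b + c


def mul_add_one_rounding(a, b, c):
    """Emulate a fused multiply-add: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the product 1 - 2**-60 is not representable as a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(mul_add_two_roundings(a, b, c))  # 0.0: the product was rounded to 1.0
print(mul_add_one_rounding(a, b, c))   # -2**-60, about -8.67e-19
```

Differences of this size in individual operations can accumulate through a network and show up as small shifts in the output discriminators.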
Showing up again in 1325.81, 136.731 in https://github.com/cms-sw/cmssw/pull/35216
all_OldVSNew_TTbar13nanoEDM106Xv1in2017wf1325p81
nanoaodFlatTable_fatJetTable__DQM_obj_floats__particleNetMD_QCD_100.png
nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu_442.png
all_mini_OldVSNew_RunSinglePh2016Bwf136p731
patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second.png
patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second285.png
patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__100__second312.png
patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__103__second315.png
Dumping the pairDiscri names:
73 pfParticleNetJetTags:probWcq
100 pfParticleNetDiscriminatorsJetTags:HccvsQCD
103 pfParticleNetDiscriminatorsJetTags:ZbbvsQCD
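For future lookups, the index in a `pairDiscriVector` plot name can be mapped back to the discriminator with a few lines. This is a sketch only: the helper name is made up, and the table contains just the three indices dumped above, so any other index returns `None`.

```python
import re

# Index -> discriminator name, taken from the dump above (not a complete list).
PAIR_DISCRI_NAMES = {
    73: "pfParticleNetJetTags:probWcq",
    100: "pfParticleNetDiscriminatorsJetTags:HccvsQCD",
    103: "pfParticleNetDiscriminatorsJetTags:ZbbvsQCD",
}

_INDEX = re.compile(r"pairDiscriVector__(\d+)__second")


def discriminator_for_plot(filename):
    """Return the discriminator name for a comparison plot file, if known."""
    match = _INDEX.search(filename)
    if match is None:
        return None
    return PAIR_DISCRI_NAMES.get(int(match.group(1)))


for name in [
    "patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second.png",
    "patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__100__second312.png",
    "patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__103__second315.png",
]:
    print(name, "->", discriminator_for_plot(name))
```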
Showing up again in 1325.81, 136.731 in #35216
IIUC, this issue is now in the state of a "known feature". The differences appear somewhat regularly, depending on whether the machines used for the baseline and for the PR test run with AVX or AVX2.
I wanted to explicitly put the discriminator names and workflows here so they can be found with a github issue search. In #35216, it took me a bit to be sure all the differences are really from this ONNX feature.
related to #36552
Shows up in reco comparison of 1325.81 for
nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu
Noticed first in https://github.com/cms-sw/cms-bot/pull/1456#issuecomment-750900565