cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Non-reproducibility in DeepTau in 1325.81 #32628

Open makortel opened 3 years ago

makortel commented 3 years ago

Shows up in the reco comparison of 1325.81 for nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu

Noticed first in https://github.com/cms-sw/cms-bot/pull/1456#issuecomment-750900565

cmsbuild commented 3 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 years ago

assign reconstruction, xpog

cmsbuild commented 3 years ago

New categories assigned: xpog,reconstruction

@slava77,@fgolf,@mariadalfonso,@perrotta,@jpata,@gouskos you have been requested to review this Pull request/Issue and eventually sign? Thanks

slava77 commented 3 years ago

Noticed first in cms-sw/cms-bot#1456 (comment)

more recently in https://github.com/cms-sw/cmssw/pull/32622#issuecomment-757984277

slava77 commented 3 years ago

@swozniewski @mbluj

swozniewski commented 3 years ago

Clicking through your messages, the difference always looks the same (unless the image was copied). Do I understand correctly and does it fit your observations so far that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference?

swozniewski commented 3 years ago

@kandrosov @lwezenbe fyi and in case you have any ideas spontaneously

makortel commented 3 years ago

Do I understand correctly and does it fit your observations so far that if there is an irreproducibility, it has only two outcomes, i.e. the observed diff and the reference?

I believe we have so far seen the difference twice (cms-sw/cms-bot#1456 (comment) and https://github.com/cms-sw/cmssw/pull/32622#issuecomment-757984277), and indeed the difference looks the same in both (the images were not copied between the two PRs). It could be that there are only two possible outcomes, but with so few occurrences it is hard to say.

slava77 commented 3 years ago

Looking at the logs for https://github.com/cms-sw/cmssw/pull/32622#issuecomment-757984277, the tensorflow/core/platform/cpu_feature_guard.cc... message, which IIRC reports that the CPU capabilities are not fully utilized, differs between the two:

@smuzaffar @mrodozov do you know if there is a way to find out which nodes were used to run the runTheMatrix jobs (and what their architectures were)?

smuzaffar commented 3 years ago

@slava77 , both baseline and PR tests ran on cmsbuild machines

All these VMs are identical and support SSE4.1 SSE4.2 AVX AVX2 FMA. If the issue is with TF then it could be the VM where the TF external was built. We do have an old physical machine, vocms0315, without AVX2, and that might have been used to build TF.

Do we see the differences if we run multiple times using the same IB?

slava77 commented 3 years ago

All these VMs are identical and support SSE4.1 SSE4.2 AVX AVX2 FMA. If the issue is with TF then it could be the VM where the TF external was built. We do have an old physical machine, vocms0315, without AVX2, and that might have been used to build TF.

It's odd, I thought that the warning message from TF was showing the flags present on the node where it's executed. The warning for 08-1100 set of tests showed that the PR test node did not have AVX2 FMA. I'm not sure I understand how the TF build can affect anything, since it's supposedly the same in the PR and baseline cases with differences made using 08-1100.
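One way to settle which instruction sets a given test node actually exposes, independently of what TF prints, is to read the kernel's own report. The following is a minimal sketch, assuming a Linux x86 node where `/proc/cpuinfo` has a `flags` line; the helper names `cpu_flags` and `report` are hypothetical and not part of CMSSW or cms-bot.

```python
from pathlib import Path

# Instruction sets named in the TF message, in /proc/cpuinfo spelling.
FLAGS_OF_INTEREST = ("sse4_1", "sse4_2", "avx", "avx2", "fma")


def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text):
    """Map each flag of interest to True/False for this node."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: Linux node
    if path.exists():
        print(report(path.read_text()))
```

Running this at the start of each runTheMatrix job and keeping the output next to the job log would make it possible to tell afterwards whether the baseline and the PR test ran on nodes with the same capabilities.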

makortel commented 3 years ago

Here is another example https://github.com/cms-sw/cmssw/pull/32782#issuecomment-771978588.

If still relevant, the TF warning line is

2021-02-02 19:41:40.193960: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA

makortel commented 3 years ago

AddressSanitizer reports a stack-buffer-overflow in DeepTauId::fillGrids() (see #32837), could that be the cause for this non-reproducibility? (answer: "no")

makortel commented 3 years ago

Here is another example https://github.com/cms-sw/cmssw/pull/32947#issuecomment-781730246

mariadalfonso commented 3 years ago

@swozniewski @kandrosov @lwezenbe @mbluj is there a better understanding of this ?

swozniewski commented 3 years ago

I'm not aware of any news from the TauPOG side about this. From this and the linked threads, the consensus seemed to be that the issue is related to the interplay between TF and the hardware, so I didn't feel we could do much about it.

vlimant commented 3 years ago

What are the different hardware configurations leading to different outcomes through TF? Can this be reproduced "stand-alone" (i.e. without CMSSW)?

mrodozov commented 3 years ago

see this https://github.com/cms-sw/cmssw/issues/33180 and this https://github.com/cms-sw/cmssw/issues/33442 if it helps although it's not strictly related to this workflow

slava77 commented 3 years ago

there is one more case in #33706 https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-87155f/15037/summary.html

curiously, this time the change shows up also in particleNetMD, which is based on ONNX. Based on jenkins details the baseline here was running on cmsbuild73 (SSE4.1 SSE4.2 AVX AVX2 FMA), vs the PR test on cms-vocms0315 (SSE4.1 SSE4.2 AVX).

@hqucms is the ONNX in this case making a distinction between AVX and AVX2/FMA and running different methods?
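The comparison of instruction sets between two test logs can be scripted rather than read by eye. This is a rough sketch only: `instructions_from_tf_line` is a hypothetical helper, the log line is the `cpu_feature_guard.cc` message quoted earlier in this thread (used purely as a format sample, from a different PR), and the second set is typed in from the list quoted above for cms-vocms0315.

```python
def instructions_from_tf_line(line):
    """Extract the names listed after 'operations:' in a cpu_feature_guard line."""
    _, found, tail = line.partition("operations:")
    return set(tail.split()) if found else set()


# Format sample: the TF message quoted earlier in this thread.
TF_LINE = (
    "2021-02-02 19:41:40.193960: I tensorflow/core/platform/cpu_feature_guard.cc:142] "
    "This TensorFlow binary is optimized with oneAPI Deep Neural Network Library "
    "(oneDNN)to use the following CPU instructions in performance-critical "
    "operations:  SSE4.1 SSE4.2 AVX AVX2 FMA"
)

baseline = instructions_from_tf_line(TF_LINE)
# Instruction sets quoted above for the PR test node (cms-vocms0315).
pr_test = set("SSE4.1 SSE4.2 AVX".split())

only_in_baseline = sorted(baseline - pr_test)
print(only_in_baseline)
```

A non-empty difference flags a comparison in which the two sides may have taken different numerical code paths.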

hqucms commented 3 years ago

is the ONNX in this case making a distinction between AVX and AVX2/FMA and running different methods?

@slava77 Yes, ONNXRuntime has different kernels for AVX and AVX2.
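To illustrate why an AVX2/FMA kernel and a plain AVX kernel can disagree in the last bits, here is a generic sketch of fused versus unfused multiply-add. It does not reproduce ONNXRuntime's or TF's actual kernels: the fused operation is emulated with exact rational arithmetic and a single final rounding, and the input values are chosen by hand to expose the effect.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Unfused: the product is rounded to double, then the sum is rounded again."""
    return a * b + c


def fused_mul_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Hand-picked inputs: the exact product is 1 - 2**-54, which is not a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

unfused = mul_then_add(a, b, c)  # product rounds to 1.0, so the sum is 0.0
fused = fused_mul_add(a, b, c)   # keeps the -2**-54 term

print(unfused, fused, unfused == fused)
```

Differences of this size are harmless in a single operation, but inside a deep network they can propagate into visible shifts in a discriminator output, which would be consistent with the small, repeatable diffs reported here.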

jpata commented 3 years ago

Showing up again in 1325.81, 136.731 in https://github.com/cms-sw/cmssw/pull/35216

all_OldVSNew_TTbar13nanoEDM106Xv1in2017wf1325p81
  nanoaodFlatTable_fatJetTable__DQM_obj_floats__particleNetMD_QCD_100.png
  nanoaodFlatTable_tauTable__DQM_obj_floats__rawDeepTau2017v2p1VSmu_442.png
all_mini_OldVSNew_RunSinglePh2016Bwf136p731
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second.png
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__73__second285.png
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__100__second312.png
  patJets_slimmedJetsAK8__reRECO_obj___pairDiscriVector__103__second315.png

Dumping the pairDiscri names:

73 pfParticleNetJetTags:probWcq
100 pfParticleNetDiscriminatorsJetTags:HccvsQCD
103 pfParticleNetDiscriminatorsJetTags:ZbbvsQCD

slava77 commented 3 years ago

Showing up again in 1325.81, 136.731 in #35216

IIUC, this issue is in the state of a "known feature" now. The differences appear somewhat regularly, depending on whether the machines used for the baseline and for the comparison run with AVX or AVX2.

jpata commented 3 years ago

I wanted to explicitly put the discriminator names and workflows here so they can be found with a github issue search. In #35216, it took me a bit to be sure all the differences are really from this ONNX feature.

vlimant commented 1 year ago

related to #36552