cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Exception instead of LogError in DeepTauId #40733

Open VinInn opened 1 year ago

VinInn commented 1 year ago

Here, https://cmssdt.cern.ch/dxr/CMSSW/source/RecoTauTag/RecoTau/plugins/DeepTauId.cc#1296, a LogError should be issued instead of what is effectively a job abort.
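
For reference, a minimal sketch of the two behaviours, assuming the check at the linked line guards a classifier score pred that should lie in [0,1] (the variable name and the exact condition are taken from the discussion below, not verified against the file):

#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/Utilities/interface/Exception.h"

// Current behaviour (roughly): an out-of-range score aborts the whole job.
if (!(pred >= 0.f && pred <= 1.f)) {
  throw cms::Exception("DeepTauId") << "invalid prediction = " << pred;
}

// Requested behaviour: report the problem, but let the job continue.
if (!(pred >= 0.f && pred <= 1.f)) {
  edm::LogError("DeepTauId") << "invalid prediction = " << pred;
}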

cmsbuild commented 1 year ago

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign reconstruction

cmsbuild commented 1 year ago

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

Duplicate of #40437 and #28358

I leave it to @cms-sw/reconstruction-l2 to decide whether to continue the discussion in #40437 or here.

mandrenguyen commented 1 year ago

@mbluj @azotz Can you please have a look?

mandrenguyen commented 1 year ago

type tau

mbluj commented 1 year ago

The value of the pred variable should lie in the [0,1] range by construction. A value outside of this range is a sign of some processing issue, most probably memory corruption as reported in https://github.com/cms-sw/cmssw/issues/40437 and https://github.com/cms-sw/cmssw/issues/28358. Unfortunately the issue is not reproducible. Anyway, simply replacing the exception with a LogError does not seem to be the best solution, as it could allow a nonsensical value (possibly a NaN) to be passed on to further processing. If it is mandatory to issue a LogError, one should also set pred to some safe default (-1, 0?), which effectively means that a tau candidate with this value of pred is rejected at analysis level. @kandrosov what do you think?
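
A minimal sketch of what this suggestion would look like, assuming the check is done where pred is computed; the sentinel value is illustrative only:

// Report the bad score and replace it with a safe sentinel so that downstream
// code never sees a NaN or out-of-range value. Note that the negated range
// check also catches NaN, since comparisons with NaN are false.
constexpr float kInvalidScore = -1.f;  // illustrative "reject at analysis level" default

if (!(pred >= 0.f && pred <= 1.f)) {
  edm::LogError("DeepTauId") << "invalid prediction = " << pred
                             << "; setting it to " << kInvalidScore;
  pred = kInvalidScore;
}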

kandrosov commented 1 year ago

Yes, I have commented in #28358: if the condition is not satisfied, it means that there has been some memory corruption, so any computation after that point shouldn't be trusted. Therefore, simply reporting with a LogError is not sufficient. It was confirmed in some private productions that after the first event with a NaN, if the exception is not thrown, all subsequent events will have NaNs in the deepTau scores.

If you consider aborting the whole CMSSW job for this an overkill, a possible solution is to disable the modules that can potentially be affected by this memory corruption. E.g. after the condition above occurs, all tau-related modules are disabled for all subsequent events, and instead of the tau collection a flag indicating that the tau sequences were disabled is stored in the output ROOT file. However, I don't know if such a mechanism is implemented within CMSSW.
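
A rough sketch of this "latch and disable" idea in plain C++, not an existing CMSSW facility (as noted above, it is unclear whether one exists); names are illustrative:

#include <cmath>

// Illustrative guard: once a corrupted score is seen, stop trusting the
// producer's own output and flag every later event as invalid.
class TauScoreGuard {
public:
  bool scoresValid() const { return !corrupted_; }

  // Returns true if the score can be used; after the first bad value all
  // subsequent calls return false without further computation.
  bool check(float pred) {
    if (corrupted_)
      return false;
    if (std::isnan(pred) || pred < 0.f || pred > 1.f) {
      corrupted_ = true;  // latch: possible memory corruption upstream
      return false;
    }
    return true;
  }

private:
  bool corrupted_ = false;
};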

VinInn commented 1 year ago

It would be useful to have full details of the hardware where the error occurred. If there is a real hardware issue, and nothing specifically related to the code used by DeepTau, it could affect many other parts of any kind of job running there.

We may have to develop and deploy some sort of ad-hoc test, or just ask sites to regularly run memory checks.

VinInn commented 1 year ago

My understanding is that TensorFlow is used in DeepTauId. TensorFlow is a fat library: it uses different code on different hardware. I wonder if there has been any test comparing its results on an SSE-only machine, AVX, AVX2 and AVX512. If yes, which SSE-only machine was used?

kandrosov commented 1 year ago

@VinInn hm.. I just got one of these errors in one of my private nano production jobs. It seems that the NaN condition is now occurring more frequently than in the past. Here is an example log file. Do you think we can extract something useful from it about the hardware?

One interesting observation: it was a crab job with a custom executable that consecutively runs a CMSSW instance for each input file. This particular job had two input files, and for both CMSSW crashed on the first event due to the NaN problem. So the behaviour seems to be persistent on that worker node.

VinInn commented 1 year ago

HERE WE GO.

== CMSSW: 2023-02-15 13:57:17.162144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2

I bet we never tested on SSE-only machines... My suspicion is a wrong unaligned load...

makortel commented 1 year ago

The log also happens to contain the exact processor model

Current processor: model name   : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

in case that would be useful.

VinInn commented 1 year ago

https://ark.intel.com/content/www/fr/fr/ark/products/48768/intel-xeon-processor-e5645-12m-cache-2-40-ghz-5-86-gts-intel-qpi.html

VinInn commented 1 year ago

I failed to find an SSE-only machine at CERN. If anybody can get their hands on one of those, it would be nice to run valgrind and/or other "memory" tools.

kandrosov commented 1 year ago

I found one node with Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. It should also be SSE-only. I ran the nano sequence a few times on the same input file as in the log above, but it always runs successfully without any sign of NaNs... Here is the valgrind output.

VinInn commented 1 year ago

@kandrosov: would it be easy for you to check that the results of deepTauId are identical on this machine and on an AVX2/AVX512 one?
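
In case it helps, a self-contained sketch of such a check, assuming the scores are dumped to plain text files (one value per line) on each machine; the file names are hypothetical:

#include <cmath>
#include <cstdio>
#include <fstream>

int main() {
  // Hypothetical dumps of deepTauId scores from the two machines.
  std::ifstream sse("scores_sse.txt"), avx("scores_avx512.txt");
  double a = 0, b = 0, maxDiff = 0;
  long n = 0, nDiff = 0;
  while (sse >> a && avx >> b) {
    ++n;
    const double diff = std::abs(a - b);
    if (diff > 0)
      ++nDiff;
    if (diff > maxDiff)
      maxDiff = diff;
  }
  std::printf("compared %ld scores: %ld differ, max |diff| = %g\n", n, nDiff, maxDiff);
  return 0;
}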

VinInn commented 1 year ago

it's full of

  <what>Conditional jump or move depends on uninitialised value(s)</what>
  <stack>
    <frame>
      <ip>0x5174599</ip>
      <obj>/cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-bf41b0420bc269850b74e23486e2953a/lib/libHist.so</obj>
      <fn>TStorage::UpdateIsOnHeap(unsigned int const volatile&amp;, unsigned int volatile&amp;)</fn>

is it a false positive?

the rest is sometimes scary but clearly in python or cling...

Dr15Jones commented 1 year ago

It’s a false positive. In the ROOT release area they have a filter file one can use with valgrind to ignore those messages.

VinInn commented 1 year ago

I'm running

valgrind --tool=memcheck --leak-check=full --suppressions=${ROOTSYS}/etc/valgrind-root.supp

and btw I also see many

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
==78542== Use of uninitialised value of size 8
==78542==    at 0x6DB543F: sinf (in /usr/lib64/libm-2.17.so)
==78542==    by 0x52ED2ECA: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x64F7A91: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/xz/5.2.5-d6fed2038c4e8d6e04531d1adba59f37/lib/liblzma.so.5.2.5)

some improper delete

==78542== Invalid free() / delete / delete[] / realloc()
==78542==    at 0x4030F5C: free (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==78542==    by 0x52EC1311: mkfit::ExecutionContext::~ExecutionContext() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

and a large number of memory leaks in tensorflow....

slava77 commented 1 year ago

and btw I also see many

==78542== Conditional jump or move depends on uninitialised value(s)
==78542==    at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)

IIUC, this was fixed in #40105 in 12_6_X (merged in pre5, apparently). At least the fixed cases were harmless and were happening in parts of the fixed-size loop that were discarded downstream.

VinInn commented 1 year ago

Thanks @slava77, I was suspecting something like that (and sorry, I'm still running an old benchmark (the HEPiX benchmark)).

slava77 commented 1 year ago

The NaN in deepTau could be blocking the rereco of 2022: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022B-DoubleMuonLowMass-10Dec2022-00001