A new Issue was created by @VinInn Vincenzo Innocente.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction
New categories assigned: reconstruction
@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
Duplicate of #40437 and #28358
I leave it to @cms-sw/reconstruction-l2 to decide whether to continue the discussion in #40437 or here
@mbluj @azotz Can you please have a look?
type tau
The value of the pred variable should lie in the [0,1] range by construction. A value outside of this range is a sign of some processing issue, most probably some memory corruption as reported in https://github.com/cms-sw/cmssw/issues/40437 and https://github.com/cms-sw/cmssw/issues/28358. Unfortunately the issue is not reproducible.
Anyway, simply replacing the exception with a LogError does not seem to be the best solution, as it means a nonsense value (possibly NaN) is passed on to further processing. If it is mandatory to issue a LogError, one should also set pred to some safe default (-1, 0?), which effectively means that a tau candidate with this value of pred is rejected at analysis level. @kandrosov what do you think?
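For concreteness, a minimal sketch of what such a replacement could look like (illustrative only, not the actual DeepTauId.cc code; the helper name and the tau_index/pred_index arguments are assumptions):

#include "FWCore/MessageLogger/interface/MessageLogger.h"

// Hedged sketch: report the out-of-range score and return a sentinel instead of
// throwing, so that analysis-level selections on [0,1] drop the candidate.
float sanitizePrediction(float pred, int tau_index, int pred_index) {
  if (!(pred >= 0.f && pred <= 1.f)) {  // a NaN also fails both comparisons
    edm::LogError("DeepTauId") << "invalid prediction = " << pred
                               << " for tau_index = " << tau_index
                               << ", pred_index = " << pred_index
                               << "; setting it to the default value -1";
    return -1.f;  // safe default discussed above
  }
  return pred;
}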
Yes, I have commented in #28358: if the condition is not satisfied it means that there has been some memory corruption, so any computation after that point shouldn't be trusted. Therefore, simply reporting a LogError is not sufficient. It was confirmed in some private productions that after the first event with NaN, if the exception is not thrown, all consecutive events will have NaNs in the deepTau scores.
If you consider aborting the whole CMSSW job for that an overkill, a possible solution is to disable the modules that can potentially be affected by this memory corruption. E.g. after the condition above has occurred, all tau-related modules are disabled for all consecutive events, and instead of the tau collection a flag indicating that the tau sequences are disabled is stored in the output root file. However, I don't know if such a mechanism is implemented within CMSSW.
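As a rough, framework-agnostic sketch of this "latch and disable" idea (not an existing CMSSW mechanism; all names are hypothetical), the flag would be set once and then consulted for every following event:

#include <atomic>
#include <iostream>
#include <vector>

// Once an out-of-range score is seen, the flag stays set for all consecutive events.
std::atomic<bool> tauSequenceDisabled{false};

// Returns true if the scores are usable; otherwise latches the disabled flag.
bool validateTauScores(const std::vector<float>& scores) {
  if (tauSequenceDisabled.load())
    return false;  // a previous event already triggered the condition
  for (float s : scores) {
    if (!(s >= 0.f && s <= 1.f)) {      // NaN also fails this test
      tauSequenceDisabled.store(true);  // disable the tau sequence from now on
      return false;
    }
  }
  return true;
}

int main() {
  std::cout << validateTauScores({0.1f, 0.9f}) << '\n';  // 1: usable
  std::cout << validateTauScores({2.5f}) << '\n';        // 0: latches the flag
  std::cout << validateTauScores({0.5f}) << '\n';        // 0: still disabled
}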
It would be useful to have the full details of the hardware where the error occurred. If there is a real hardware issue and nothing specifically related to the code used by DeepTau, it could affect many other parts of any kind of job running there.
We may have to develop and deploy some sort of ad-hoc test, or just ask sites to regularly run memory checks.
My understanding is that TensorFlow is used in DeepTauId. TensorFlow is a fat library: it uses different code on different hardware. I wonder if there has been any test to compare its results on an SSE-only machine, AVX, AVX2 and AVX512. If yes, which SSE-only machine has been used?
@VinInn hm.. I just got one of these errors in some of my private nano production jobs. It seems that the NaN condition is now occurring more frequently with respect to the past. Here is an example log file. Do you think we can extract something useful from it about the hardware?
One interesting observation: it was a crab job with a custom executable that consecutively runs CMSSW instances for each input file. This particular job had two input files, and for both cmssw crashed on the first event due to the NaN problem. So the behaviour seems to be persistent on that worker node.
HERE WE GO.
== CMSSW: 2023-02-15 13:57:17.162144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2
I bet we never tested on SSE-only machines.... My suspicion is a wrong unaligned load...
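To illustrate what a wrong unaligned load means here (a generic SSE example, not code from TensorFlow or DeepTau): _mm_load_ps requires a 16-byte aligned address, so an SSE-only code path using it on an unaligned pointer is undefined behaviour, whereas unaligned data must go through _mm_loadu_ps.

#include <immintrin.h>
#include <cstdio>

int main() {
  alignas(16) float buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  __m128 ok = _mm_load_ps(buf);            // 16-byte aligned: well defined
  __m128 also_ok = _mm_loadu_ps(buf + 1);  // unaligned data needs the loadu variant
  // _mm_load_ps(buf + 1) would be undefined behaviour (misaligned address),
  // i.e. the kind of silent problem suspected above on SSE-only code paths.
  float out[4];
  _mm_storeu_ps(out, ok);
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
  _mm_storeu_ps(out, also_ok);
  std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
  return 0;
}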
The log also happens to contain the exact processor model
Current processor: model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
in case that would be useful.
I failed to find an SSE-only machine at CERN. If anybody can get their hands on one of those, it would be nice to run valgrind and/or other "memory" tools.
I found one node with Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. It should also be SSE-only. I ran the nano sequence a few times on the same input file as in the log above, but it always runs successfully without any sign of NaNs... Here is a valgrind output.
@kandrosov : would it be easy for you to check that the results of deepTauId are identical on this machine and on an AVX2/AVX512 one?
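One way such a check could be done, assuming the deepTau scores are dumped to plain text files on each machine (the file names and format here are purely hypothetical):

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <fstream>

// Hedged sketch: read two score dumps (one per machine) and report how many
// scores differ and the largest absolute difference.
int main() {
  std::ifstream sse("scores_sse.txt"), avx("scores_avx2.txt");
  double a, b, maxDiff = 0;
  std::size_t n = 0, nDiff = 0;
  while (sse >> a && avx >> b) {
    ++n;
    double d = std::fabs(a - b);
    if (d > 0) ++nDiff;
    if (d > maxDiff) maxDiff = d;
  }
  std::printf("compared %zu scores, %zu differ, max |diff| = %g\n", n, nDiff, maxDiff);
  return 0;
}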
it's full of
<what>Conditional jump or move depends on uninitialised value(s)</what>
<stack>
<frame>
<ip>0x5174599</ip>
<obj>/cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-bf41b0420bc269850b74e23486e2953a/lib/libHist.so</obj>
<fn>TStorage::UpdateIsOnHeap(unsigned int const volatile&, unsigned int volatile&)</fn>
is it a false positive?
the rest is sometimes scary but clearly in python or cling...
It’s a false positive. In the ROOT release area they have a filter file one can use with valgrind to ignore those messages.
I'm running
valgrind --tool=memcheck --leak-check=full --suppressions=${ROOTSYS}/etc/valgrind-root.supp
and btw I also see many
==78542== Conditional jump or move depends on uninitialised value(s)
==78542== at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
==78542== Use of uninitialised value of size 8
==78542== at 0x6DB543F: sinf (in /usr/lib64/libm-2.17.so)
==78542== by 0x52ED2ECA: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
==78542== Conditional jump or move depends on uninitialised value(s)
==78542== at 0x64F7A91: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/xz/5.2.5-d6fed2038c4e8d6e04531d1adba59f37/lib/liblzma.so.5.2.5)
some improper delete
==78542== Invalid free() / delete / delete[] / realloc()
==78542== at 0x4030F5C: free (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==78542== by 0x52EC1311: mkfit::ExecutionContext::~ExecutionContext() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
and a large number of memory leaks in tensorflow....
> and btw I also see many
> ==78542== Conditional jump or move depends on uninitialised value(s) ==78542== at 0x52ED3C34: mkfit::helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 4> const&, Matriplex::Matriplex<int, 1, 1, 4> const&, Matriplex::Matriplex<float, 1, 1, 4> const&, Matriplex::Matriplex<float, 6, 1, 4>&, Matriplex::Matriplex<float, 6, 6, 4>&, Matriplex::Matriplex<int, 1, 1, 4>&, int, mkfit::PropagationFlags) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_0/lib/slc7_amd64_gcc10/libRecoTrackerMkFitCore.so)
IIUC, this was fixed in #40105 in 12_6_X (merged in pre5, apparently). At least the fixed cases were harmless and were happening in parts of the fixed-size loop that were discarded downstream.
Thanks @slava77, I was suspecting something like that (and sorry if I'm still running an old benchmark (the hepix benchmark)).
The NaN in deepTau could be blocking the rereco of 2022: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022B-DoubleMuonLowMass-10Dec2022-00001
Here https://cmssdt.cern.ch/dxr/CMSSW/source/RecoTauTag/RecoTau/plugins/DeepTauId.cc#1296 a LogError should be issued, not what is effectively a job abort.