`DeepTauId` failures in RelVals (`Incompatible shapes`)

AdrianoDee commented 6 months ago

Running RelVals, we are observing some failures due to a TensorFlow exception coming from the DeepTauId module. Some examples are listed here.

1) 2023 Data reHLT + reRECO

In HLTDR3_2023 step in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in 14_0_0_pre3 RelVals

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 11 event: 22076365 stream: 0
[1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

with the config here, that is what we get from wf 141.035 running L1REPACK:Full,HLT:@relval2024 (HLT pointing at GRun here). The error here. The wf on Stats2.

Also in the same step in 13_3_0_pre5 RunDisplacedJet2023C in a different path (HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6) run in HLT:@relval2023. The error here. The wf on Stats2.

2) 2022 Data reHLT + reRECO

Much rarer: in the AODNANORUN3_reHLT_2022 step, in deepTau2017v2p1ForMini, in RunJetMET2022D with 14_0_0. The error here. The wf on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 357735 lumi: 20 event: 32782226 stream: 0
[1] Running path 'NANOEDMAODoutput_step'
[2] Prefetching for module PoolOutputModule/'NANOEDMAODoutput'
[3] Prefetching for module SimpleCandidateFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[13] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

3) MC 2023

In DigiPU_2023PU step in hltHpsPFTauDeepTauProducer in RelValTenTau_15_500 with 13_3_0_pre1 (at the moment the first occurrence I found). The error here. The wf on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 18 event: 1707 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

CPU

At the moment it appears that in all cases the jobs were running on Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), Cascade Lake (see https://github.com/cms-sw/cmssw/issues/44333#issuecomment-1983672263).

cmsbuild commented 6 months ago

cms-bot internal usage

cmsbuild commented 6 months ago

A new Issue was created by @AdrianoDee.

@Dr15Jones, @antoniovilela, @smuzaffar, @makortel, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

AdrianoDee commented 6 months ago

assign hlt

AdrianoDee commented 6 months ago

assign pdmv

cmsbuild commented 6 months ago

New categories assigned: hlt,pdmv

@Martin-Grunewald,@mmusich,@AdrianoDee,@sunilUIET,@miquork you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 6 months ago

@cms-sw/tau-pog-l2 FYI

mmusich commented 6 months ago

type tau

mmusich commented 6 months ago

just as an observation this path is not new (first included in the GRun menu in 2022, https://its.cern.ch/jira/browse/CMSHLT-2289)

EDIT but was touched recently in https://its.cern.ch/jira/browse/CMSHLT-3052

mmusich commented 6 months ago

@cms-sw/pdmv-l2

In data reHLT+reRECO RelVals we are observing some failures at HLTDR_2023 step in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7

Please help filling in some information:

In which release is this happening?
Is it reproducible?
Does it affect all jobs of the relvals?
Is there a pattern w.r.t. the CPU microarchitecture of the node on which the job lands?

Martin-Grunewald commented 6 months ago

I can't find it in the Dashboard. Since it is labelled HLTDR_2023, and the path in question is not in the Fake* menus, it must be in some 13_X release running the actual 2023 HLT with the 2023 version of that path.

AdrianoDee commented 6 months ago

Quick answers:

this happened both in 14_0_0_pre3 and 14_0_0, but I'm tracing it back to older releases (I'll come back as soon as I find the first occurrence);
it happens only on a fraction of the jobs, and the fraction itself is quite random (it fluctuates around a few percent of the events failing).

For the reproducibility and the CPU pattern I'll need a moment to check those.

Martin-Grunewald commented 6 months ago

Hmm well, in 14_X, HLTDR_2023 should (now) run the Fake* menus, while the real HLT menus should be within HLTDR_2024.

mmusich commented 6 months ago

in 14_X, HLTDR_2023 should (now) run the Fake* menus, while the real HLT menus should be within HLTDR_2024

Indeed the configuration linked above has L1REPACK:Full,HLT:@relval2024, but in absence of real 2024 data we're running the 2024 menu on 2023 data.

AdrianoDee commented 6 months ago

I see the same (similar) error

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 122 event: 206577729 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

in 13_3_0_pre5 RunDisplacedJet2023C running L1REPACK:Full,HLT:@relval2023.

mmusich commented 6 months ago

HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6

This is a different path, so it points to a general problem with DeepTauId (not specific to one path)

Dr15Jones commented 6 months ago

For context, it appears the exception comes from here:

https://github.com/cms-sw/cmssw/blob/ff5142841273d612645b215bc338f655fd73ed3d/PhysicsTools/TensorFlow/src/TensorFlow.cc#L272-L275

makortel commented 6 months ago

assign ml

makortel commented 6 months ago

assign reconstruction

cmsbuild commented 6 months ago

New categories assigned: ml,reconstruction

@jfernan2,@mandrenguyen,@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 6 months ago

There is also an earlier episode https://github.com/cms-sw/cmssw/issues/42862

mmusich commented 6 months ago

There is also an earlier episode https://github.com/cms-sw/cmssw/issues/42862

That was affecting only phase2 workflows and got fixed by https://github.com/cms-sw/cmssw/pull/43855

makortel commented 6 months ago

error

Following the links there leads to https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_RVCMSSW_13_3_0_pre5RunDisplacedJet2023C__Data_2023_RelVal_2023C_231107_154737_5273/8001/HLTDR3_2023/04130c52-d023-4ad4-8e5d-5dbecdb27cab-106-0-logArchive/ according to which the job was run on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, which is Cascade Lake. It has the same AVX512F and AVX512_VNNI features that somehow seemed to play a role in https://github.com/cms-sw/cmssw/issues/42862.
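To check whether a given worker node exposes the two CPU features mentioned above, something along these lines could be used. This is a minimal sketch, assuming a Linux node where `/proc/cpuinfo` lists the features as `avx512f` and `avx512_vnni`; the function names are made up for illustration and are not part of CMSSW.

```python
# Flags discussed in this thread, spelled the way Linux reports them in /proc/cpuinfo.
FLAGS_OF_INTEREST = ("avx512f", "avx512_vnni")


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # All cores normally report the same flags, so the first entry is enough.
            return set(value.split())
    return set()


def has_suspect_features(cpuinfo_text: str) -> bool:
    """True if every flag in FLAGS_OF_INTEREST is present."""
    flags = parse_cpu_flags(cpuinfo_text)
    return all(flag in flags for flag in FLAGS_OF_INTEREST)


# Usage on a Linux node (hypothetical helper names, see above):
#   with open("/proc/cpuinfo") as f:
#       print(has_suspect_features(f.read()))
```

A result of `True` only says that the node has the same features as the failing ones; it does not by itself show that the job would fail there.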

AdrianoDee commented 6 months ago

A few additional things I found out after investigating:

it appears to happen also on MC but is visible only in PU samples so it's not that easy to reproduce for MCs. I'm trying to reproduce it for Data.
~from what I see the first occurrence was in 13_3_0_pre5~
MC wf failing in 13_3_0_pre5 pdmvserv_RVCMSSW_13_3_0_pre5TenTau_15_500_231127_105150_2624 with

An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 26 event: 2570 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v4'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

in this case (and also in all the others I found) the job was run on Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), still Cascade Lake.
~the possible culprit may be https://github.com/cms-sw/cmssw/pull/43100 included in 13_3_0_pre5 but I haven't done much investigative effort there. I don't see any updates to the TF backend there that would justify this.~
Update: this was already there in 13_3_0_pre1 (the unified logs).

kandrosov commented 6 months ago

Just to add another piece of information, I see many similar errors in my private HLT rerun with CMSSW_14_0_0. However, the error occurs in hltL2TauTagNNProducer, which runs another TF-based tau tagger whose code has not changed for one year (even more, if we ignore minor commits that do not affect functionality). For example:

== CMSSW: 2024-03-09 09:00:02.226048: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: scale must have the same number of elements as the channels of x, got 80 and 31
== CMSSW:    [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/batch_normalization_CNN1x1_0/FusedBatchNormV3}}]]
== CMSSW: ----- Begin Fatal Exception 09-Mar-2024 09:00:04 CET-----------------------
== CMSSW: An exception of category 'InvalidRun' occurred while
== CMSSW:    [0] Processing  Event run: 369870 lumi: 219 event: 67715906 stream: 0
== CMSSW:    [1] Running path 'nanoAOD_step'
== CMSSW:    [2] Calling method for module L2TauNNProducer/'hltL2TauTagNNProducer'
== CMSSW: Exception Message:
== CMSSW: error while running session: INVALID_ARGUMENT: scale must have the same number of elements as the channels of x, got 80 and 31
== CMSSW:    [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/batch_normalization_CNN1x1_0/FusedBatchNormV3}}]]

I checked the CPU architectures for a few crashed jobs: Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz and Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz. It looks like a more general TF-related issue that affects not only DeepTau.

makortel commented 6 months ago

Makes me wonder if the root cause could be the same as in https://github.com/cms-sw/cmssw/issues/42444 ...

Would anyone have a simple recipe to reproduce any of these?

AdrianoDee commented 6 months ago

Makes me wonder if the root cause could be the same as in #42444 ...

Would anyone have a simple recipe to reproduce any of these?

@makortel something like this should reproduce the error in principle

cmsDriver.py step2 --conditions auto:run3_hlt_relval --data --datatier FEVTDEBUGHLT --era Run3_2023 --eventcontent FEVTDEBUGHLT --filein /store/data/Run2023C/DisplacedJet/RAW/v1/000/367/131/00000/9f3f571f-6dc9-4bda-a68b-5d1b9a5fc3ac.root --fileout file:step2.root --nStreams 4 --nThreads 8 --number 10 --process reHLT --python_filename step_2_cfg.py --step L1REPACK:Full,HLT:@relval2024 --customise_commands "process.source.skipEvents = cms.untracked.uint32(1800)"

since it would end up running the same reHLT process on top of the same Event (195390586) of the same Run (367131) for which the failure appears here. But I'm not actually able to reproduce it.

mmusich commented 6 months ago

since it would end up running the same reHLT process on top of the same Event (195390586) of the same Run (367131) for which the failure appears

since the process is run multi-threaded are you sure that the last event that leaves a message logger record is also the one crashing the process?
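One way to avoid relying on the last "Begin processing" record is to read the event from the fatal exception block itself. The sketch below assumes the log layout quoted in this thread (`[0] Processing Event run: ... event: ...` followed by the path and module lines); the function names are hypothetical.

```python
import re

# Context lines of the framework's fatal exception, as quoted in this thread, e.g.
#   [0] Processing  Event run: 367131 lumi: 117 event: 196942831 stream: 3
EVENT_RE = re.compile(
    r"\[0\]\s+Processing\s+Event\s+run:\s*(\d+)\s+lumi:\s*(\d+)"
    r"\s+event:\s*(\d+)\s+stream:\s*(\d+)"
)
PATH_RE = re.compile(r"Running path '([^']+)'")
MODULE_RE = re.compile(r"Calling method for module ([^/\s]+)/'([^']+)'")


def failing_event(log_text):
    """Return a dict describing the event named in the exception, or None."""
    event = EVENT_RE.search(log_text)
    if event is None:
        return None
    run, lumi, evt, stream = (int(g) for g in event.groups())
    info = {"run": run, "lumi": lumi, "event": evt, "stream": stream}
    # Only look at the lines that follow the event line in the exception context.
    path = PATH_RE.search(log_text, event.end())
    module = MODULE_RE.search(log_text, event.end())
    if path:
        info["path"] = path.group(1)
    if module:
        info["module_type"], info["module_label"] = module.groups()
    return info


def pick_events_spec(info):
    """Format the event as run:event, the form used with eventsToProcess."""
    return f"{info['run']}:{info['event']}"
```

The `run:event` string can then be passed to `edmCopyPickMerge` through `eventsToProcess`, so that a single-event input file contains the event that actually threw.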

mmusich commented 6 months ago

@makortel @AdrianoDee I can reproduce in the following way:

1) go on lxplus901 (in order to have a machine with Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz)

2) prepare the input file via:

edmCopyPickMerge outputFile=pickevents.root eventsToProcess=367131:196942831 inputFiles=/store/data/Run2023C/DisplacedJet/RAW/v1/000/367/131/00000/9f3f571f-6dc9-4bda-a68b-5d1b9a5fc3ac.root

3) run:

cmsDriver.py step2 --conditions auto:run3_hlt_relval --data --datatier FEVTDEBUGHLT --era Run3_2023 --eventcontent FEVTDEBUGHLT --filein file:pickevents.root --fileout file:step2.root --nStreams 4 --nThreads 8 --number -1 --process reHLT --python_filename step_2_cfg.py --step L1REPACK:Full,HLT:@relval2024 --accelerators cpu

In CMSSW_14_0_0 it crashes with [1]. Notice that in a recent IB (CMSSW_14_1_X_2024-03-10-2300) the issue seems to have disappeared.

[1]

L1REPACK:Full,HLT:@relval2024,ENDJOB
entry file:pickevents.root
Step: L1REPACK Spec: ['Full']
# L1T INFO:  L1REPACK:Full will unpack all L1T inputs, re-emulated (Stage-2), and pack uGT, uGMT, and Calo Stage-2 output.
Step: HLT Spec: ['@relval2024']
Step: ENDJOB Spec: 
Starting  cmsRun  step_2_cfg.py
# L1T INFO:  L1REPACK:Full will unpack all L1T inputs, re-emulated (Stage-2), and pack uGT, uGMT, and Calo Stage-2 output.
%MSG-i ThreadStreamSetup:  (NoModuleName) 12-Mar-2024 00:23:22 CET pre-events
setting # threads 8
setting # streams 4
%MSG
%MSG-i AlpakaService:  (NoModuleName) 12-Mar-2024 00:23:23 CET pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
%MSG
...
Begin processing the 1st record. Run 367131, Event 196942831, LumiSection 117 on stream 3 at 12-Mar-2024 00:24:02.428 CET
#--------------------------------------------------------------------------
#                         FastJet release 3.4.1
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                         
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
2024-03-12 00:24:06.899321: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: Incompatible shapes: [0,1,1,86] vs. [207]
     [[{{node inner_egamma_norm_1/FusedBatchNorm_1/Mul}}]]
----- Begin Fatal Exception 12-Mar-2024 00:24:06 CET-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 367131 lumi: 117 event: 196942831 stream: 3
   [1] Running path 'HLT_DoublePFJets40_Mass500_MediumDeepTauPFTauHPS45_L2NN_MediumDeepTauPFTauHPS20_eta2p1_v6'
   [2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,86] vs. [207]
     [[{{node inner_egamma_norm_1/FusedBatchNorm_1/Mul}}]]
----- End Fatal Exception -------------------------------------------------

AdrianoDee commented 6 months ago

since the process is run multi-threaded are you sure that the last event that leaves a message logger record is also the one crashing the process?

Thanks Marco, indeed I was forgetting this.

AdrianoDee commented 6 months ago

For the record (and my mental health), I was not able to reproduce it on an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz with my original setup in 14_0_0, even though that setup was, by chance, hitting the event 196942831 anyway

%MSG-i ThreadStreamSetup:  (NoModuleName) 11-Mar-2024 23:49:31 CET pre-events
setting # threads 8
setting # streams 4
%MSG
11-Mar-2024 23:50:13 CET  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/data/Run2023C/DisplacedJet/RAW/v1/000/367/131/00000/9f3f571f-6dc9-4bda-a68b-5d1b9a5fc3ac.root
11-Mar-2024 23:50:17 CET  Successfully opened file root://eoscms.cern.ch//eos/cms/store/data/Run2023C/DisplacedJet/RAW/v1/000/367/131/00000/9f3f571f-6dc9-4bda-a68b-5d1b9a5fc3ac.root
%MSG-w NonConsumedConditionalModules:  AfterModConstruction  11-Mar-2024 23:50:37 CET pre-events
The following modules were part of some ConditionalTask, but were not
consumed by any other module in any of the Paths to which the ConditionalTask
was associated. Perhaps they should be either removed from the
job, or moved to a Task to make it explicit they are unscheduled.

 hltPixelTracksTrackingRegions
 hltSiPixelClustersCache
 hltSiPixelClustersCacheCPUOnly
 hltSiPixelClustersFromSoA
 hltSiPixelDigisSoA
 hltSiPixelRecHitsFromGPU
 hltSiPixelRecHitsSoA
 statusOnGPU@cuda
%MSG

[...]

Begin processing the 1st record. Run 367131, Event 195019958, LumiSection 117 on stream 3 at 11-Mar-2024 23:50:48.813 CET
Begin processing the 2nd record. Run 367131, Event 196362425, LumiSection 117 on stream 0 at 11-Mar-2024 23:50:48.814 CET
Begin processing the 3rd record. Run 367131, Event 196360607, LumiSection 117 on stream 2 at 11-Mar-2024 23:50:48.816 CET
Begin processing the 4th record. Run 367131, Event 196460914, LumiSection 117 on stream 1 at 11-Mar-2024 23:50:49.206 CET
#--------------------------------------------------------------------------
#                         FastJet release 3.4.1
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                         
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
Begin processing the 5th record. Run 367131, Event 194945538, LumiSection 117 on stream 3 at 11-Mar-2024 23:50:51.795 CET
Begin processing the 6th record. Run 367131, Event 194945544, LumiSection 117 on stream 2 at 11-Mar-2024 23:50:52.004 CET
Begin processing the 7th record. Run 367131, Event 195266551, LumiSection 117 on stream 1 at 11-Mar-2024 23:50:52.028 CET
Begin processing the 8th record. Run 367131, Event 196331770, LumiSection 117 on stream 0 at 11-Mar-2024 23:50:52.104 CET
Begin processing the 9th record. Run 367131, Event 196942831, LumiSection 117 on stream 2 at 11-Mar-2024 23:50:52.774 CET
Begin processing the 10th record. Run 367131, Event 196939181, LumiSection 117 on stream 3 at 11-Mar-2024 23:50:52.866 CET
11-Mar-2024 23:50:53 CET  Closed file root://eoscms.cern.ch//eos/cms/store/data/Run2023C/DisplacedJet/RAW/v1/000/367/131/00000/9f3f571f-6dc9-4bda-a68b-5d1b9a5fc3ac.root

makortel commented 6 months ago

I can reproduce in the following way: ... In CMSSW_14_0_0 it crashes with [1]

Thanks, I was able to reproduce.

Notice that in a recent IB (CMSSW_14_1_X_2024-03-10-2300) the issue seems to have disappeared.

The reproducer succeeds also in 14_1_0_pre1.

makortel commented 6 months ago

In 14_0_0, the exception is thrown via

(gdb) where
#0  0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7fff0b218c00, tinfo=0x7ffff79a3668 <typeinfo for cms::Exception>, dest=0x7ffff796ce20 <cms::Exception::~Exception()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffbd01b989 in tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolOptions const&) [clone .cold] ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libPhysicsToolsTensorFlow.so
#2  0x00007fffbd020589 in tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolInterface*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libPhysicsToolsTensorFlow.so
#3  0x00007fff710a4924 in DeepTauId::getPartialPredictions(bool) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so
#4  0x00007fff710b0b68 in void DeepTauId::createConvFeatures<reco::PFCandidate, reco::PFTau>(reco::PFTau const&, unsigned long, edm::RefToBase<reco::BaseTau>, reco::Vertex const&, double, std::vector<pat::Electron, std::allocator<pat::Electron> > const*, std::vector<pat::Muon, std::allocator<pat::Muon> > const*, edm::View<reco::Candidate> const&, (anonymous namespace)::CellGrid const&, (anonymous namespace)::TauFunc, bool) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so
#5  0x00007fff710b3643 in void DeepTauId::getPredictionsV2<reco::PFCandidate, reco::PFTau>(reco::BaseTau const&, unsigned long, edm::RefToBase<reco::BaseTau>, std::vector<pat::Electron, std::allocator<pat::Electron> > const*, std::vector<pat::Muon, std::allocator<pat::Muon> > const*, edm::View<reco::Candidate> const&, reco::Vertex const&, double, unsigned long long const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >&, (anonymous namespace)::TauFunc) [clone .lto_priv.0] ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so
#6  0x00007fff710aa903 in DeepTauId::produce(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so
#7  0x00007ffff7e483c1 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#8  0x00007ffff7e2c04e in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so

mmusich commented 6 months ago

The reproducer succeeds also in 14_1_0_pre1.

I think this happens simply because that particular trigger path (HLT_DoublePFJets40_Mass500_MediumDeepTauPFTauHPS45_L2NN_MediumDeepTauPFTauHPS20_eta2p1_v) was removed in the meantime in https://github.com/cms-sw/cmssw/pull/44073 (14_1_X) and https://github.com/cms-sw/cmssw/pull/44074 (14_0_X). I think the reproducer would succeed in CMSSW_14_0_1 as well (but I didn't test it).

makortel commented 6 months ago

In 14_0_0

the exception occurs consistently with cmsRun, cmsRunTC, and cmsRunGlibC
running valgrind with cmsRun does not result in an exception (or any related warnings)

fwyzard commented 6 months ago

I haven't seen any comments from @cms-sw/ml-l2 , are they aware of the issue ?

valsdav commented 6 months ago

Hi all! I investigated the reproducer and I think I found the issue.

The number of valid_grid_cells here is 0 for this event and this is creating a TF::Tensor with shape [0, 1, 1, N].

In TensorFlow this is a valid tensor which has a specific shape but it is empty.

>>> import tensorflow as tf
>>> tensor = tf.zeros([0, 1, 1, 86])
>>> tensor
<tf.Tensor: shape=(0, 1, 1, 86), dtype=float32, numpy=array([], shape=(0, 1, 1, 86), dtype=float32)>
>>> tf.print(tensor)
[]

Apparently, when this input is passed to a TF model executed on a CPU without AVX512F and AVX512_VNNI, the model is executed and returns an empty output without complaining. When the AVX512F and AVX512_VNNI instructions are present, the jitting is different and the TF executor complains. Now, I'm not saying that it is understood why this happens, but this is the reason for the crash.
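As a loose illustration of what an "Incompatible shapes" complaint means, here is a pure-Python sketch of the NumPy-style broadcasting rule that element-wise operations such as `Mul` follow. It is not TensorFlow code and the function name is made up; the shapes are the ones from the reproducer log earlier in this thread. It does not explain where the `[207]` operand comes from in the failing jobs.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Align the shapes on their trailing dimensions; missing leading ones count as 1.
    for dim_a, dim_b in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"Incompatible shapes: {list(a)} vs. {list(b)}")
    return tuple(reversed(result))


# An empty batch with 86 channels combined with a per-channel vector of length 86
# is fine: the result is simply empty.
print(broadcast_shape((0, 1, 1, 86), (86,)))  # (0, 1, 1, 86)

# A vector of a different length cannot be broadcast, even though the batch is empty.
try:
    broadcast_shape((0, 1, 1, 86), (207,))
except ValueError as err:
    print(err)
```

In other words, an empty batch alone is not enough to trigger the message: the second operand must also have a length that does not match the channel count.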

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate more deeply this TF behaviour.
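The idea of such a guard can be sketched as follows. This is only an illustration in Python with hypothetical names (`is_empty`, `run_if_not_empty`), not the C++ change made in DeepTauId: the model is called only when the input tensor holds at least one element, otherwise a caller-supplied default is returned.

```python
from math import prod
from typing import Callable, Sequence


def is_empty(shape: Sequence[int]) -> bool:
    """A tensor holds no elements when any of its dimensions is zero."""
    return prod(shape) == 0


def run_if_not_empty(run_model: Callable[[], list],
                     input_shape: Sequence[int],
                     default: list) -> list:
    """Call run_model() only when the input holds at least one element."""
    if is_empty(input_shape):
        # Skip the session call entirely, e.g. for an input of shape [0, 1, 1, N].
        return default
    return run_model()
```

With an input of shape `(0, 1, 1, 86)` the model callable is never invoked, so the result no longer depends on how the backend treats empty tensors on a given CPU.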

makortel commented 6 months ago

This failure was now seen in Tier0 PromptReco https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5 .

mmusich commented 6 months ago

urgent

This failure was now seen in Tier0 PromptReco https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5

I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate more deeply this TF behaviour.

@valsdav, we have established that this issue can affect Prompt Reconstruction and (potentially, when the new nodes for the HLT farm arrive) also online trigger operations. Please prepare PRs with guards to avoid the execution of the model with empty inputs. Thank you.

Marco (as ORM)

mmusich commented 6 months ago

for the record, the proposed fixes are:

https://github.com/cms-sw/cmssw/pull/44455 (master)
https://github.com/cms-sw/cmssw/pull/44456 (14.0.X)

jfernan2 commented 6 months ago

+1 solved by https://github.com/cms-sw/cmssw/pull/44455

valsdav commented 6 months ago

+ml

Basic guards to solve the empty input problem in DeepTauId are in place, but the reason for the empty grid needs to be investigated with the Tau experts.

A more general guard for empty inputs will be added (see https://github.com/cms-sw/cmssw/issues/44481)

AdrianoDee commented 6 months ago

+pdmv (really only the reporter)

mmusich commented 6 months ago

... hlt will sign once the 14.0.X PR is merged and tested in IBs.

mmusich commented 6 months ago

but the reason for the empty grid needs to be investigated with the Tau experts.

@cms-sw/reconstruction-l2 this looks like it needs a separate issue. Can you open one?

mmusich commented 6 months ago

+hlt

no issues observed after the 14.0.X PR got merged and tested in IBs.

cmsbuild commented 6 months ago

This issue is fully signed and ready to be closed.

makortel commented 6 months ago

@cmsbuild, please close
