cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT farm crash in run 381543 #45136

Closed mmusich closed 3 months ago

mmusich commented 3 months ago

Reporting the HLT farm crashes in run 381543.

To reproduce:

(To reproduce offline, it is important to run on lxplus901, as the CPU micro-architecture matters.)

cmsrel CMSSW_14_0_7_patch1_MULTIARCHS
cd CMSSW_14_0_7_patch1_MULTIARCHS/src
cmsenv
#!/bin/bash -ex

# CMSSW_14_0_7_patch1

hltGetConfiguration run:381543 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0024_index000154_fu-c2b14-19-01_pid630325.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0056_index000226_fu-c2b14-19-01_pid630264.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0066_index000018_fu-c2b14-05-01_pid306574.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000306_fu-c2b14-39-01_pid629490.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000326_fu-c2b14-39-01_pid629490.root,\
/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000345_fu-c2b14-39-01_pid629490.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

results in:

2024-06-04 18:19:19.053636: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: scale must have the same number of elements as the channels of x, got 80 and 31
     [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/batch_normalization_CNN1x1_0/FusedBatchNormV3}}]]
----- Begin Fatal Exception 04-Jun-2024 18:19:19 CEST-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Processing  Event run: 381543 lumi: 24 event: 22910599 stream: 0
   [1] Running path 'HLT_VBF_DiPFJet45_Mjj750_PNetTauhPFJet45_L2NN_eta2p3_v3'
   [2] Calling method for module L2TauNNProducerAlpaka/'hltL2TauTagNNProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: scale must have the same number of elements as the channels of x, got 80 and 31
     [[{{node cnn_model/StatefulPartitionedCall/StatefulPartitionedCall/batch_normalization_CNN1x1_0/FusedBatchNormV3}}]]
----- End Fatal Exception -------------------------------------------------

This looks reminiscent of https://github.com/cms-sw/cmssw/issues/44333. As additional information, it looks like the crashes are happening only on the new HLT nodes, which have a different CPU micro-architecture where the AVX512F and AVX512_VNNI instructions are present. I tested that:

FYI: @cms-sw/hlt-l2 @trocino @mzarucki @trtomei
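
Since the crash seems tied to the CPU micro-architecture, it can help to confirm which instruction sets a given node advertises before trying to reproduce. Below is a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo lists `avx512f` and `avx512_vnni` in its `flags` line. The helper names (`hasCpuFlag`, `readCpuFlagsLine`, `hasAvx512Vnni`) are made up for this illustration and are not part of CMSSW.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole whitespace-separated token
// in a "flags : ..." line as printed by /proc/cpuinfo on x86 Linux.
bool hasCpuFlag(const std::string& flagsLine, const std::string& flag) {
  std::istringstream tokens(flagsLine);
  std::string token;
  while (tokens >> token) {
    if (token == flag)
      return true;
  }
  return false;
}

// Reads the first line starting with "flags" from the given file
// (default: /proc/cpuinfo). Returns an empty string if none is found.
std::string readCpuFlagsLine(const std::string& path = "/proc/cpuinfo") {
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    if (line.rfind("flags", 0) == 0)
      return line;
  }
  return {};
}

// True only if both instruction sets mentioned in this issue are advertised.
bool hasAvx512Vnni(const std::string& flagsLine) {
  return hasCpuFlag(flagsLine, "avx512f") && hasCpuFlag(flagsLine, "avx512_vnni");
}
```

Calling `hasAvx512Vnni(readCpuFlagsLine())` on a node would then indicate whether it belongs to the class of machines where the crashes were seen; on hosts without a `flags` line it simply returns false.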

cmsbuild commented 3 months ago

cms-bot internal usage

cmsbuild commented 3 months ago

A new Issue was created by @mmusich.

@rappoccio, @smuzaffar, @antoniovilela, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 3 months ago

@brallmond FYI

mmusich commented 3 months ago

type tau

makortel commented 3 months ago

assign package RecoTauTag/HLTProducers

cmsbuild commented 3 months ago

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 3 months ago

Similar error was seen earlier in https://github.com/cms-sw/cmssw/issues/44333#issuecomment-1986808313

mmusich commented 3 months ago

assign ml

cmsbuild commented 3 months ago

New categories assigned: ml

@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

valsdav commented 3 months ago

Thanks for the reproducer @mmusich, I can have a look at the TF inputs.

mmusich commented 3 months ago

Do we need the same protections as https://github.com/cms-sw/cmssw/pull/44455 in RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc? (suggestion from @missirol)

valsdav commented 3 months ago

I checked and this is indeed the case: at this point, https://github.com/cms-sw/cmssw/blob/master/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc#L735, there is a call to the inference without checking the input nTau. I have a general fix to the TensorFlow code here; I can prepare a PR tomorrow morning, as it helps protect us against this kind of problem.

This change patches the problem:

diff --git a/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc b/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc
index 9772366c6b2..91c5ceea6be 100644
--- a/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc
+++ b/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc
@@ -732,14 +732,18 @@ void L2TauNNProducerAlpaka::fillPatatracks(tensorflow::Tensor& cellGridMatrix,

 std::vector<float> L2TauNNProducerAlpaka::getTauScore(const tensorflow::Tensor& cellGridMatrix) {
-  std::vector<tensorflow::Tensor> pred_tensor;
-  tensorflow::run(L2cacheData_->session, {{inputTensorName_, cellGridMatrix}}, {outputTensorName_}, &pred_tensor);
   const int nTau = cellGridMatrix.shape().dim_size(0);
-  std::vector<float> pred_vector(nTau);
-  for (int tau_idx = 0; tau_idx < nTau; ++tau_idx) {
-    pred_vector[tau_idx] = pred_tensor[0].matrix<float>()(tau_idx, 0);
+  if (nTau == 0) {
+    return std::vector<float>();
+  } else {
+    std::vector<tensorflow::Tensor> pred_tensor;
+    tensorflow::run(L2cacheData_->session, {{inputTensorName_, cellGridMatrix}}, {outputTensorName_}, &pred_tensor);
+    std::vector<float> pred_vector(nTau);
+    for (int tau_idx = 0; tau_idx < nTau; ++tau_idx) {
+      pred_vector[tau_idx] = pred_tensor[0].matrix<float>()(tau_idx, 0);
+    }
+    
+    return pred_vector;
   }
-
-  return pred_vector;
 }

Should I open a PR for this, @mmusich?

mmusich commented 3 months ago

@valsdav, thanks for looking into this.

Should I open a PR ?

If your more general fix to the TF interface protects against this as well, then we should probably use that instead of patching client by client. Let me note that L2TauTagNNProducerAlpaka could be a class derived from a template, to avoid duplicating code from L2TauTagNNProducer. That's something for @cms-sw/tau-pog-l2 to consider. Finally, let me add that while a fix is highly desirable, we're entering a technical stop, so we don't need to push a hasty patch to avoid crashes online; we have a bit of time for a better solution.
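
To make the template idea a bit more concrete, here is a rough sketch of how shared logic could be written once and reused by two producers that differ only in their input type. Everything here is hypothetical: the class names, the `size()`/`pt(i)` interface, and the two storage layouts are assumptions for illustration, not the actual structure of L2TauTagNNProducer or L2TauTagNNProducerAlpaka.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input wrappers: each exposes size() and pt(i) over its own layout.
struct AosTracks {  // array-of-structs style storage
  struct Track {
    float pt;
  };
  std::vector<Track> tracks;
  std::size_t size() const { return tracks.size(); }
  float pt(std::size_t i) const { return tracks[i].pt; }
};

struct SoaTracks {  // struct-of-arrays style storage
  std::vector<float> pts;
  std::size_t size() const { return pts.size(); }
  float pt(std::size_t i) const { return pts[i]; }
};

// Shared logic written once against the common interface.
template <typename Tracks>
class TauFeatureFillerSketch {
public:
  // Sums the track pT; stands in for the feature-filling code that would
  // otherwise be duplicated in each producer.
  float sumPt(const Tracks& t) const {
    float sum = 0.f;
    for (std::size_t i = 0; i < t.size(); ++i)
      sum += t.pt(i);
    return sum;
  }
};

// Each concrete producer derives from the template with its own input type.
class LegacyProducerSketch : public TauFeatureFillerSketch<AosTracks> {};
class AlpakaProducerSketch : public TauFeatureFillerSketch<SoaTracks> {};
```

With this shape, a protection added to the shared code would cover both producers at once, which is the main appeal of the refactoring.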

valsdav commented 3 months ago

I still think that the TF patch should be a safety net to avoid crashes, but that the clients should check and avoid processing empty inputs. I can open a separate issue to track the "empty input protection" problem and list the packages that may be affected. In the meantime, the TF PR is coming.
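
The two layers of protection can be sketched as follows, without any TensorFlow dependency. This is only an illustration under stated assumptions: `Batch`, `InferenceFn`, `runGuarded` and `getTauScoreSketch` are hypothetical stand-ins, not the CMSSW `tensorflow::run` interface or the real `getTauScore`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a batched input tensor: one row per tau candidate.
using Batch = std::vector<std::vector<float>>;
// Hypothetical stand-in for the inference call: returns one score per row.
using InferenceFn = std::function<std::vector<float>(const Batch&)>;

// Interface-level safety net: never forward an empty batch to the model.
std::vector<float> runGuarded(const InferenceFn& infer, const Batch& batch) {
  if (batch.empty())
    return {};
  return infer(batch);
}

// Client-level check: the caller also skips inference when there are no taus,
// so it does not rely on the safety net alone.
std::vector<float> getTauScoreSketch(const InferenceFn& infer, const Batch& cellGrid) {
  const std::size_t nTau = cellGrid.size();
  if (nTau == 0)
    return {};
  return runGuarded(infer, cellGrid);
}
```

Either check alone prevents the model from seeing an empty batch; keeping both means a client that forgets its own check is still covered by the interface.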

brallmond commented 3 months ago

Hello, commenting from the Tau side as advised in the TSG meeting.

I would be in favor of having both the general protection (TF patch) that valsdav has opened another issue to implement, as well as the specific guards that were implemented previously for the DeepTau module. I think it makes sense to add the guards to the L2NN since they have worked well in the DeepTau module. If I understand correctly, neither of those sets of guards will be necessary once the TF patch is merged, but they won't hurt to have in place.

Thanks all for addressing the issue quickly.

Martin-Grunewald commented 3 months ago

@brallmond Indeed. Please provide L2NN PRs for 14_1 and 14_0.

@valsdav We'd also need a TF backport to 14_0.

missirol commented 3 months ago

For the record, this issue led to 10 HLT crashes in run-381543 and 29 HLT crashes in run-381544. With the corresponding error files, we verified that using #45145 there are no crashes in these events [*].

I understand both protections will be implemented. Certainly, HLT needs to deploy online a new release with at least one of these protections before the end of the current LHC stop (so, before Jun ~15).

[*]

#!/bin/bash -ex

# CMSSW_14_0_7_patch2_MULTIARCHS

hltGetConfiguration run:381543 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0024_index000154_fu-c2b14-19-01_pid630325.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0056_index000226_fu-c2b14-19-01_pid630264.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0066_index000018_fu-c2b14-05-01_pid306574.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000306_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000326_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0152_index000345_fu-c2b14-39-01_pid629490.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0229_index000038_fu-c2b14-19-01_pid630079.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0229_index000055_fu-c2b14-19-01_pid630079.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0269_index000056_fu-c2b14-17-01_pid587667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0269_index000108_fu-c2b14-17-01_pid587667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0274_index000072_fu-c2b14-21-01_pid586556.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0274_index000097_fu-c2b14-21-01_pid586556.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000199_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000305_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0313_index000322_fu-c2b05-22-01_pid3462225.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000005_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000006_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0383_index000096_fu-c2b14-17-01_pid587644.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0437_index000170_fu-c2b14-43-01_pid628778.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0437_index000177_fu-c2b14-43-01_pid628778.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381543/run381543_ls0502_index000042_fu-c2b14-33-01_pid627667.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0073_index000147_fu-c2b14-07-01_pid723678.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0073_index000396_fu-c2b14-25-01_pid624532.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000043_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000064_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0115_index000078_fu-c2b14-07-01_pid723823.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0178_index000310_fu-c2b14-17-01_pid626159.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0180_index000211_fu-c2b14-35-01_pid665686.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0187_index000409_fu-c2b14-15-01_pid667599.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000061_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000109_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0216_index000110_fu-c2b14-39-01_pid668710.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0272_index000144_fu-c2b14-43-01_pid675712.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0272_index000149_fu-c2b14-43-01_pid675712.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0273_index000030_fu-c2b14-37-01_pid667292.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0298_index000217_fu-c2b14-13-01_pid671154.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0298_index000221_fu-c2b14-13-01_pid671154.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0303_index000287_fu-c2b14-13-01_pid670560.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0303_index000318_fu-c2b14-13-01_pid670560.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0339_index000217_fu-c2b14-43-01_pid675735.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0339_index000237_fu-c2b14-43-01_pid675735.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0520_index000139_fu-c2b14-13-01_pid670950.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0744_index000034_fu-c2b14-43-01_pid676152.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0744_index000093_fu-c2b14-43-01_pid676152.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0799_index000298_fu-c2b14-19-01_pid669452.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0837_index000123_fu-c2b14-37-01_pid667329.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0837_index000133_fu-c2b14-37-01_pid667329.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0842_index000113_fu-c2b14-17-01_pid625748.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0842_index000124_fu-c2b14-17-01_pid625748.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0865_index000035_fu-c2b14-09-01_pid742325.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls0957_index000254_fu-c2b14-41-01_pid624662.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1059_index000063_fu-c2b14-23-01_pid666512.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1059_index000067_fu-c2b14-23-01_pid666512.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1124_index000173_fu-c2b14-23-01_pid666558.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1371_index000089_fu-c2b14-11-01_pid736600.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1431_index000139_fu-c2b14-11-01_pid736723.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1459_index000206_fu-c2b14-15-01_pid667534.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1459_index000238_fu-c2b14-15-01_pid667534.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1559_index000104_fu-c2b14-07-01_pid732989.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1559_index000111_fu-c2b14-07-01_pid732989.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1584_index000066_fu-c2b14-23-01_pid730878.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1700_index000149_fu-c2b14-19-01_pid669200.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1910_index000060_fu-c2b14-17-01_pid626082.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1910_index000073_fu-c2b14-17-01_pid626082.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls1916_index000196_fu-c2b14-19-01_pid669161.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls2174_index000141_fu-c2b14-11-01_pid737084.root,\
root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run381544/run381544_ls2174_index000145_fu-c2b14-11-01_pid737084.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

mmusich commented 3 months ago

Indeed. Please provide L2NN PRs for 14_1 and 14_0.

To speed things up (even if IMHO they're not really so necessary), I created:

and tested explicitly that the setup at https://github.com/cms-sw/cmssw/issues/45136#issuecomment-2150786504 doesn't crash for any of the error stream files for run-381543 and run-381544.

mmusich commented 3 months ago

The following fixes were implemented:

All of them are merged and will be available in the next CMSSW_14_0_X release.

mmusich commented 3 months ago

+hlt

valsdav commented 3 months ago

+ml

cmsbuild commented 3 months ago

This issue is fully signed and ready to be closed.

mmusich commented 3 months ago

@cmsbuild, please close