cms-bot internal usage
A new Issue was created by @trtomei.
@makortel, @smuzaffar, @sextonkennedy, @rappoccio, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Nota bene: the EcalRawToDigi may be a red herring, the original ELOG here mentions something completely different: http://cmsonline.cern.ch/cms-elog/1213820
The stack traces that I posted above are the ones I got from F3mon, though.
For reference, here are the log files of the 2 jobs. old_hlt_run380399_pid552137.log old_hlt_run380399_pid2367851.log
As reported in the elog above, they both contain the following.
src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77:
void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc =
alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]:
block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.
assign heterogeneous, reconstruction
New categories assigned: heterogeneous,reconstruction
@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
So in both cases the assertion in https://github.com/cms-sw/cmssw/blob/8391245a8a5febd6fdaec1e66395b67f9d2dd3db/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h#L76-L80 fails, after which the processes get stuck until they get killed 6-7 hours later (and because of the external termination, the stack traces of only "the crashed thread" are meaningless on their own).
The state of cmsRun seems similar to https://github.com/cms-sw/cmssw/issues/44769#issuecomment-2066613569: all the TBB threads are stuck acquiring locks (old_hlt_run380399_pid2367851.log has two TBB threads that don't have any work to do), and the CUDA threads are doing something and presumably holding the lock.
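For context, a minimal sketch of the invariant that assertion protects (illustrative only; the real kernel is in RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h — wv comes from the assertion message, while nVertices and zv are made-up names for illustration):
// Illustrative sketch, not the actual CMSSW code.
// After the track weights of each fitted vertex have been accumulated
// into wv[i], the kernel asserts that every vertex ended up with a
// positive total weight before using it, e.g. as the denominator of a
// weighted average; wv[i] == 0 would mean a vertex with no tracks.
for (unsigned int i = 0; i < nVertices; ++i) {
  assert(wv[i] > 0.f);  // the assertion that fails: `wv[i] > 0.f`
  zv[i] /= wv[i];       // hypothetical use of the weight for normalisation
}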
FYI @AdrianoDee
Nota bene: the EcalRawToDigi may be a red herring, the original ELOG here mentions something completely different:
@trtomei, all circumstantial evidence points to vertex reconstruction. Can you edit the title to remove the EcalRawToDigi mention?
For emphasis: external termination request means it timed out, likely due to a deadlock. The timeout signal will be delivered to thread 1, but to diagnose the deadlock we need to see the stack traces of all the threads.
Looks like this happened again. One LS (run 380624, LS 411) could not be "closed" because one process was stuck and had to be terminated manually hours later. The full log (see below) shows again
src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: [..]
block: [0,0,0], thread: [21,0,0] Assertion `wv[i] > 0.f` failed.
similarly to https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2098979035.
There should be no assert on GPU in the first place. In my opinion this is something that was already seen in the past: in standard CUDA code in CMSSW we never had the asserts enabled.
The asserts in CUDA were compiled out because they are slow; there is another thread on the subject. I understood that (by mistake) they are back since ALPAKA.
Of course the assert is still there on CPU, but it will not print
block: [0,0,0], thread: [21,0,0]
so this is coming from CUDA, and in the past it would not trigger!
Currently we do have asserts in Alpaka code, because the code is new and IMHO it's better to get meaningful errors with a well defined location than random failures.
Within alpaka code:
- assert is enabled like in host code, and so is enabled in CMSSW;
- ALPAKA_ASSERT should behave in the same way as assert - but I don't think we use it;
- ALPAKA_ASSERT_ACC can be conditionally enabled on CPU and disabled on GPU, but for the time being we keep them enabled also on GPU.
The plan is to change all assert(...) in alpaka code to ALPAKA_ASSERT_ACC(...), then measure the impact of enabling/disabling them.
I'd probably keep them enabled for a 1%-2% slowdown, disable them for a 5% slowdown.
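To illustrate the mechanism (a sketch of the pattern only, not alpaka's actual implementation; MY_ASSERT_ACC and MY_DISABLE_DEVICE_ASSERTS are made-up names):
// Sketch of an ALPAKA_ASSERT_ACC-style macro: always checked in host
// code, compiled out of device code when a build flag disables it.
#if defined(__CUDA_ARCH__) && defined(MY_DISABLE_DEVICE_ASSERTS)
#define MY_ASSERT_ACC(expr) ((void)0)  // no-op on the device
#else
#include <cassert>
#define MY_ASSERT_ACC(expr) assert(expr)
#endif
This is what makes it possible to measure the GPU cost of the asserts by flipping a single compilation flag.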
The fact that asserts may be causing deadlocks is a different matter: in principle, a failed assert in GPU code should cause an asynchronous error, and the next CUDA runtime function should detect it and fail.
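For reference, a minimal standalone sketch (plain CUDA runtime API, nothing CMSSW-specific) of how a device-side assert is expected to surface:
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(const float* wv, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    assert(wv[i] > 0.f);  // device-side assert, like the one in fitVertices.h
}

int main() {
  float* wv;
  cudaMalloc(&wv, 4 * sizeof(float));
  cudaMemset(wv, 0, 4 * sizeof(float));  // all zeros, so the assert fires
  kernel<<<1, 4>>>(wv, 4);
  // The failure is asynchronous: the next runtime call returns
  // cudaErrorAssert, and the CUDA context becomes unusable afterwards.
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}
If instead the error is never picked up, e.g. because a thread is blocked on a lock while waiting for the GPU, the job can hang the way described above.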
Do we have a way to reproduce the problem offline?
here is the old issue on CPU
Did anybody reproduce the assert offline?
This crash seems to appear again in the collision run 381067. Attaching the error log to this comment old_hlt_run381067_pid1000282.log
We were unable to reproduce the error after running the HLT config again on the error stream files.
I do not see any sign of the assert in the log from run 381067: can somebody post a script to reproduce it offline?
I do not see any sign of the assert in the log from run 381067:
it's at line 36196 in the log attached:
src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]: block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.
can somebody post a script to reproduce it offline?
@gparida if you tried to reproduce offline, can you post the script you used here?
Note that (also) in this case, the job was stuck not processing data for a while, and was eventually killed.
I tried reproducing it offline with CMSSW_14_0_7_MULTIARCHS, unsuccessfully. I used the following:
#!/bin/bash -ex
# CMSSW_14_0_7_MULTIARCHS
hltGetConfiguration run:381067 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000095_fu-c2b04-28-01_pid1000282.root,'\
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000109_fu-c2b04-28-01_pid1000282.root'&> hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log
Also, I tried to rerun the job with
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
Attaching the error log to this comment old_hlt_run381067_pid1000282.log
Umm, the log indicates the job had 992 threads spawned by OpenMP, all in state
Thread 1036 (Thread 0x7fb6e37e2700 (LWP 1012080) "cmsRun"):
#0 futex_wait (val=11872, addr=0x7fbae8882b44) at ../../../libgomp/config/linux/x86/futex.h:97
#1 do_wait (val=11872, addr=0x7fbae8882b44) at ../../../libgomp/config/linux/wait.h:67
#2 gomp_barrier_wait_end (bar=0x7fbae8882b40, state=11872) at ../../../libgomp/config/linux/bar.c:48
#3 0x00007fbb369263e8 in gomp_simple_barrier_wait (bar=0x7fbae8882b40) at ../../../libgomp/config/posix/simple-bar.h:60
#4 gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:133
#5 0x00007fbc4c0191ca in start_thread () from /lib64/libpthread.so.0
#6 0x00007fbc4bc84e73 in clone () from /lib64/libc.so.6
What is bringing in OpenMP threading?
first library in the list!
ldd RivetGeneratorInterfaceRivetInterface.so | grep gomp
libgomp.so.1 => /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02838/slc7_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libgomp.so.1 (0x00007f91271d0000)
More libraries that link it: libPhysicsToolsPatAlgos.so libRecoEgammaPhotonIdentification.so pluginGeneratorInterfaceRivetInterface_plugins.so pluginPhysicsToolsNanoAODPlugins.so pluginPhysicsToolsNanoAODRNuplePlugins.so pluginPhysicsToolsPatAlgos_plugins.so pluginPhysicsToolsPatExamples_plugins.so pluginPhysicsToolsScoutingPlugins.so, and a few more.
maybe we should set
export OMP_NUM_THREADS=1
So, using gdb and info threads, it looks to me that each CMSSW thread spawns ncpu-1 OMP threads, all sitting in
103 Thread 0x7fd468fb2700 (LWP 107236) "cmsRun" futex_wait (val=32, addr=0x7fd5051e5dc4) at ../../../libgomp/config/linux/x86/futex.h:97
one can break on gomp_thread_start and see each of them starting.
export OMP_NUM_THREADS=1 will make gomp_thread_start never be called, as gomp uses the main thread as thread 0...
I thought we already had calls in cmsRun to limit OMP threads. If we don't, we should. I do not like relying on environment variables for this.
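For reference, the programmatic equivalent is a single OpenMP call made early at startup, before any library opens its first parallel region; a minimal sketch (where exactly cmsRun would invoke it is left open):
#include <omp.h>

// Sketch: capping the OpenMP pool from code instead of the environment.
// If this runs before XGBoost (or anything else) triggers its first
// parallel region, libgomp never spawns its ncpu-1 worker threads.
void limitOpenMPThreads() { omp_set_num_threads(1); }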
The culprit seems to be XGBoost
#0 0x00007ffff51a8e65 in clone () from /lib64/libc.so.6
#1 0x00007ffff553b08f in create_thread () from /lib64/libpthread.so.0
#2 0x00007ffff553ca4b in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x00007fff724f9a10 in gomp_team_start (
fn=fn@entry=0x7fff72922630 <void xgboost::common::ParallelFor<unsigned long, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}>(unsigned long, int, xgboost::common::Sched, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}) [clone ._omp_fn.3]>, data=data@entry=0x7fffffff28a0,
nthreads=nthreads@entry=16, flags=flags@entry=0, team=0x7fff4d1ac400, taskgroup=taskgroup@entry=0x0) at ../../../libgomp/team.c:858
#4 0x00007fff724f06e1 in GOMP_parallel (
fn=0x7fff72922630 <void xgboost::common::ParallelFor<unsigned long, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}>(unsigned long, int, xgboost::common::Sched, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}) [clone ._omp_fn.3]>, data=0x7fffffff28a0, num_threads=16, flags=0)
at ../../../libgomp/parallel.c:176
#5 0x00007fff7291dd7b in void xgboost::common::ParallelFor<unsigned long, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}>(unsigned long, int, xgboost::common::Sched, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}) ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#6 0x00007fff7291e051 in void xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int) ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#7 0x00007fff729315e6 in xgboost::predictor::CPUPredictor::PredictDMatrix(xgboost::DMatrix*, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int) const ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#8 0x00007fff727b92e8 in xgboost::gbm::GBTree::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#9 0x00007fff727fe745 in xgboost::LearnerImpl::Predict(std::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, unsigned int, bool, bool, bool, bool, bool) ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#10 0x00007fff7263e843 in XGBoosterPredictFromDMatrix () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#11 0x00007fff72b8a5ea in PhotonXGBoostEstimator::computeMva(float, float, float, float, float, float, float, float, float) const ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoEgammaPhotonIdentification.so
#12 0x00007fff72bf5a6f in PhotonXGBoostProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const ()
from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginRecoEgammaPhotonIdentificationPlugins.so
@smorovic FYI
the number of threads can be set in xgboost: https://xgboost.readthedocs.io/en/stable/parameter.html
still, I would prefer to limit the number of OMP threads (to 1!) globally
btw, apparently xgboost can run on GPU as well....
WAIT: did we already have this discussion once?
The culprit seems to be XGBoost, which was added as a dependency of RecoEgamma/PhotonIdentification in CMSSW_14_0_4 (#44500).
Let's tag @cms-sw/ml-l2 too
with the question: why XGBoost and not cmssw GBTree? For training anything can be used; for inference it is better to stick to the preferred (and optimized) implementation
with the question: why XGBoost and not cmssw GBTree?
Was discussed in https://github.com/cms-sw/cmssw/issues/44542#issuecomment-2068155042 and following
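For comparison, inference with the in-house GBRForest is a plain synchronous call in the caller's thread, with no internal thread pool; a rough sketch (assuming a forest that has already been loaded, e.g. from a file or the conditions database):
#include "CondFormats/GBRForest/interface/GBRForest.h"

// Sketch only: `forest` is assumed to be loaded already, and `vars`
// must hold the input variables in the order used at training time.
// GBRForest evaluates in the calling thread, so unlike XGBoost it
// cannot spawn an OpenMP worker pool behind cmsRun's back.
double evaluate(const GBRForest& forest, const float* vars) {
  return forest.GetResponse(vars);
}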
I suggest, as an immediate action, to limit the number of OMP threads to 1, at least at HLT!
I suggest as immediate action that we remove this crap from CMSSW.
the number of threads can be set in xgboost https://xgboost.readthedocs.io/en/stable/parameter.html
@smorovic is it something we do directly in the plugin added in https://github.com/cms-sw/cmssw/pull/44500 ? Can you have a look?
Note that if XGBoost uses multiple threads, the timing measurements will fail to account for the time it uses.
would this
diff --git a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
index 8b5cae21c48..4ebb8b6efd3 100644
--- a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
+++ b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
@@ -3,6 +3,9 @@
PhotonXGBoostEstimator::PhotonXGBoostEstimator(const edm::FileInPath& weightsFile, int best_ntree_limit) {
XGBoosterCreate(NULL, 0, &booster_);
+ // set number of threads to 1, see https://github.com/cms-sw/cmssw/issues/44923
+ XGBoosterSetParam(booster_, "nthread", "1");
+
XGBoosterLoadModel(booster_, weightsFile.fullPath().c_str());
best_ntree_limit_ = best_ntree_limit;
be enough?
I tried this.
so used gdb and info threads
and I see:
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff469b640 (LWP 1842094) "cmsRun" __GI__dl_debug_state () at dl-debug.c:74
* 2 Thread 0x7fffb8e9c700 (LWP 1843076) "cmsRun" 0x00007ffff551aab4 in read () from /lib64/libpthread.so.0
3 Thread 0x7fffb85f6700 (LWP 1843083) "cuda0000340000e" 0x00007ffff5268301 in poll () from /lib64/libc.so.6
4 Thread 0x7fff9ffff700 (LWP 1843086) "cuda-EvtHandlr" 0x00007ffff5268301 in poll () from /lib64/libc.so.6
5 Thread 0x7fff5f65b700 (LWP 1843284) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
6 Thread 0x7fff5ee5a700 (LWP 1843285) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
7 Thread 0x7fff5e659700 (LWP 1843286) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
8 Thread 0x7fff5de58700 (LWP 1843287) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
9 Thread 0x7fff5d657700 (LWP 1843288) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
10 Thread 0x7fff5ce56700 (LWP 1843289) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
11 Thread 0x7fff5c655700 (LWP 1843290) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
12 Thread 0x7fff5be54700 (LWP 1843291) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
13 Thread 0x7fff5b653700 (LWP 1843292) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
14 Thread 0x7fff5ae52700 (LWP 1843293) "cmsRun" 0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
15 Thread 0x7fff5a651700 (LWP 1843294) "cmsRun" 0x00007ffff551b180 in nanosleep () from /lib64/libpthread.so.0
16 Thread 0x7fff59e50700 (LWP 1843295) "cmsRun" 0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
17 Thread 0x7fff5964f700 (LWP 1843296) "cmsRun" 0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
18 Thread 0x7fff58e4e700 (LWP 1843297) "cmsRun" 0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
19 Thread 0x7fff4581a700 (LWP 1843407) "cmsRun" 0x00007ffff551745c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
20 Thread 0x7fff45019700 (LWP 1843408) "cmsRun" 0x00007ffff551745c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
21 Thread 0x7fff43fff700 (LWP 1843409) "cmsRun" 0x00007ffff551745c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
23 Thread 0x7fff298e1700 (LWP 1844089) "cmsRun" 0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
24 Thread 0x7fff06a87700 (LWP 1844090) "cmsRun" 0x00007ffff517d9bd in syscall () from /lib64/libc.so.6
Hello, this doesn't look good as apparently jobs use over 1000 threads. So either we should find a fix quickly or disable those paths (maybe by prescaling paths to 0 or removing from the menu).
A quick search suggests setting this environment variable: os.environ['OMP_NUM_THREADS'] = "1"
XGBoosterSetParam(booster_, "nthread", "1");
suggested by @mmusich could also work, I have to try.
I can patch the cmsRun start scripts of hltd to insert the environment variable, and I should find a machine to try this. More generally, this makes moving to GBTree, which I have not yet started working on, a priority.
I added this environment variable to fu-c2b05-42-01, which belongs to the ZDC partition and is thus out of the run, and manually attached it to one RUBU (c2a11-07-01); it will stay there until the run ends.
ps huH p 3206777 | wc -l
44
this was 1036 on other machines, so it seems to work in limiting the threads. I will check again in a while to make sure it doesn't grow.
I would suggest to deploy this script to all FUs for the time being, maybe after this fill (or if dump is late, then in the morning). @fwyzard, do you agree? Later I will try with XGBoosterSetParam.
would this
diff --git a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
index 8b5cae21c48..4ebb8b6efd3 100644
--- a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
+++ b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
@@ -3,6 +3,9 @@
 PhotonXGBoostEstimator::PhotonXGBoostEstimator(const edm::FileInPath& weightsFile, int best_ntree_limit) {
   XGBoosterCreate(NULL, 0, &booster_);
+  // set number of threads to 1, see https://github.com/cms-sw/cmssw/issues/44923
+  XGBoosterSetParam(booster_, "nthread", "1");
+
   XGBoosterLoadModel(booster_, weightsFile.fullPath().c_str());
   best_ntree_limit_ = best_ntree_limit;
be enough?
I also tried this, with fileListMode on an unused HLT node, and it has an equivalent effect to export OMP_NUM_THREADS=1: in both cases there are no "omp" threads running (otherwise there are many).
Thanks for the test, I opened https://github.com/cms-sw/cmssw/pull/45042. If it's viable we can backport it and create an urgent patch (tagging also @francescobrivio as ORM).
I would suggest to deploy this script to all FUs for the time being, maybe after this fill (or if dump is late, then in the morning). @fwyzard, do you agree?
@smorovic @fwyzard given building a patch release (even if urgent) might take a few days, especially considering it's the weekend, will you apply the patch on hltd today/tomorrow? (Even though IIUC there is no extreme urgency from the operational point of view)
We deployed the environment variable in the startup script a few minutes ago (from run 381188), we can keep it until the patch release.
Thanks a lot @smorovic!
Crashes observed in collisions Run 380399. Stack traces:
We tried to reproduce it with the following recipe, but it didn't reproduce.
Message #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.
@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI
Best regards, Thiago (for FOG)